Next Tokens Denoising for Speech Synthesis

Yanqing Liu*, Ruiqing Xue*, Chong Zhang, Yufei Liu, Gang Wang, Bohan Li, Yao Qian, Lei He, Shujie Liu, Sheng Zhao

Microsoft

Abstract

While diffusion and autoregressive (AR) models have significantly advanced generative modeling, each has distinct limitations. AR models, which rely on causal attention, cannot exploit future context and suffer from slow generation. Diffusion models, conversely, struggle with key-value (KV) caching. To overcome these challenges, we introduce Dragon-FM, a novel text-to-speech (TTS) design that unifies AR and flow-matching. The model processes 48 kHz audio codec tokens in chunks at a compact rate of 12.5 tokens per second. This design enables AR modeling across chunks, ensuring global coherence, while parallel flow-matching within chunks facilitates fast iterative denoising. Consequently, the proposed model can use the KV-cache across chunks and incorporate future context within each chunk. Furthermore, it bridges continuous and discrete feature modeling, demonstrating that continuous AR flow-matching can predict discrete tokens with finite scalar quantizers. The efficient codec and fast chunk-autoregressive architecture also make the proposed model particularly effective for generating extended content. Experiments on podcast datasets demonstrate its capability to efficiently generate high-quality zero-shot podcasts.
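The chunk-wise scheme described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The chunk length, latent dimension, number of Euler steps, number of quantizer levels, and the `toy_velocity` function are all hypothetical stand-ins for details the abstract does not give. Only the 12.5 per second rate comes from the text, and it is read here as one token position per frame. A real model would use a learned network with attention over all frames of a chunk and a true KV-cache over earlier chunks.

```python
import numpy as np

FRAME_RATE_HZ = 12.5  # token rate stated in the abstract
CHUNK_FRAMES = 4      # hypothetical chunk length
LATENT_DIM = 8        # hypothetical latent size per frame
NUM_STEPS = 8         # hypothetical number of Euler denoising steps
FSQ_LEVELS = 5        # hypothetical (odd) number of levels per dimension


def toy_velocity(x_t, t, context):
    """Hypothetical stand-in for the learned velocity network.

    It points from the current sample toward a target computed from the
    cached context, so Euler integration reaches that target at t = 1.
    """
    target = np.tanh(context.mean(axis=0, keepdims=True)) * np.ones_like(x_t)
    return (target - x_t) / max(1.0 - t, 1e-6)


def fsq_quantize(z, levels=FSQ_LEVELS):
    """Finite scalar quantization: bound each dimension, then round it
    to one of `levels` integer codes in [0, levels - 1]."""
    half = (levels - 1) / 2
    return (np.round(np.tanh(z) * half) + half).astype(int)


def generate(num_chunks, prompt, rng):
    """Autoregressive across chunks, parallel flow-matching within a chunk."""
    cache = [prompt]  # plays the role of the KV-cache: past chunks are reused as-is
    tokens = []
    for _ in range(num_chunks):
        context = np.concatenate(cache, axis=0)
        # All frames of the chunk start from noise and are denoised jointly.
        x = rng.standard_normal((CHUNK_FRAMES, LATENT_DIM))
        for step in range(NUM_STEPS):
            t = step / NUM_STEPS
            x = x + toy_velocity(x, t, context) / NUM_STEPS  # Euler step
        tokens.append(fsq_quantize(x))  # continuous output -> discrete tokens
        cache.append(x)                 # the finished chunk conditions later ones
    return np.concatenate(tokens, axis=0)


def frames_for_duration(seconds):
    """Number of token positions needed for `seconds` of audio."""
    return int(round(seconds * FRAME_RATE_HZ))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    prompt = rng.standard_normal((2, LATENT_DIM))  # hypothetical prompt frames
    out = generate(num_chunks=3, prompt=prompt, rng=rng)
    print(out.shape)                 # (12, 8)
    print(frames_for_duration(60))   # 750
```

Under this reading of the rate, one minute of audio corresponds to only 750 token positions, which gives a sense of why a low-rate codec helps with long-form content such as podcasts.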

Audio Samples


So, Sarah, remote work’s still a big deal in 2025. What’s your take on how it’s changed since the pandemic? Oh, it’s night and day. Back then, it was chaos—Zoom crashes, kids screaming in the background. Now, companies have slick hybrid setups, and tools like AI assistants make it seamless. Ok You think people are happier working from home full-time? Mixed bag. Some love the freedom—no commute, yoga at lunch. Others miss the office banter. Studies show productivity’s up, but loneliness is a real issue. Exactly I read about “Zoom fatigue” evolving into “virtual burnout.” Companies are trying virtual reality offices now. Ever tried one? I did a demo last month. Felt like a sci-fi movie—avatars in a 3D office. Cool, but I’m not sold. Nothing beats a real coffee break with colleagues. Fair point. But I wonder if VR offices could bridge that loneliness gap. What’s the data saying? Early studies from 2024 showed VR boosts engagement for some, but others find it gimmicky. It’s like, are we working or playing a video game?

AI’s everywhere now, Sarah. How’s it changed your daily life? Honestly, it’s wild. My AI assistant schedules my day, filters my emails, even suggests recipes based on what’s in my fridge. Sure What about you? How’s AI in your routine? I’ve got an AI that tracks my fitness goals and nags me to drink water. But sometimes it feels... too personal, you know? Oh, totally. Like when it knows your mood from your voice and suggests meditation. Creepy, but handy. Do you trust it with your data? Mostly. I opt out of the super invasive stuff, but I’m not naive—data’s the price of convenience. What about you? Ok You’re cautious, right? Yeah, I use privacy-first AI platforms when I can. But let’s be real, AI’s running everything—traffic, healthcare, even dating apps.

Mental health’s finally getting the spotlight it deserves, Sarah. What’s changed for you personally? I started using an AI therapy app. It’s not a human therapist, but it’s great for daily check-ins. You doing anything new? Sure Like what? I joined a virtual support group. It’s amazing how open people are online. Stigma’s fading, don’t you think? Definitely. Celebrities talking about therapy, apps normalizing mindfulness—it’s a cultural shift. But access is still a problem. Right. Therapy’s expensive, and not everyone trusts AI tools yet. What’s the data on AI therapy effectiveness? Studies from 2024 show AI therapy helps with anxiety and depression for 70% of users, but it’s not a cure-all. Humans are still better for complex cases. Ok Workplaces are stepping up, though—mental health days, wellness stipends. You seeing that? Huge win. My company offers free counseling now. But globally, mental health resources are still scarce in low-income areas. Exactly If we can democratize access, it’ll change lives. Any ideas on how to scale it?

The gig economy’s still huge, Sarah. You ever do a side hustle? Yeah, I freelance as a graphic designer on weekends. Pays well, but it’s exhausting. You? I drive for a rideshare app sometimes. It’s flexible, but the pay’s inconsistent. Ok What’s the state of gigs now? It’s massive—30% of workers are gigging in some form. Platforms like Upwork and Fiverr are booming, but benefits are still a sore spot. Yeah, no health insurance or paid leave is rough. I saw some platforms offering “gig benefits” now. Legit? Kinda. Some offer portable benefits—like savings plans you take between gigs. It’s a start, but it’s not like a full-time job’s security. Sure You think gig workers will unionize to demand more? Yep, there’s traction. Gig worker collectives are popping up, pushing for fair pay and protections. It’s slow, but it’s happening. Exactly The gig economy’s here to stay, but it needs to evolve to treat workers better.

Dreaming of self-sufficiency but only have a small plot of land? Homesteading in the Hectare proves that you don't need acres to live a more sustainable life. This podcast explores the practicalities and joys of small-scale homesteading, covering topics like intensive gardening, raising backyard chickens, composting, preserving food, and generating renewable energy on limited space. We'll share inspiring stories and actionable advice for maximizing your productive potential, no matter the size of your "homestead."

Languages are more than just words; they're intricate systems reflecting culture, history, and human thought. The Language Labyrinth takes listeners on a deep dive into the fascinating quirks and complexities of different languages around the globe. From unique grammatical structures to untranslatable words, historical linguistic shifts to the impact of technology on language, each episode will unravel a new linguistic mystery, revealing the incredible diversity of human communication.

In a world increasingly dominated by digital, a quiet revolution is happening: the Analog Revival. This podcast celebrates the tactile, the tangible, and the enduring appeal of analog technologies. We'll explore the resurgence of vinyl records, film photography, mechanical watches, and other physical objects, delving into why these "old" technologies continue to captivate us, their unique benefits, and the communities dedicated to keeping them alive.

Are myths and legends still relevant in our hyper-connected world? Mythic Modernity: Folklore in the 21st Century explores how ancient tales, superstitions, and urban legends continue to shape our culture, beliefs, and even our digital interactions. From creepypastas to internet memes, local cryptids to global conspiracy theories, this podcast examines the enduring power of storytelling and how traditional folklore adapts and thrives in contemporary society.

In today's dynamic economy, a "main" job is often just the beginning. The Art of the Side Hustle is your guide to turning passions into profits, exploring diverse and creative ways to earn extra income outside of traditional employment. We'll feature interviews with successful side hustlers, discuss practical strategies for balancing multiple ventures, and offer insights into leveraging skills and hobbies to build financial flexibility and pursue entrepreneurial dreams.

Beyond the Nobel laureates and groundbreaking discoveries, countless fascinating scientific concepts and forgotten figures deserve recognition. The Unsung Science delves into the lesser-known but equally captivating stories from the world of science. Each episode will uncover an overlooked theory, a brilliant but forgotten scientist, or a quirky experiment that shaped our understanding of the universe, proving that innovation and curiosity come in many forms.

Step into the surprising world of urban foraging, where delicious and nutritious wild foods thrive in unexpected city corners. Each episode of this podcast will guide listeners through identifying, harvesting, and preparing edible plants and fungi found in parks, vacant lots, and even along sidewalks. We'll explore the ethical considerations of urban foraging, discuss its historical roots, and share recipes that transform common "weeds" into gourmet delights, empowering city dwellers to connect with nature and rediscover their local landscape.

Ever wonder why your feed looks the way it does? The Glitch in the Algorithm uncovers the hidden forces shaping our digital lives. We'll explore the biases, unexpected quirks, and societal impacts of the algorithms that power everything from social media to self-driving cars.