A lot of progress has been made in AI tooling generally, yet audio models still seem to be lagging behind. Arguably, the most interesting work in this area comes from companies willing to open their doors. And the CEO of Soul, Zhang Lu, did just that with the recent release of a long-form, multi-speaker voice generation model.
It’s called SoulX-Podcast, and its performance is impressive to say the least. Just as notable is that this release could change how AI audio is built and shared, while offering a glimpse of what future digital communication might feel like once machines learn to sound more human. But beyond the headline results, what truly makes it special?
How SoulX-Podcast Pushes the Boundaries
Well, most current voice models are built to mimic a single voice reading a short script. Very few can produce conversations among different speakers, and whether they handle one speaker or several, they struggle to maintain consistent personalities and adjust emotional expression. To make matters worse, quality degrades further as the clip gets longer.
Soul Zhang Lu’s model is an attempt to tackle these weaknesses. It is also worth noting that SoulX-Podcast arrives at a time when AI-generated audio is pushing into mainstream culture: daily news podcasts, conversational explainers, and virtual hosts are already filling feeds. So there is a clear need, there are glaring gaps, and SoulX-Podcast aims to address them all in one fell swoop.
Built for complex audio scenarios involving multiple speakers and multiple turns, Soul Zhang Lu’s model produced more than sixty minutes of natural, extended dialogue in tests. The system tracked who was speaking, when speakers should switch, and how the emotional tone should shift with context. It handled humor, pauses, breaths, and even subtle non-speech sounds like throat clearing.
Multilingual Capabilities and Technical Design
What’s more, SoulX-Podcast also proved effective in zero-shot conditions. Given only a short audio clip, the model was able to clone a speaker’s timbre while adjusting rhythm and emphasis to match the ongoing conversation. In effect, it gives developers control over paralinguistic elements, the cues that create the sense of presence behind a natural exchange.
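To make that idea concrete, here is a purely illustrative sketch of how a zero-shot, multi-speaker request might be structured: each speaker is defined by a short reference clip plus its transcript, and dialogue turns carry paralinguistic cues inline. The field names, file names, and tag syntax below are assumptions for illustration, not SoulX-Podcast’s actual input format.

```python
# Hypothetical request structure for a zero-shot, multi-speaker dialogue model.
# Every field name, file name, and tag here is an illustrative assumption,
# not SoulX-Podcast's real API or markup.
podcast_request = {
    "speakers": {
        "host": {
            "reference_audio": "host_sample.wav",   # a few seconds suffices for timbre cloning
            "reference_text": "Welcome back to the show, everyone.",
        },
        "guest": {
            "reference_audio": "guest_sample.wav",
            "reference_text": "Thanks for having me, it's great to be here.",
        },
    },
    "dialogue": [
        {"speaker": "host",  "text": "So, tell us how the project started. <laugh>"},
        {"speaker": "guest", "text": "Honestly? <breath> It began as a weekend experiment."},
        {"speaker": "host",  "text": "And now it runs hour-long episodes on its own."},
    ],
}

def render_script(request):
    """Flatten the structured request into a tagged script a TTS stage could consume."""
    lines = [f"[{turn['speaker']}] {turn['text']}" for turn in request["dialogue"]]
    return "\n".join(lines)

print(render_script(podcast_request))
```

The point of a representation like this is that speaker identity, turn order, and paralinguistic cues all live in the prompt itself, which is what lets a single model keep voices and moods consistent across a long conversation.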
The model also handles multilingual and cross-dialect tasks. At this time, Soul Zhang Lu’s model supports Mandarin, English, Sichuanese, Cantonese, and additional spoken varieties. Multilingual support isn’t exclusive to SoulX-Podcast, but what makes this model stand out is its ability to generate dialect-specific speech even when the reference sample is provided in only one language.
In practice, this allows synthetic conversations to carry cultural nuance rather than defaulting to generic sound patterns. These capabilities come down to the model’s architecture: SoulX-Podcast uses an LLM for semantic token modeling and a flow-matching module for acoustic features. The system is built on the Qwen3-1.7B foundation, which supports the language reasoning and contextual understanding needed for extended dialogue to flow naturally.
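For readers curious what such a two-stage design looks like in practice, here is a minimal, self-contained sketch under stated assumptions: a toy autoregressive model stands in for the LLM stage and emits discrete semantic speech tokens, and a small flow-matching network turns them into mel-like acoustic frames by Euler-integrating a learned velocity field. All class names, dimensions, and step counts are illustrative; this is not SoulX-Podcast’s actual code.

```python
# Minimal sketch of the general two-stage design described above: an
# autoregressive model emits discrete "semantic" speech tokens, then a
# flow-matching network decodes them into acoustic features (mel-like frames)
# for a vocoder. Everything here is a simplified stand-in, not SoulX-Podcast.
import torch
import torch.nn as nn

class SemanticTokenLM(nn.Module):
    """Stand-in for the LLM stage (SoulX-Podcast builds on Qwen3-1.7B)."""
    def __init__(self, vocab_size=4096, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)   # toy backbone
        self.head = nn.Linear(dim, vocab_size)

    @torch.no_grad()
    def generate(self, prompt_tokens, steps=20):
        tokens = prompt_tokens
        for _ in range(steps):                           # greedy autoregressive decode
            h, _ = self.rnn(self.embed(tokens))
            next_tok = self.head(h[:, -1]).argmax(-1, keepdim=True)
            tokens = torch.cat([tokens, next_tok], dim=1)
        return tokens

class FlowMatchingDecoder(nn.Module):
    """Predicts a velocity field v(x_t, t, cond) over acoustic frames."""
    def __init__(self, n_mels=80, dim=256, vocab_size=4096):
        super().__init__()
        self.cond = nn.Embedding(vocab_size, dim)
        self.net = nn.Sequential(
            nn.Linear(n_mels + dim + 1, dim), nn.SiLU(), nn.Linear(dim, n_mels)
        )

    def velocity(self, x, t, sem_tokens):
        c = self.cond(sem_tokens)                        # per-token conditioning
        t_feat = t.expand(x.shape[0], x.shape[1], 1)     # broadcast time to each frame
        return self.net(torch.cat([x, c, t_feat], dim=-1))

    @torch.no_grad()
    def sample(self, sem_tokens, n_mels=80, n_steps=16):
        B, T = sem_tokens.shape
        x = torch.randn(B, T, n_mels)                    # start from Gaussian noise
        dt = 1.0 / n_steps
        for i in range(n_steps):                         # Euler ODE integration, t: 0 -> 1
            t = torch.full((1, 1, 1), i * dt)
            x = x + dt * self.velocity(x, t, sem_tokens)
        return x                                         # mel-like frames for a vocoder

# Toy end-to-end pass: prompt tokens -> semantic tokens -> acoustic frames.
lm, decoder = SemanticTokenLM(), FlowMatchingDecoder()
prompt = torch.randint(0, 4096, (1, 8))
semantic = lm.generate(prompt)
mels = decoder.sample(semantic)
print(semantic.shape, mels.shape)
```

The design choice the sketch mirrors is the separation of concerns: the language model decides what should be said and by whom across the conversation, while the flow-matching decoder is concerned only with how it should sound.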
Now, the effort that has gone into this model raises another question: what prompted Soul Zhang Lu to work on it in the first place? After all, Soul is a social networking platform, not a pure-play AI company.
Why Soul Is Investing in Voice-First AI
Yet, the social networking platform, which continues to hold the attention and loyalty of China’s Gen Z, has invested significantly in artificial intelligence. Unlike other companies that are exploring advanced voice synthesis exclusively for entertainment or productivity, Soul is using the technology as the foundation for its voice-first interaction approach.
The platform’s users rely not on appearance but on mutual interest to forge connections, and what better way to discuss these commonalities than by talking? So, voice has become one of the platform’s emotional anchors.
This was one of the primary reasons Soul Zhang Lu’s team began working on synthetic voice models. Earlier in the year, the company upgraded a full-duplex voice call model that allowed AI to manage natural conversation flow. The system continues to impress, as it can respond in real time, listen while speaking, and pick up on subtle changes in pacing or mood.
So, SoulX-Podcast can be seen as the next step in the platform’s journey toward human-like voice generation through AI. Although the new model is a technical breakthrough, Soul Zhang Lu chose to open-source it rather than lock it behind proprietary walls. Hence, the release can also be viewed as a commitment to collaborative exploration.
SoulX-Podcast is not perfect; no model can yet fully recreate human expressiveness. But it brings the field closer, and for Soul Zhang Lu’s platform, where voice is the foundation of social connection, that progress is practical rather than theoretical.