Advanced Studio: how each mode fits together

Four modes. One Studio. Most users only need Single. The rest unlock when your project gets longer, has multiple characters, or needs scene-by-scene control.

On this page

  • The four modes
  • Which one for what
  • Common workflows
  • SSML — the 4 basic tags
  • SSML deep reference (everything else)
  • All expressive styles catalog
  • Voice tiers (Standard / Neural / Multilingual / HD / DragonHD)
  • Output formats (MP3 / WAV / OGG)
  • Best voices by use case
  • Common mistakes & fixes
  • Pro tips
  • Plan limits table
  • Pronunciation control
  • Voice cloning quick start
  • FAQ

The four modes

What each mode actually does

🎙️
Single
Start here
Up to 25,000 chars per click

When to use: One block of text. Voice testing. Quick clips. Tweaking pitch / rate / style per generation.

The default Studio tab. Paste up to 25k chars, pick any voice, optionally tweak style (cheerful, sad, whispering, etc.), speed, and pitch. Each click of Generate produces one MP3 saved to your history. Good for previews, short narration, social media voiceovers, and assembling longer projects with Timeline.

Output: One MP3 per click. Add it to Timeline to chain with other clips.

💬
Dialogue
Multi-voice
Up to ~10 segments per render

When to use: Conversations between characters. Multi-speaker scenes. Anything with line-by-line voice changes.

Write your script in the format `Speaker: line`, one line per row. Assign each speaker a different voice. Hit Generate and Dialogue muxes the conversation server-side — you get back ONE merged MP3 with all speakers, pauses between lines (configurable per-line), and proper turn-taking. Saves you from stitching 30 separate MP3s by hand.

Output: One merged MP3 with all speakers. Add it to Timeline to combine with other scenes.
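
As a rough illustration of the `Speaker: line` format, the Python sketch below parses a script into (speaker, text) pairs and emits one `<voice>` block per line. The `VOICES` mapping, the function names, and the fixed 500 ms pause are assumptions made for this example, and the bare `<speak>` wrapper omits the namespace attributes a real document would need. It is not a description of how Dialogue mode works internally.

```python
from xml.sax.saxutils import escape

# Hypothetical speaker-to-voice assignment; the voice names appear elsewhere on this page.
VOICES = {"Anna": "en-US-AriaNeural", "Ben": "en-US-GuyNeural"}

def parse_script(script: str) -> list[tuple[str, str]]:
    """Split 'Speaker: line' rows into (speaker, text) pairs, skipping blank rows."""
    turns = []
    for row in script.splitlines():
        if not row.strip():
            continue
        speaker, sep, text = row.partition(":")
        if not sep:
            raise ValueError(f"Row is not in 'Speaker: line' format: {row!r}")
        turns.append((speaker.strip(), text.strip()))
    return turns

def to_ssml(turns: list[tuple[str, str]], pause_ms: int = 500) -> str:
    """Emit one <voice> block per turn, with a pause after each line."""
    blocks = [
        f'<voice name="{VOICES[speaker]}">{escape(text)}'
        f'<break time="{pause_ms}ms"/></voice>'
        for speaker, text in turns
    ]
    return "<speak>" + "".join(blocks) + "</speak>"

turns = parse_script("Anna: Did you lock the door?\nBen: I thought you did.")
ssml = to_ssml(turns)
print(ssml)
```

Only the first colon on a row separates speaker from text, so a line such as `Anna: Meet me at 3:30` keeps its time intact.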

⏱️
Timeline
Editor
Up to 30 clips per merge

When to use: Chaining multiple Single or Dialogue outputs into one continuous file with custom pauses between clips.

Timeline holds clips you already generated. After each Single or Dialogue render, click 'Add to Timeline' below the audio player. Drag clips to reorder, set pause-after for each (0–10s), preview individual clips, and merge into one continuous MP3 when ready. Useful for multi-scene projects: intro narration → dialogue scene 1 → narrator break → dialogue scene 2 → outro.

Output: One continuous MP3 with all clips and pauses baked in.

📚
Audiobook
Batch · Creator
Up to 2,000,000 chars (~24h audio) per job

When to use: Full novels. Long chapters. Course modules. Anything past Single's 25k-char cap that doesn't need per-line voice swapping.

Paste your entire chapter or novel into the Script box, give it a title, pick a voice and output format. The job runs server-side on Azure's batch synthesis API (typically half the audio length in wall-clock time — 1 hour of audio = ~30 min job). When done, you get a ZIP of per-chunk MP3s + SRT subtitles. Auto-detects SSML markup, so you can mix in <break/>, <prosody>, <emphasis>, and <mstts:express-as style="..."> tags per paragraph.

Output: ZIP of MP3s + SRTs. Available for download from the Recent batches panel.
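
The figures quoted above allow a back-of-the-envelope estimate of audio length and job time. The sketch below uses only those two ratios (2,000,000 chars for about 24 hours of audio, and a job taking about half the audio length); the function names are made up for this example, and real durations depend on voice, rate, and content.

```python
# Rough estimates derived only from the figures quoted above:
# 2,000,000 chars is about 24 hours of audio, and a job takes about half the audio length.
CHARS_PER_AUDIO_HOUR = 2_000_000 / 24   # roughly 83,333 chars per hour of audio
JOB_TIME_FACTOR = 0.5                   # "1 hour of audio = ~30 min job"

def estimate_audio_hours(char_count: int) -> float:
    """Approximate hours of finished audio for a given character count."""
    return char_count / CHARS_PER_AUDIO_HOUR

def estimate_job_minutes(char_count: int) -> float:
    """Approximate wall-clock minutes the batch job will take."""
    return estimate_audio_hours(char_count) * 60 * JOB_TIME_FACTOR

# A 250,000-character manuscript: about 3 hours of audio, about 90 minutes of job time.
print(round(estimate_audio_hours(250_000), 1))
print(round(estimate_job_minutes(250_000)))
```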

At a glance

Which one for what

Need | Use | Why
Test a voice on real text | Single | Per-click control, full style/pitch/rate sliders.
One narrator, 50+ pages | Audiobook | Single's 25K cap is the wall. Audiobook does up to 2M.
Multiple characters speaking | Dialogue | Server-side mux into one MP3 with proper pauses.
Chain Single clips with pauses | Timeline | Editor-style ordering + merge.
Chain Dialogue scenes with intros | Dialogue → Timeline | Render scenes in Dialogue, chain in Timeline.
Per-paragraph emotion (sad / cheerful / whispering) | Audiobook (SSML) | Wrap paragraphs in <mstts:express-as style="...">.
Custom pauses between paragraphs | Audiobook (SSML) | Use <break time="1.5s"/>.
Programmatic generation from scripts | Developer API | See /developers.

Common workflows

"I want to make…"

I have a single narrator and a 50-page chapter.

  1. Open Audiobook mode.
  2. Paste your chapter text. Add a title.
  3. Pick an HD voice (look for the ✨ filter — they read with the most natural cadence at audiobook length).
  4. Optionally add SSML — `<break time="1.5s"/>` between paragraphs, `<emphasis>` for key terms.
  5. Submit. Come back in ~30 minutes for the ZIP.

I have a story with 3 characters who all speak.

  1. Write each multi-character SCENE in Dialogue mode.
  2. Assign each speaker a distinct voice.
  3. Render the scene. Click 'Add to Timeline' below the audio player.
  4. Write narrative bridging text in Single mode and add each to Timeline too.
  5. Switch to Timeline, reorder clips, set pauses (e.g. 1.5s between scenes), then Merge.
  6. Download the single merged MP3.

I want emotional variation across paragraphs (sad scene then cheerful scene).

  1. Use Audiobook mode with SSML inline.
  2. Wrap the sad paragraph: `<mstts:express-as style="sad">your text</mstts:express-as>`.
  3. Wrap the cheerful one: `<mstts:express-as style="cheerful">your text</mstts:express-as>`.
  4. Available styles: cheerful, sad, excited, calm, whispering, angry, hopeful, friendly, terrified, shouting, unfriendly.
  5. Tags must be lowercase. Submit and download the ZIP when done.
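
If your paragraphs and their moods live in a script or spreadsheet, the wrapping in steps 2 and 3 can be generated rather than typed. The following is a minimal sketch, assuming a list of (text, style) pairs; `styled_paragraph`, `build_script`, and the 1.5 s default pause are illustrative choices, not part of FreeTTS.

```python
from xml.sax.saxutils import escape

# Styles listed in step 4 above.
STYLES = {"cheerful", "sad", "excited", "calm", "whispering", "angry",
          "hopeful", "friendly", "terrified", "shouting", "unfriendly"}

def styled_paragraph(text: str, style: str) -> str:
    """Wrap one paragraph in a lowercase <mstts:express-as> tag."""
    if style not in STYLES:
        raise ValueError(f"Unknown style: {style!r}")
    return f'<mstts:express-as style="{style}">{escape(text)}</mstts:express-as>'

def build_script(paragraphs: list[tuple[str, str]], pause: str = "1.5s") -> str:
    """Join styled paragraphs with a <break/> between them."""
    separator = f'\n<break time="{pause}"/>\n'
    return separator.join(styled_paragraph(text, style) for text, style in paragraphs)

script = build_script([
    ("The house was empty when she returned.", "sad"),
    ("Then the phone rang, and everything changed.", "cheerful"),
])
print(script)
```

The printed text is what you would paste into the Script box.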

I want to test 10 voices to find my favorite for a project.

  1. Open Single mode.
  2. Open the voice picker. Click the play button (▶) next to each voice — those previews don't count toward your monthly quota.
  3. Once you've narrowed to 2–3 candidates, paste a real paragraph from your project into Single and Generate. Now you'll hear how each handles YOUR text at your target pace.
  4. Tweak speed/pitch sliders to taste. Save your favorite via 'Save voice settings' for quick reuse later.

SSML quick reference

The 4 tags that cover 95% of use cases

SSML is the markup language inside the text input that controls pauses, emphasis, pitch, and emotion. It works in Single, Dialogue, and Audiobook. All tag names must be lowercase.

<break time="1s"/> | Insert a pause. Use 2s, 500ms, etc.
<emphasis level="moderate">word</emphasis> | Stress a word or phrase. Levels: reduced / moderate / strong.
<prosody rate="slow" pitch="-2st">text</prosody> | Tempo and pitch. Rates: x-slow / slow / medium / fast / x-fast. Pitch in semitones (+2st, -2st).
<mstts:express-as style="cheerful">text</mstts:express-as> | Emotional style. Options: cheerful, sad, excited, calm, whispering, angry, hopeful, friendly, terrified, shouting, unfriendly.

SSML deep reference

Everything else SSML can do

The cheatsheet above handles most projects. Below is the full surface FreeTTS supports — broken into categories so you can find what you need. All examples are copy-pastable.

Pauses & silence

Two ways to insert silence. `<break>` is the W3C standard. `<mstts:silence>` is Azure-only but gives you finer placement control.

break (time)
<break time="2s"/>

Hard pause for the exact duration. Units: ms (milliseconds) or s (seconds). Cap is 10s; longer values are clamped.

break (strength)
<break strength="strong"/>

Semantic pause length. Options (shortest → longest): "none", "x-weak", "weak", "medium", "strong", "x-strong". Use strength when you want pauses that scale with the voice's natural cadence; use time when you need exact timing.

mstts:silence (leading)
<mstts:silence type="leading" value="500ms"/>

Adds silence at the very start of the audio. Options for `type`: "leading", "tailing", "sentenceboundary", "leading-exact", "tailing-exact", "sentenceboundary-exact". The -exact variants override Azure's built-in silences instead of adding to them.

mstts:silence (between sentences)
<mstts:silence type="sentenceboundary" value="800ms"/>

Adds 800ms between every sentence in the document. Great for slow audiobook pacing or training content. Place this once at the top of your SSML, not per sentence.

Prosody — rate, pitch, volume

All three attributes accept named keywords, relative percentages, or absolute values. Combine them in a single `<prosody>` tag.

prosody (rate, named)
<prosody rate="x-slow">slowed text</prosody>

Named rates: "x-slow" (0.5x), "slow" (0.7x), "medium" (1.0x), "fast" (1.3x), "x-fast" (1.5x). Easiest to read in scripts.

prosody (rate, percent)
<prosody rate="85%">precisely controlled</prosody>

Absolute percentage (50%-200%). Use when named rates don't hit the exact pacing you want. 85% is a popular audiobook setting.

prosody (rate, relative)
<prosody rate="+20%">slightly faster than ambient</prosody>

Relative shift from the surrounding rate. Useful when you want one paragraph to be a bit faster than the rest without committing to an absolute speed.

prosody (pitch, semitones)
<prosody pitch="-2st">lower pitch</prosody>

Pitch shift in semitones (st). Range roughly -12st to +12st before voices distort. -2st makes most voices sound a touch more serious; +2st adds energy. Also accepts "+2Hz", "x-low", "low", "medium", "high", "x-high", or percentages.

prosody (volume)
<prosody volume="loud">SHOUTING WITHOUT CAPS</prosody>

Named volumes: "silent", "x-soft", "soft", "medium", "loud", "x-loud". Or use decibels (e.g. `volume="+6dB"`). Lets you bake quiet/loud passages into a single output without an audio editor.

prosody (combined)
<prosody rate="slow" pitch="-1st" volume="soft">whispered confession</prosody>

All three attributes work in one tag. Combine for distinctive effects — slow + low + soft = intimate; fast + high + loud = panic.
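
When prosody settings are chosen programmatically, a small helper keeps the tag well-formed. This is a sketch under the assumption that you pass values already valid for the engine (it does not check ranges); the `prosody` function name is hypothetical.

```python
from typing import Optional
from xml.sax.saxutils import escape, quoteattr

def prosody(text: str, rate: Optional[str] = None, pitch: Optional[str] = None,
            volume: Optional[str] = None) -> str:
    """Wrap text in one <prosody> tag carrying only the attributes that were given."""
    attrs = {"rate": rate, "pitch": pitch, "volume": volume}
    rendered = "".join(
        f" {name}={quoteattr(value)}" for name, value in attrs.items() if value is not None
    )
    if not rendered:
        return escape(text)  # nothing to change, so no tag is needed
    return f"<prosody{rendered}>{escape(text)}</prosody>"

print(prosody("whispered confession", rate="slow", pitch="-1st", volume="soft"))
```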

Pronunciation control

Three ways to fix mispronunciations. Phoneme is the most precise; sub is the easiest; pronunciation dictionary (PRO) is permanent across all your generations.

sub (alias)
<sub alias="Doctor">Dr.</sub> Smith

Replaces the displayed text with the alias for speech only. "Dr." gets spoken as "Doctor". Useful for abbreviations the engine misreads (Mr, Mrs, etc.), medical/legal shorthand, or initialisms.

phoneme (IPA)
<phoneme alphabet="ipa" ph="kəˈmɛrəθ">Camarath</phoneme>

Force-pronounce a word using International Phonetic Alphabet symbols. Best for proper nouns, fantasy names, technical terms. The displayed text is still shown in transcripts; only the audio uses your phoneme spelling.

phoneme (SAPI)
<phoneme alphabet="sapi" ph="t aw m ax t ow">tomato</phoneme>

Microsoft's own phonetic alphabet. Easier to read than IPA if you're not a linguist; only works with English voices. Each phoneme is space-separated.

phoneme (UPS)
<phoneme alphabet="x-microsoft-ups" ph="T1 OW0 M EY1 T OW0">tomato</phoneme>

Universal Phone Set — Microsoft's cross-language phoneme system. Use the `x-microsoft-ups` alphabet name. Includes optional stress markers (1 = primary, 2 = secondary, 0 = none).

lexicon (PRO)
<lexicon uri="https://your.cdn/pronunciations.xml"/>

Loads an external pronunciation dictionary (W3C PLS XML format). PRO/Creator users typically use the in-app Pronunciation Dictionary instead — same effect, no external hosting needed. Set once at the top of your SSML.

Say-as — interpret strings literally

Forces the engine to read text as a specific kind of data. Without `<say-as>`, the engine guesses ("1234" might be read as "one thousand two hundred thirty-four" or "one two three four" depending on context).

characters / spell-out
<say-as interpret-as="characters">NASA</say-as>

Reads each character individually: "N-A-S-A". Use for initialisms you want spelled letter by letter. "spell-out" is a synonym.

cardinal
<say-as interpret-as="cardinal">12345</say-as>

Reads numbers as cardinal numbers: "twelve thousand three hundred forty-five". Use when context might otherwise force digit-by-digit reading.

ordinal
The <say-as interpret-as="ordinal">3</say-as> rule

Reads as ordinal numbers: "third". Without this, "3" might be read as "three".

digits / number_digit
<say-as interpret-as="digits">2024</say-as>

Reads each digit separately: "two zero two four". Useful for years pronounced digit-by-digit, account IDs, etc.

fraction
<say-as interpret-as="fraction">1/2</say-as>

Reads as a fraction: "one-half" (English) or the locale equivalent. Works for both "1/2" and "1 1/2" formats.

date (mdy / dmy / ymd)
<say-as interpret-as="date" format="dmy">12-03-2026</say-as>

Reads a date with explicit format. Format strings: "mdy", "dmy", "ymd", "md", "dm", "ym", "my", "d", "m", "y", "yyyymmdd". Engine will say "the twelfth of March, two thousand twenty-six". Avoids the US-vs-rest-of-world confusion entirely.

time
<say-as interpret-as="time" format="hms24">15:30:00</say-as>

Reads a time. Formats: "hms12" (am/pm), "hms24" (24-hour), "ms" (minutes:seconds). The above reads as "fifteen thirty hours" / "three thirty PM" depending on locale.

telephone
<say-as interpret-as="telephone">+1-555-867-5309</say-as>

Reads phone numbers naturally with country code grouping. Strips dashes and spaces, reads digits one at a time in conventional groupings.

currency
<say-as interpret-as="currency" language="en-US">$42.50</say-as>

Reads currency with the unit name: "forty-two dollars and fifty cents". Optional `language` attribute helps when the currency symbol is ambiguous.

address
<say-as interpret-as="address">221B Baker St</say-as>

Reads street addresses with the right pacing — number before street, expanding common abbreviations (St → Street, Ave → Avenue).

Language switching mid-document

Use `<lang>` to read a foreign phrase in its native pronunciation without switching voices entirely. The voice has to support the target language for this to sound right.

lang
She said <lang xml:lang="fr-FR">bonjour mon ami</lang> with a smile.

Reads the wrapped text in the specified language using the current voice. Multilingual voices (look for "Multilingual" in the voice name) handle ~12 languages each — best for code-switching scripts. Standard voices may fall back to phonetic approximation.

voice (nested)
<voice name="en-US-AriaNeural">Hello.</voice> <voice name="es-ES-ElviraNeural">Hola.</voice>

Swap voices mid-document. Each `<voice>` block can be a completely different voice in a different language. Useful when you need an actual native speaker for the foreign passage instead of a multilingual voice's approximation. This is what Dialogue mode does for you automatically.

Expressive styles (mstts)

`<mstts:express-as>` is Azure's emotional style tag. Not every voice supports every style — see the "All expressive styles" section below for the complete catalog. Combine with `styledegree` (0.01–2.0) and `role` (for certain Chinese voices).

express-as (basic)
<mstts:express-as style="hopeful">tomorrow will be different</mstts:express-as>

Wraps text in an emotional style. The voice must support the style — Aria, Jenny, Davis (US English) have the widest catalogs; British and other locales have fewer.

express-as (style degree)
<mstts:express-as style="excited" styledegree="2">she's here!</mstts:express-as>

Style intensity. 0.01 = barely-there hint of the emotion; 1.0 (default) = normal; 2.0 = exaggerated. Use higher degrees for dramatic moments, lower for subtle inflection.

express-as (role)
<mstts:express-as style="default" role="YoungAdultFemale">line of dialogue</mstts:express-as>

Roleplay attribute — makes the voice imitate a different speaker type. Options: "Boy", "Girl", "YoungAdultFemale", "YoungAdultMale", "OlderAdultFemale", "OlderAdultMale", "SeniorFemale", "SeniorMale". Currently only some Chinese voices (zh-CN-XiaomoNeural, zh-CN-XiaoxuanNeural, zh-CN-YunxiNeural, zh-CN-YunyeNeural) support this. Pairs well with Dialogue mode for character variety from a single voice.

Structure & markers

Optional but useful for long-form audiobooks, captions, and engines parsing your output.

p (paragraph)
<p>Paragraph one. Two sentences.</p> <p>Paragraph two.</p>

Explicit paragraph boundary. The engine adds a natural pause between `<p>` blocks (slightly longer than between sentences). Useful when your text has weird line breaks the engine would otherwise misinterpret.

s (sentence)
<s>This is a sentence.</s> <s>This is another.</s>

Explicit sentence boundary. Forces the engine to treat the wrapped text as a complete sentence even without terminal punctuation — useful for fragments like "Yes." that might otherwise blend into the next.

bookmark
And then <bookmark mark="chapter-2-start"/> she opened the door.

Inserts a named position marker. Doesn't affect audio but appears in the boundary metadata exposed via /api/v1/tts. Useful when you're building a player that needs to jump to specific scenes.

Background audio (mstts)

Mix a background audio track (music, ambience, white noise) under the synthesized speech. Audio file must be publicly accessible HTTPS.

mstts:backgroundaudio
<mstts:backgroundaudio src="https://example.com/music.mp3" volume="0.4" fadein="2000" fadeout="3000"/>

Plays the source audio under the entire speech track. `volume` is 0.0–1.0 (0.4 = 40% volume — keeps speech intelligible). `fadein`/`fadeout` in milliseconds. Place inside `<speak>` but outside `<voice>`. Only allowed once per document.

Heads up on SSML rules. Three things trip up almost everyone: (1) tag names are lowercase — <Break/> won't work, only <break/>. (2) &, <, and > in your text must be escaped (&amp;, &lt;, &gt;) when inside SSML — the Studio escapes them for you in plain-text mode but not once you start adding tags. (3) DragonHD voices (the :DragonHDLatestNeural ones) reject <mstts:express-as> tags — they read emotion from context instead.

All expressive styles

Every mstts:express-as style FreeTTS supports

Not every voice supports every style. The Studio's style chip picker (in Single mode) shows what your selected voice supports. Aria, Jenny, and the Chinese voices Xiaoxiao/Yunxi have the widest catalogs. British and most localized voices typically support just cheerful + sad.

Emotion

cheerful: Bright, upbeat. Default pick for happy moments and positive announcements.
sad: Slower, lower pitch, downward inflection. Use for grief, regret, somber news.
angry: Harder consonants, elevated volume. Adversarial dialogue, frustration.
excited: Faster, higher energy than cheerful. Big-reveal moments, action scenes.
fearful: Trembling, slightly higher pitch, breathy. Suspense and horror.
terrified: The extreme of fearful, with short bursts, rapid breaths, top of the register.
hopeful: Warm, slightly tentative, rising inflection. Bridging sad → cheerful.
disgruntled: Annoyed but restrained. Sarcasm, minor complaints.
embarrassed: Hushed, slightly halting. Apologies, awkward moments.
serious: Steady, even-paced, low expression. News commentary, formal narration.
calm: Smooth, lower energy than serious. Meditation, instructions, ASMR-adjacent.

Voice quality

whispering: Breathy, very low volume. Intimate scenes, secrets, ASMR. Best paired with prosody volume='soft'.
shouting: Maximum volume + clipped delivery. Battle scenes, distant calls. Use sparingly; listening fatigue is real.
gentle: Soft, mid-pitch, evenly paced. Children's storytelling, comforting.
lyrical: Slight musical inflection. Poetry, song-like delivery.

Conversational

friendly: Warm and approachable. Default for tutorials, onboarding, support content.
unfriendly: Cold, dismissive. Antagonist dialogue, hostile NPC.
empathetic: Soft, slow, validating tone. Customer-service apologies, sensitive subjects.
chat: Casual, mid-energy. Podcast-style discussion, informal updates.
assistant: Polite, helpful, neutral. Built for AI assistants and voice UIs.
customerservice: Professional friendly. Phone-system-style helpful answers.

Narration

narration-professional: Clean audiobook-style narration. The default 'just read it well' choice for long-form.
narration-relaxed: Looser pacing than professional. Personal essays, memoir.
documentary-narration: Authoritative, evenly paced. Educational video voiceovers, science explainers.
newscast: Generic newscaster tone. Use the more specific casual/formal variants if available on your voice.
newscast-casual: Approachable newscaster for feature segments, morning shows.
newscast-formal: Traditional newscaster gravitas for breaking news, formal reports.
poetry-reading: Slower pacing with deliberate emphasis on cadence. Verse, prose poetry.
advertisement-upbeat: Punchy, energetic commercial reading. Product launches, promos.
sports-commentary: Faster pace, dynamic stress. Sports calls, live event narration.
sports-commentary-excited: Sports-commentary cranked up, for game-winning moments.

Voice tiers

Standard vs Neural vs Multilingual vs HD vs DragonHD

FreeTTS exposes 5 voice tiers under one picker. Knowing which tier a voice belongs to matters because they have different SSML quirks and different ideal use cases.

Standard

Free

Legacy concatenative voices. Recognizable robotic edge. Mostly retired in favor of Neural — kept for niche use cases.

When to use: Almost never. The free Neural tier sounds dramatically better at the same cost.

SSML: Full SSML.

Neural

Free + PRO

Standard neural voices — 400+ across 75+ languages. Natural intonation, expressive on supported styles, fast generation. The workhorse tier.

When to use: Default choice for everything. Tutorials, podcasts, video voiceover, casual audiobooks.

SSML: Full SSML including `<mstts:express-as>` for supported styles.

Multilingual

PRO

Single voice that speaks 12+ languages with the same voice identity. Best for code-switching content — your French phrase doesn't suddenly become someone else.

When to use: Bilingual narration, language learning content, scripts with embedded foreign phrases.

SSML: Full SSML. Pairs well with `<lang xml:lang="...">` for mid-sentence language switches.

HD

PRO

High-definition voices. Higher audio bitrate and noticeably more natural prosody. Look for "HDLatest" in the voice name.

When to use: Audiobook narration, professional voiceover, anywhere quality matters more than generation speed.

SSML: Full SSML support.

DragonHD

PRO · Premium

Newest-generation Microsoft voices (released 2025). Reads context naturally — automatically expresses emotion from the text without explicit style tags. Look for ":DragonHDLatestNeural" suffix.

When to use: Highest-quality narration when you want emotion to come from the writing itself, not from markup.

SSML: Most SSML works EXCEPT `<mstts:express-as>` style tags (DragonHD rejects them and reads emotion from context instead). Use other tags like prosody and break normally.

Output formats

MP3 vs WAV vs OGG — which to pick

Format | Specs | Size | When to use
MP3 | 160 kbps mono · 24 kHz | ~1.2 MB per minute | Default. Universal playback, small files, good quality. Lossy compression: final-mix-then-export workflows lose a touch of quality on each re-encode.
WAV | 16-bit PCM · 24 kHz mono | ~2.9 MB per minute | Professional editing in Audition, Pro Tools, Reaper. Mastering. Re-encoding to any other format without quality loss. Uncompressed. Ships at studio-friendly specs. Use this if the file goes into a DAW.
OGG | Opus · 24 kHz | ~1.0 MB per minute | Web playback in audio elements, where Safari support is not a concern. Smallest files at equivalent quality to MP3. Open-source codec. Modern browsers all support it; older devices may not.
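
The MP3 and WAV sizes follow directly from the specs in the table. The sketch below shows the arithmetic, taking 1 MB as 1,000,000 bytes; the function names are illustrative, and container overhead (headers, metadata) is ignored.

```python
def mb_per_minute_from_bitrate(kbps: float) -> float:
    """Compressed audio: kilobits per second -> megabytes per minute."""
    bytes_per_second = kbps * 1000 / 8
    return bytes_per_second * 60 / 1_000_000

def mb_per_minute_pcm(sample_rate_hz: int, bits_per_sample: int, channels: int) -> float:
    """Uncompressed PCM: samples per second times bytes per sample times channels."""
    bytes_per_second = sample_rate_hz * (bits_per_sample / 8) * channels
    return bytes_per_second * 60 / 1_000_000

print(mb_per_minute_from_bitrate(160))    # MP3 at 160 kbps -> 1.2
print(mb_per_minute_pcm(24_000, 16, 1))   # WAV, 16-bit mono at 24 kHz -> 2.88
```

Multiply by the expected running time to budget storage, for example about 173 MB for an hour of WAV.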

Voice recommendations

Best voices by use case

Tested-and-recommended picks for common projects. Preview each in the voice picker before committing — taste is individual.

Long-form audiobook (single narrator)

  • en-US-Andrew:DragonHDLatestNeural: Top pick. DragonHD reads emotion from context, which is exactly what you want for fiction.
  • en-US-AvaMultilingualNeural: Use for mixed-language books. Same voice identity across English, Spanish, French, etc.
  • en-US-JennyNeural: Solid default neural. Wide expressive style support if you want manual control.
  • en-GB-LibbyNeural: British narration for UK-set or period-set fiction.

News, current events, factual content

  • en-US-AriaNeural with newscast-formal: Authority + clarity. Used by many news automations.
  • en-US-BrandonNeural: Male newscaster cadence.
  • en-US-DavisNeural: Mid-energy, factual.

Podcast / conversational

  • en-US-GuyNeural with style='chat': Casual, mid-energy. Sounds like a real podcaster.
  • en-US-JennyMultilingualNeural: Approachable warm female voice. Multilingual flexibility.
  • en-US-Christopher:DragonHDLatestNeural: DragonHD male. Natural inflection without style hacking.

Educational / explainer

  • en-US-AriaNeural with documentary-narration: Standard for science explainers and tutorials.
  • en-US-Emma:DragonHDLatestNeural: DragonHD female. Clean, even-paced explainer voice.

Children's content / storytelling

  • en-US-AnaNeural: Younger-sounding voice. Designed for kid-friendly content.
  • en-US-JennyNeural with style='gentle': Warm storytelling tone.

Multi-voice dialogue scenes

  • Mix any two contrasting voices in Dialogue mode: Pick voices with distinctly different pitches and accents; a young female + an older male reads more clearly than two similar voices.
  • Add 'role' attribute for Chinese voices: zh-CN-XiaomoNeural, zh-CN-XiaoxuanNeural, etc. can switch between Boy/Girl/YoungAdult/OlderAdult roles for variety from one voice.

Gotchas

Common mistakes and fixes

Uppercase SSML tag names

✗ Wrong: <Break time="1s"/>
✓ Right: <break time="1s"/>

SSML is case-sensitive. Capitalized tags don't parse and get spoken aloud as text — or fail the whole chunk.
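
A quick pre-flight check can catch capitalized tag names before you submit. This is a minimal regex-based sketch, not a full XML parser; `uppercase_tags` is a made-up helper name, and it only inspects tag names (attribute values such as role="YoungAdultFemale" are left alone).

```python
import re

# Matches the name of an opening or closing tag, e.g. "Break" in <Break .../> or </Break>.
TAG_NAME = re.compile(r"</?([A-Za-z][\w:.-]*)")

def uppercase_tags(ssml: str) -> list[str]:
    """Return every tag name that is not fully lowercase."""
    return [name for name in TAG_NAME.findall(ssml) if name != name.lower()]

print(uppercase_tags('Wait.<Break time="1s"/> Then go.'))
print(uppercase_tags('Wait.<break time="1s"/> Then go.'))
```

An empty list means no capitalized tag names were found.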

Unescaped `&` in text

✗ Wrong: Tom & Jerry decided to go.
✓ Right: Tom &amp; Jerry decided to go.

Inside SSML, `&` starts an XML entity. The fix is `&amp;` (or `&` outside any SSML block in PlainText mode). Same goes for `<` (use `&lt;`) and `>` (use `&gt;`) inside SSML.
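
If you assemble SSML from raw text in a script, escaping can be applied to the text portions before any tags are added. The sketch below uses Python's standard `xml.sax.saxutils.escape`, which replaces `&`, `<`, and `>`; the `ssml_text` wrapper name is just for illustration.

```python
from xml.sax.saxutils import escape

def ssml_text(raw: str) -> str:
    """Escape &, < and > so raw text is safe to place between SSML tags."""
    return escape(raw)

# Escape the text first, then append tags; escaping the finished line would break the tags.
line = ssml_text("Tom & Jerry decided to go.") + '<break time="1s"/>'
print(line)
```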

`<mstts:express-as>` on a DragonHD voice

✗ Wrong: <mstts:express-as style="cheerful">…</mstts:express-as> (on en-US-Andrew:DragonHDLatestNeural)
✓ Right: Just write expressive text. DragonHD reads emotion from context.

DragonHD voices explicitly reject `mstts:express-as` tags. Strip them out or switch to a Neural voice if you need explicit style control.
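
For scripts that were written with style tags and later moved to a DragonHD voice, the tags can be stripped mechanically while keeping the wrapped text. A minimal sketch, assuming DragonHD voices can be recognized by "DragonHD" in the voice name as described above; `prepare_for_voice` is a hypothetical helper.

```python
import re

# Opening and closing <mstts:express-as ...> tags, with any attributes.
EXPRESS_AS = re.compile(r"</?mstts:express-as\b[^>]*>")

def prepare_for_voice(ssml: str, voice_name: str) -> str:
    """Remove express-as wrappers (keeping their text) when the voice is DragonHD."""
    if "DragonHD" in voice_name:
        return EXPRESS_AS.sub("", ssml)
    return ssml

styled = '<mstts:express-as style="cheerful">Great news!</mstts:express-as>'
print(prepare_for_voice(styled, "en-US-Andrew:DragonHDLatestNeural"))
print(prepare_for_voice(styled, "en-US-JennyNeural"))
```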

Using a style the voice doesn't support

✗ Wrong: <mstts:express-as style="poetry-reading">…</mstts:express-as> (on en-US-GuyNeural)
✓ Right: Check the per-voice supported styles. Aria has the most; British voices typically only cheerful + sad.

Unsupported styles are silently ignored — the audio plays neutral instead of poetic. Use the Studio's chip picker to see which styles your selected voice supports.

Forgetting the mstts namespace

✗ Wrong: <speak> with no mstts namespace declaration (in raw SSML files outside the Studio)
✓ Right: xmlns:mstts="http://www.w3.org/2001/mstts" on <speak>

If you're writing raw SSML for the API, the `<speak>` root must declare the mstts namespace before any `mstts:` tag will parse. Studio's Audiobook tool wraps your text automatically — this only matters for direct API users.
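
For direct API users, the effect of the missing declaration is easy to reproduce locally with any XML parser. The sketch below wraps a body in a `<speak>` root and checks well-formedness with Python's standard library. The mstts namespace value is the one quoted above; the `version` attribute, the W3C SSML default namespace, and both function names are assumptions for this example rather than documented FreeTTS requirements.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"   # W3C SSML namespace
MSTTS_NS = "http://www.w3.org/2001/mstts"         # value quoted on this page

def wrap_in_speak(body: str, voice: str, lang: str = "en-US") -> str:
    """Wrap an SSML body in a <speak> root that declares the mstts prefix."""
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" xmlns:mstts="{MSTTS_NS}" '
        f'xml:lang="{lang}"><voice name="{voice}">{body}</voice></speak>'
    )

def is_well_formed(ssml: str) -> bool:
    """True if the document parses as XML (an undeclared prefix fails here)."""
    try:
        ET.fromstring(ssml)
        return True
    except ET.ParseError:
        return False

body = '<mstts:express-as style="sad">Goodbye.</mstts:express-as>'
print(is_well_formed(wrap_in_speak(body, "en-US-JennyNeural")))          # declared prefix
print(is_well_formed(f'<speak><voice name="x">{body}</voice></speak>'))  # undeclared prefix
```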

Putting `mstts:backgroundaudio` inside `<voice>`

✗ Wrong: <voice name="..."> <mstts:backgroundaudio .../> ... </voice>
✓ Right: <speak> <mstts:backgroundaudio .../> <voice name="...">...</voice> </speak>

Background audio is per-document, not per-voice. Place it directly inside `<speak>`, before the `<voice>` block. Only one allowed per document.
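
The two placement rules (direct child of `<speak>`, at most one per document) can be checked before submission. This is a sketch using Python's standard XML parser; `check_background_audio` is a hypothetical helper, and it assumes the document declares the mstts namespace with the value shown on this page.

```python
import xml.etree.ElementTree as ET

MSTTS_NS = "http://www.w3.org/2001/mstts"
BG_TAG = f"{{{MSTTS_NS}}}backgroundaudio"   # ElementTree's {namespace}name form

def check_background_audio(ssml: str) -> list[str]:
    """Return a list of placement problems; an empty list means the rules are met."""
    root = ET.fromstring(ssml)
    all_bg = list(root.iter(BG_TAG))                        # anywhere in the document
    direct = [child for child in root if child.tag == BG_TAG]  # directly under <speak>
    problems = []
    if len(all_bg) > 1:
        problems.append("more than one mstts:backgroundaudio element")
    if len(direct) != len(all_bg):
        problems.append("mstts:backgroundaudio must be a direct child of <speak>")
    return problems

good = (f'<speak xmlns:mstts="{MSTTS_NS}">'
        '<mstts:backgroundaudio src="https://example.com/music.mp3" volume="0.4"/>'
        '<voice name="en-US-AriaNeural">Hello.</voice></speak>')
bad = (f'<speak xmlns:mstts="{MSTTS_NS}"><voice name="en-US-AriaNeural">'
       '<mstts:backgroundaudio src="https://example.com/music.mp3"/>Hello.</voice></speak>')
print(check_background_audio(good))
print(check_background_audio(bad))
```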

Expecting `<audio src='...'/>` to work

✗ Wrong: <audio src="https://example.com/clap.mp3"/>
✓ Right: Audiobook batch does not support inline audio injection. Use Timeline mode to chain pre-generated clips, or `mstts:backgroundaudio` for a single underlay track.

Azure's real-time API supports inline `<audio>` but the batch synthesis path we use for Audiobook mode does not. Plan accordingly for long-form projects.

Pro tips

Power-user shortcuts

  • Voice previews are free. The play button next to each voice in the picker uses a separate quota-free endpoint. Audition 50 voices before committing to one — none of it counts toward your monthly chars.
  • HD voices cost the same as Neural in terms of char usage, but Azure synthesis is ~20% slower. For interactive Single-tab work that's invisible. For 2M-char Audiobook batches it adds 5-10 minutes — usually still worth it.
  • Audiobook batches run server-side. You can close the tab. The job finishes whether you watch or not. Recent batches panel shows progress on next visit.
  • Concurrent batch limit is 3 per user. Submit more than three and the 4th returns 429 until one finishes.
  • Split very long batches deliberately. A 2M-char batch generates one ZIP. Submit per-chapter batches instead if you want chapter-level downloads — easier reorganization later.
  • Save your favorite voice setup. In Studio Single mode, click 'Save voice settings' to bookmark a voice + style + rate + pitch combination. Cuts the testing-then-rebuilding loop on multi-day projects.
  • Add to Timeline after every render. It's cheaper than regenerating later — if you might want to chain clips, save them as you go instead of looking for them in History.
  • Pronunciation dictionary is permanent. PRO/Creator can add `word → IPA` mappings that auto-apply across all generations. Much cleaner than wrapping every instance of 'Camarath' in `<phoneme>` tags.
  • Free tier: 1,000 chars/generation. PRO: 10,000. Creator: 25,000. Past those caps, the only option is Audiobook batch (up to 2,000,000).
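
If you generate clips one at a time and chain them in Timeline rather than submitting a batch, text has to be cut to fit the per-generation cap. The sketch below packs whole sentences greedily into chunks under the cap. The caps are the ones quoted above; the `split_for_plan` name and the simple punctuation-based sentence split are assumptions, and the splitter will not handle abbreviations such as "Dr." gracefully.

```python
import re

# Per-generation caps quoted above.
PLAN_CAPS = {"free": 1_000, "pro": 10_000, "creator": 25_000}

def split_for_plan(text: str, plan: str) -> list[str]:
    """Greedily pack whole sentences into chunks no longer than the plan's cap."""
    cap = PLAN_CAPS[plan]
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if len(sentence) > cap:
            raise ValueError("A single sentence exceeds the per-generation cap")
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= cap:
            current = candidate       # sentence still fits in the current chunk
        else:
            chunks.append(current)    # close the chunk and start a new one
            current = sentence
    if current:
        chunks.append(current)
    return chunks

sample = "Word. " * 300
chunks = split_for_plan(sample, "free")
print(len(chunks), [len(c) for c in chunks])
```

Keep Timeline's limit of 30 clips per merge in mind when choosing this route for long texts.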

Hard caps

Plan limits at a glance

Limit | Free | PRO ($19/mo) | Creator ($39/mo)
Per-generation chars (Single) | 1,000 | 10,000 | 25,000
Monthly chars (total) | 5,000 | 1,000,000 | 5,000,000
Audiobook batch (per job) | — | 2,000,000 | 2,000,000
Concurrent batch jobs | — | 3 | 3
HD voices | Preview only | ✓ | ✓
DragonHD voices | Preview only | ✓ | ✓
Multilingual voices | Preview only | ✓ | ✓
Dialogue mode | — | ✓ | ✓
Timeline editor | — | ✓ | ✓
Audiobook batch | — | — | ✓
Voice cloning | — | — | 1 clone · 100K cloned chars/mo
Pronunciation dictionary | — | ✓ | ✓
Generation history | — | 30 days | 30 days
Commercial license | — | ✓ | ✓
WAV / OGG export | MP3 only | ✓ | ✓
SRT subtitle export | ✓ | ✓ | ✓
Free API key | — | ✓ | ✓

Pronunciation

Fixing words the voice mispronounces

Three escalation paths — pick the smallest one that works for your case.

Quickest: rewrite phonetically

If a word is mispronounced once or twice, rewrite the spelling so the engine reads it correctly. Camarath → kuh-MARE-uth. Ugly in the transcript but fixes one-offs without any SSML.

One-shot fix: <sub alias="...">

Replace what the engine sees with what you want it to say.

<sub alias="Doctor">Dr.</sub> Watson said hello.

Use this for abbreviations and initialisms the engine misreads. The transcript still shows Dr.; only the audio uses Doctor.

Precise: <phoneme alphabet="ipa">

Force exact pronunciation with International Phonetic Alphabet symbols.

<phoneme alphabet="ipa" ph="kəˈmɛrəθ">Camarath</phoneme>

The ph attribute holds the IPA. The displayed text is preserved; only the audio uses your spelling. If you don't know IPA, the dashboard's Pronunciation Dictionary has a "Hear it" button that lets you audition different IPA strings until one sounds right.

Permanent: Pronunciation Dictionary PRO

Add Camarath → kəˈmɛrəθ once in the dashboard (Studio settings → Pronunciation). Every future generation across all modes uses it. Much cleaner than wrapping every instance in <phoneme>. Up to 500 mappings per account.
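For API users or anyone without dictionary access, the same effect can be approximated by wrapping words before submitting the text. This is a rough sketch under stated assumptions: `apply_pronunciations` is a hypothetical helper, matching is case-sensitive and whole-word only, and it says nothing about how the dashboard dictionary is implemented.

```python
import re
from typing import Mapping
from xml.sax.saxutils import escape, quoteattr


def apply_pronunciations(text: str, mapping: Mapping[str, str]) -> str:
    """Wrap each whole-word match of a mapping key in a <phoneme> tag.

    Text outside the matches is XML-escaped so the result can sit inside SSML.
    """
    if not mapping:
        return escape(text)
    # Longest keys first so a longer entry wins over a shorter one it contains.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    # With one capturing group, split() alternates plain text and matched words.
    parts = pattern.split(text)
    out = []
    for index, part in enumerate(parts):
        if index % 2 == 1:
            ipa = quoteattr(mapping[part])
            out.append(f'<phoneme alphabet="ipa" ph={ipa}>{escape(part)}</phoneme>')
        else:
            out.append(escape(part))
    return "".join(out)


if __name__ == "__main__":
    print(apply_pronunciations("Camarath fell.", {"Camarath": "kəˈmɛrəθ"}))
```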

IPA cheat sheet for common English sounds

| Sound | IPA | Example word |
|---|---|---|
| "a" in cat | æ | cæt |
| "a" in father | ɑ | fɑther |
| "e" in bed | ɛ | bɛd |
| "i" in machine | i | mishin |
| "i" in bit | ɪ | bɪt |
| "o" in note | oʊ | noʊt |
| "u" in moon | u | mun |
| schwa (the most common vowel) | ə | əbout |
| "th" in thin | θ | θin |
| "th" in this | ð | ðis |
| "sh" in shoe | ʃ | ʃoo |
| "zh" in measure | ʒ | meaʒure |
| Primary stress (placed before stressed syllable) | ˈ | ˈcamera |

Voice cloning

Creator-tier voice cloning — quick start

Train a voice that sounds like you (or any voice you have rights to record) from 60 seconds of audio. Once trained, your custom voice appears in the picker like any other voice — works in Single, Dialogue, Timeline, and Audiobook.

What you need

  • 60 seconds of clean audio of one person speaking naturally
  • No background music, no echo, no overlapping voices
  • USB mic or phone with a quiet room is enough — studio gear unnecessary
  • Variety: read a paragraph or two with a mix of question, statement, exclamation
  • WAV or MP3 file, 16 kHz or higher sample rate
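Before uploading, you can sanity-check a WAV recording against the sample-rate and length requirements above. A minimal sketch using only Python's standard `wave` module; `check_wav` is a hypothetical helper, it handles PCM WAV only (not MP3), and it treats 60 seconds as a minimum, which is an assumption about how the requirement is enforced.

```python
import wave

MIN_SAMPLE_RATE_HZ = 16_000  # "16 kHz or higher" from the list above
TARGET_SECONDS = 60.0        # "60 seconds of clean audio"; treated as a minimum here


def check_wav(path: str) -> dict:
    """Report sample rate, channel count, and duration of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
        channels = wav.getnchannels()
    duration = frames / rate
    return {
        "sample_rate_hz": rate,
        "channels": channels,
        "duration_s": duration,
        "rate_ok": rate >= MIN_SAMPLE_RATE_HZ,
        "length_ok": duration >= TARGET_SECONDS,
    }
```

This only inspects the file header. It cannot tell you whether the room is quiet or the delivery is varied, so listen to the clip as well.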

How to clone

  1. Open Dashboard → Voice Cloning.
  2. Click "New voice" and give it a memorable name.
  3. Upload your 60-second audio clip. Or record live in the browser.
  4. Wait ~5 minutes. Status flips to "Ready".
  5. Open Studio → your cloned voice shows in the picker under a "My voices" section.

Quality tips

  • Sample variety beats length. 60 seconds of varied prose beats 5 minutes of monotone reading.
  • Match the use case. If you'll narrate audiobooks, train on a paragraph being narrated. If you'll use it for dialogue, train on conversational delivery.
  • Avoid plosives. Pop filter on the mic, or position it slightly off-axis (5-10 degrees) so "p" and "b" bursts don't spike.
  • Clean room. Hard reverb (bathroom, empty kitchen) makes the clone sound boxy. Soft furnishings, closet with hanging clothes, or under a blanket beats a fancy mic in a bad room.

Limits & quotas

  • 1 active clone per Creator account. Delete your existing clone to make room for a new one.
  • 100,000 cloned characters/month dedicated quota (separate from your regular 5M).
  • Delete an unused clone at any time to free the slot.
  • Cloned voices count toward the same per-generation char cap as regular voices (25,000 in Single, 2M in Audiobook).

Ethical note

Only clone voices you own (yourself, voice actors who've consented in writing, public-domain recordings older than ~95 years). FreeTTS audits for likely-unauthorized clones and removes them. Cloning a celebrity, a podcast host, or any non-consenting voice violates the terms of service.

FAQ

Common questions about Studio

Which mode should I start in?

Single. It has the gentlest learning curve — type text, pick a voice, hit Generate. Most users only need Single to find their preferred voice and pacing. Move on to Dialogue when you need multiple characters, Timeline when you want to chain clips, and Audiobook when your project is too long for Single's 25,000-char cap.

Does the Single tab count against my monthly character budget?

Yes. Every Generate click in Single counts. Voice previews (the small play button next to a voice in the picker) do NOT count — those use a separate quota-free endpoint. Test as many voices as you want via Preview without burning chars.

Dialogue made one MP3 with all speakers. Can I get the per-speaker files separately?

Not directly. Dialogue muxes the conversation server-side and returns one merged file. If you need separate files per speaker, generate each speaker's lines in Single mode (switching voices between them), then add each one to Timeline for chaining. It's slower but gives you full control over the individual audio files.

What's the difference between Timeline and Audiobook?

Timeline assembles audio you already generated. Audiobook generates new audio from text. Use Timeline when you want to merge specific clips (e.g., chain three dialogue scenes plus a narrator intro). Use Audiobook when you want to give it a full chapter and get back a finished ZIP. They serve different jobs — Timeline is editor-style, Audiobook is batch-pipeline-style.

Why does Audiobook return a ZIP instead of one MP3?

Audiobook splits long text into chunks (Azure's batch API caps each chunk at ~10,000 chars). The ZIP contains one MP3 per chunk plus SRT subtitles. If you want a single merged file, you can either run the chunks through Timeline (add them all + merge) or use an external tool like ffmpeg / Audacity.

Can I use SSML markup in the Single tab too?

Yes — Single and Dialogue both accept SSML. The most useful tags are <break time="1s"/>, <emphasis level="strong">, <prosody rate="slow">, and <mstts:express-as style="cheerful">. Audiobook auto-detects SSML too (as of May 2026). All tags are lowercase — <Break/> with a capital B won't work.

Why is my Audiobook generation count showing 1 per batch instead of one per chunk?

Each Audiobook submission writes a single row to your history with the title and total chars, not one row per internal chunk. The History table's Source column will label it 'Audiobook' so you can tell it apart from Single (web), Dialogue (studio), and API entries.

How long does an Audiobook take to render?

Azure's batch service typically completes in roughly half the audio's playback length. A 1-hour audiobook takes ~30 minutes. Watch the dashboard 'Recent batches' panel for live progress. The job runs server-side, so you can close the tab — you'll see the finished link on next visit.

What's the difference between HD voices and DragonHD voices?

HD voices (like en-US-Aria:HDLatestNeural) are higher-bitrate versions of standard neural voices — same expressive control via mstts:express-as, just better audio quality. DragonHD voices (like en-US-Andrew:DragonHDLatestNeural) are a newer-generation model from Microsoft that automatically reads emotion from the text itself. You write "She gasped in horror" and DragonHD delivers it dramatically without you needing the terrified style. The catch: DragonHD rejects mstts:express-as tags entirely, so manual style control is gone. Use HD when you want explicit control; use DragonHD when you want natural delivery from natural writing.

Why does my style tag get ignored?

Three common reasons. First, your voice doesn't support that style — only certain voices have specific styles (the Studio chip picker shows what your selected voice supports). Second, you're on a DragonHD voice, which rejects express-as tags entirely. Third, you used the style outside an mstts:express-as wrapper. Audiobook auto-wraps the SSML envelope for you, so this only happens if you're writing raw SSML for the API.

Can I use SSML in the homepage box (not Studio)?

Yes. The homepage /api/tts endpoint and the browser extension both build a full SSML envelope from your text. Drop in <break/>, <emphasis>, <prosody>, even mstts:express-as — they all work on the homepage if the underlying voice supports them. The 1,000-char free-tier per-generation cap still applies though.

What's styledegree and when should I change it?

styledegree controls how intense the emotional style is. Default is 1.0. Range is 0.01 (barely-there hint) to 2.0 (cranked up). For most narration the default works fine. Bump to 1.5 or 2.0 for dramatic moments (a battle cry, a grief explosion). Drop to 0.5 for subtle inflection (a hint of sadness under a brave face). Add it as an attribute: <mstts:express-as style="sad" styledegree="0.5">.
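If you generate SSML from a script, clamping the value keeps it inside the documented range. A small sketch, assuming you are building raw SSML yourself; `express_as` is a hypothetical helper and does not check whether the chosen voice supports the style.

```python
from xml.sax.saxutils import escape, quoteattr

# Range quoted in the answer above.
STYLEDEGREE_MIN, STYLEDEGREE_MAX = 0.01, 2.0


def express_as(text: str, style: str, styledegree: float = 1.0) -> str:
    """Wrap text in mstts:express-as, clamping styledegree to 0.01..2.0."""
    degree = min(max(styledegree, STYLEDEGREE_MIN), STYLEDEGREE_MAX)
    return (
        f'<mstts:express-as style={quoteattr(style)} styledegree="{degree:g}">'
        f"{escape(text)}</mstts:express-as>"
    )


if __name__ == "__main__":
    print(express_as("I'm fine, really.", "sad", 0.5))
```

An out-of-range request such as `express_as("Charge!", "cheerful", 5)` is emitted with `styledegree="2"` rather than being passed through.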

How do I make a pause longer than 10 seconds?

Azure caps `<break time>` at 10 seconds per tag. For longer silences, stack multiple breaks: <break time="10s"/><break time="10s"/><break time="5s"/> gives you 25 seconds. Or use mstts:silence at the top of your document to set a global between-sentence silence if you just want spacious pacing throughout.
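The stacking arithmetic is easy to automate. A minimal sketch, assuming the 10-second per-tag cap stated above and whole-second pauses; `stacked_breaks` is a hypothetical helper.

```python
MAX_BREAK_SECONDS = 10  # per-tag cap quoted in the answer above


def stacked_breaks(total_seconds: int) -> str:
    """Return consecutive <break/> tags that add up to total_seconds."""
    if total_seconds < 0:
        raise ValueError("total_seconds must be non-negative")
    full, rest = divmod(total_seconds, MAX_BREAK_SECONDS)
    parts = [f'<break time="{MAX_BREAK_SECONDS}s"/>'] * full
    if rest:
        parts.append(f'<break time="{rest}s"/>')
    return "".join(parts)


if __name__ == "__main__":
    print(stacked_breaks(25))
```

For 25 seconds this produces two 10-second breaks followed by one 5-second break, matching the example in the answer.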

Does Studio support voice cloning?

Yes, on the Creator plan. You record 60 seconds of clean audio of yourself talking naturally (no background noise, normal pace), upload it from /dashboard?section=voice-clone, and the system trains a personal voice in about 5 minutes. Once cloned, your voice appears in the picker like any other voice. Creator includes 1 voice clone and 100,000 cloned characters per month. Quality scales with the recording — invest in 60 seconds of really clean audio and the clone will sound impressively close.

What happens if my SSML is invalid?

Different paths handle invalid SSML differently. Real-time TTS (Single, Dialogue) returns a clear error in the UI telling you what's wrong — usually with the line position. Audiobook batch is less forgiving — if one chunk has invalid SSML, that whole chunk fails and is skipped (the rest succeed). Test SSML in Single mode first if you're adding it to a long batch — much faster feedback loop.
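A local well-formedness check catches the most common mistakes (unclosed tags, bare ampersands) before you submit anything. This sketch uses Python's standard XML parser and wraps the fragment in a `speak` envelope with the standard SSML and `mstts` namespaces. `ssml_error` is a hypothetical helper; it checks XML syntax only, so a well-formed but unsupported tag still passes, and reported column numbers are offset by the envelope.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Envelope that declares the namespaces used by SSML and mstts: tags.
ENVELOPE = (
    '<speak version="1.0" xml:lang="en-US" '
    'xmlns="http://www.w3.org/2001/10/synthesis" '
    'xmlns:mstts="http://www.w3.org/2001/mstts">{body}</speak>'
)


def ssml_error(fragment: str) -> Optional[str]:
    """Return None if the fragment is well-formed XML, else the parser message."""
    try:
        ET.fromstring(ENVELOPE.format(body=fragment))
    except ET.ParseError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(ssml_error('Wait <break time="1s"/> here.'))
    print(ssml_error('<emphasis level="strong">oops'))
```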

Can I generate audio in two languages in one file?

Yes, three ways. Easiest: pick a Multilingual voice (look for "Multilingual" in the voice name) which can speak ~12 languages with one identity. Most precise: use <lang xml:lang="fr-FR">phrase</lang> inside any voice to switch languages mid-sentence for that span. Most flexible: use Dialogue mode and assign each language to a different voice — you get separate native speakers for each language.

Why does the same voice sound slightly different on different generations?

Neural voices have a small amount of natural variation — the same text rendered twice won't be byte-identical even with the same voice and settings. Variation is normally subtle (different breath placement, slight rhythm differences). Larger differences usually come from context — the voice reads a question differently from a statement, an exclamation differently from a declaration. This is a feature, not a bug, and is part of what makes neural voices feel human.

What's the Pronunciation Dictionary?

A PRO/Creator feature that stores custom word → pronunciation mappings that auto-apply to ALL your generations. Add 'Camarath' → 'kəˈmɛrəθ' once, and every time you write Camarath in any Studio mode it gets pronounced correctly. Much cleaner than wrapping every instance in <phoneme>. Find it in the dashboard sidebar under Studio settings. Uses International Phonetic Alphabet (IPA) — there's a 'Hear it' button to audition before saving.

Why is my audiobook output a ZIP of chunks instead of one merged MP3?

Azure's batch synthesis caps each chunk at ~10,000 chars to keep individual jobs fast. A 200K-char audiobook becomes ~20 chunks, each delivered as a separate MP3 in the ZIP along with SRT subtitles. To merge into a single MP3: extract the ZIP, then either (a) use Studio's Timeline mode to add each chunk and merge, or (b) use ffmpeg locally: `ffmpeg -f concat -i list.txt -c copy output.mp3` where list.txt contains one `file 'chunk1.mp3'` line per chunk, in playback order. The ZIP keeps things flexible: chunk-level files are easier to edit and reorganize than one giant MP3.
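Writing list.txt by hand gets tedious past a few chunks. The sketch below builds it from an extracted ZIP folder, sorting names so that chunk10 follows chunk9 rather than chunk1. It assumes the chunks are `.mp3` files whose names contain no single quotes; the actual file names inside the ZIP are not documented here, and `write_concat_list` is a hypothetical helper.

```python
import re
from pathlib import Path


def natural_key(name: str) -> list:
    """Sort key that compares digit runs numerically (chunk2 before chunk10)."""
    return [int(tok) if tok.isdigit() else tok.lower()
            for tok in re.split(r"(\d+)", name)]


def write_concat_list(chunk_dir: str) -> list[str]:
    """Write list.txt inside chunk_dir for ffmpeg's concat demuxer.

    Returns the chunk file names in the order they were written.
    """
    folder = Path(chunk_dir)
    names = sorted((p.name for p in folder.glob("*.mp3")), key=natural_key)
    lines = [f"file '{name}'" for name in names]
    (folder / "list.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")
    return names
```

The list is written next to the chunks because the concat demuxer resolves relative entries against the list file's own directory. Run the ffmpeg command from that folder, or pass the full path to list.txt after `-i`.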

Can I set per-segment voices in Dialogue mode without writing SSML?

Yes — that's exactly what Dialogue is for. Write your script as `Speaker: line of dialogue` (one per row). Assign each speaker a voice in the picker. Dialogue mode auto-generates the multi-voice SSML for you. Per-line style overrides are also supported via the small style dropdown next to each line. You only need to drop into raw SSML if you want effects beyond voice + style (custom pauses, prosody, etc.).
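To make the `Speaker: line` format concrete, here is an illustrative parser that turns such a script into per-speaker `<voice>` elements. It is only a sketch of the general idea, not how Dialogue mode is implemented. `parse_script` and `to_voice_ssml` are hypothetical names, and the voice names in the usage example are placeholders you would replace with voices from the picker.

```python
from xml.sax.saxutils import escape, quoteattr


def parse_script(script: str) -> list[tuple[str, str]]:
    """Split 'Speaker: line' rows into (speaker, line) pairs, skipping blanks."""
    turns = []
    for raw in script.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Split on the first colon only, so colons inside the dialogue survive.
        speaker, sep, text = line.partition(":")
        if not sep or not speaker.strip() or not text.strip():
            raise ValueError(f"expected 'Speaker: line', got {raw!r}")
        turns.append((speaker.strip(), text.strip()))
    return turns


def to_voice_ssml(turns: list[tuple[str, str]], voices: dict[str, str]) -> str:
    """Emit one <voice> element per turn using a speaker-to-voice mapping."""
    return "".join(
        f"<voice name={quoteattr(voices[speaker])}>{escape(text)}</voice>"
        for speaker, text in turns
    )


if __name__ == "__main__":
    turns = parse_script("Anna: Hi there.\nBen: Hello again.")
    print(to_voice_ssml(turns, {"Anna": "en-US-AriaNeural", "Ben": "en-US-GuyNeural"}))
```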

What's the audio quality difference between MP3 / WAV / OGG?

All three are generated from the same 24 kHz neural synthesis source. MP3 (default, 160 kbps) is universal and small. WAV is uncompressed PCM — same source quality, ~2.5x file size, ideal for re-editing in a DAW. OGG/Opus is the most modern, smallest at equivalent quality, but older browsers and devices don't support it. Pick MP3 unless you have a specific reason to use the others.
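The "~2.5x" figure can be checked with simple bitrate arithmetic. The sketch below assumes the WAV export is 16-bit mono PCM, which the answer does not state; the 24 kHz sample rate and 160 kbps MP3 bitrate come from the answer itself.

```python
SAMPLE_RATE_HZ = 24_000  # from the answer above
MP3_KBPS = 160           # from the answer above
BIT_DEPTH = 16           # assumption: 16-bit PCM
CHANNELS = 1             # assumption: mono output


def wav_kbps() -> float:
    """Bitrate of uncompressed PCM in kilobits per second."""
    return SAMPLE_RATE_HZ * BIT_DEPTH * CHANNELS / 1000


def size_mb(kbps: float, seconds: float) -> float:
    """Approximate file size in megabytes, ignoring container overhead."""
    return kbps * 1000 * seconds / 8 / 1_000_000


if __name__ == "__main__":
    print(wav_kbps(), wav_kbps() / MP3_KBPS)
    print(size_mb(MP3_KBPS, 60), size_mb(wav_kbps(), 60))
```

Under these assumptions PCM runs at 384 kbps, about 2.4 times the MP3 bitrate, so a one-minute clip is roughly 1.2 MB as MP3 and 2.9 MB as WAV.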
