Studio · Help

FreeTTS Studio: how the pieces fit together

Two modes — Single voice and Dialogue — plus a Timeline to chain takes and regenerate-one-line to fix a single wobble. Most projects only need Single voice; the rest unlock when your project gets longer, has multiple characters, or needs scene-by-scene control.


Modes & panels

What each mode actually does

Single voice

Start here

Free 1,000 · PRO 10,000 · Creator 25,000 chars per render

When to use: One block of text. Voice testing. Quick clips. Tweaking pitch / rate / style per generation.

The default Studio mode. Type or paste your script (up to your plan's per-render cap), pick any voice, optionally tweak style (cheerful, sad, whispering, etc.), speed, and pitch. Each Generate produces one MP3 saved to your history. Paid plans can 'Regenerate by paragraph' to fix a single line in place. Good for previews, narration, social voiceovers, and assembling longer projects in the Timeline.

Output: One MP3 per render. Add it to the Timeline to chain with other clips.

Dialogue

Multi-voice

Up to ~10 segments per render

When to use: Conversations between characters. Multi-speaker scenes. Anything with line-by-line voice changes.

Write your script in the format `Speaker: line`, one line per row. Assign each speaker a different voice. Hit Generate and Dialogue muxes the conversation server-side — you get back ONE merged MP3 with all speakers, pauses between lines (configurable per-line), and proper turn-taking. Saves you from stitching 30 separate MP3s by hand.

Output: One merged MP3 with all speakers. Add it to Timeline to combine with other scenes.
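
If you draft scripts outside the Studio, a quick local check of the `Speaker: line` format can catch malformed rows before you paste. The sketch below is only an illustration: `parse_dialogue` is a hypothetical helper, not the parser the Studio uses, and it assumes the first colon on each row separates the speaker from the line.

```python
def parse_dialogue(script: str) -> list[tuple[str, str]]:
    """Split a 'Speaker: line' script into (speaker, line) pairs."""
    turns = []
    for row in script.splitlines():
        row = row.strip()
        if not row:
            continue  # ignore blank rows
        # Split on the FIRST colon only, so times like 9:30 stay in the line.
        speaker, sep, line = row.partition(":")
        if not sep or not speaker.strip() or not line.strip():
            raise ValueError(f"Expected 'Speaker: line', got: {row!r}")
        turns.append((speaker.strip(), line.strip()))
    return turns


script = """
Anna: Did you hear that?
Ben: Hear what?
Anna: Never mind. It's 9:30, let's go.
"""
turns = parse_dialogue(script)
speakers = sorted({speaker for speaker, _ in turns})  # voices you need to assign
```

`len(turns)` gives the segment count, which you can compare against the per-render segment limit before generating.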

Timeline

Overlay · PRO

Up to 30 clips per merge

When to use: Chaining multiple Single voice or Dialogue outputs into one continuous file with custom pauses between clips.

An overlay panel (not a separate mode). It holds clips you already generated. After each Single voice or Dialogue render, click 'Add to Timeline' below the audio player. Reorder clips, set pause-after for each (0–10s), preview individual clips, and merge into one continuous MP3 when ready. Useful for multi-scene projects: intro narration → dialogue scene 1 → narrator break → dialogue scene 2 → outro.

Output: One continuous MP3 with all clips and pauses baked in.
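
To estimate how long a merged file will run, add up clip lengths and pauses. This is a rough sketch with hypothetical names (`merged_seconds`, `MAX_CLIPS`, `MAX_PAUSE_S`); it assumes the pause after the final clip is also included in the merge, which the Studio may or may not do.

```python
MAX_CLIPS = 30       # "Up to 30 clips per merge"
MAX_PAUSE_S = 10.0   # pause-after range is 0-10 s


def merged_seconds(clips: list[tuple[float, float]]) -> float:
    """clips holds (clip_seconds, pause_after_seconds); returns total seconds."""
    if len(clips) > MAX_CLIPS:
        raise ValueError(f"{len(clips)} clips exceeds the {MAX_CLIPS}-clip limit")
    total = 0.0
    for clip_s, pause_s in clips:
        if not 0 <= pause_s <= MAX_PAUSE_S:
            raise ValueError(f"pause {pause_s}s is outside 0-{MAX_PAUSE_S}s")
        total += clip_s + pause_s  # assumption: trailing pause counts too
    return total


# intro narration, dialogue scene, outro
total = merged_seconds([(12.0, 1.5), (40.0, 1.5), (8.0, 0.0)])
```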

At a glance

Which one for what

Need	Use	Why
Test a voice on real text	Single voice	Per-render control, full style/pitch/rate sliders.
One narrator, 50+ pages	Single voice + Timeline	Render in chunks under your per-render cap, then chain them in the Timeline.
Multiple characters speaking	Dialogue	Server-side mux into one MP3 with proper pauses.
Fix one weak line without redoing the render	Regenerate ↻	Re-render just that line as a fresh take; it re-stitches automatically.
Chain clips with pauses	Timeline	Reorder + set pauses + merge into one MP3.
Per-paragraph emotion (sad / cheerful / whispering)	Single voice / Dialogue (SSML)	Select text, right-click → Emotion, or wrap in `<mstts:express-as style="...">`.
Custom pauses between paragraphs	Single voice / Dialogue (SSML)	Right-click → Pause, or use `<break time="1.5s"/>`.
Programmatic generation from scripts	Developer API	See /developers.

Common workflows

"I want to make…"

I have a single narrator and a 50-page chapter.

Open Single voice mode and pick an HD voice (look for the ✨ filter — they read with the most natural cadence at length).
Render your text in chunks that fit your per-render cap (free 1,000 · PRO 10,000 · Creator 25,000). Add `<break time="1.5s"/>` between paragraphs and `<emphasis>` on key terms via right-click.
Click 'Add to Timeline' after each chunk.
Switch to the Timeline overlay, set the pause after each clip, then Merge into one continuous MP3.
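
Splitting a chapter by hand gets tedious, so here is a minimal sketch of the chunking step. `chunk_paragraphs` is a hypothetical helper; it assumes paragraphs are separated by blank lines and that the cap counts every character in the pasted text (whether SSML tags count toward the cap is not confirmed here).

```python
def chunk_paragraphs(text: str, cap: int) -> list[str]:
    """Group whole paragraphs into chunks of at most `cap` characters."""
    chunks, current = [], ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        if len(para) > cap:
            raise ValueError("One paragraph exceeds the cap; split it by hand.")
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= cap:
            current = candidate        # paragraph still fits in this chunk
        else:
            chunks.append(current)     # close the chunk, start a new one
            current = para
    if current:
        chunks.append(current)
    return chunks


chapter = "\n\n".join(["a" * 400, "b" * 400, "c" * 400])
chunks = chunk_paragraphs(chapter, cap=1_000)  # free-plan cap
```

Breaking only at paragraph boundaries keeps each render from ending mid-sentence, at the cost of slightly uneven chunk sizes.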

I have a story with 3 characters who all speak.

Write each multi-character SCENE in Dialogue mode.
Assign each speaker a distinct voice.
Render the scene. Click 'Add to Timeline' below the audio player.
Write narrative bridging text in Single voice mode and add each to the Timeline too.
Open the Timeline overlay, reorder clips, set pauses (e.g. 1.5s between scenes), then Merge.
Download the single merged MP3.

I want emotional variation across paragraphs (sad scene then cheerful scene).

In Single voice, select the sad paragraph, right-click → Emotion → sad (or type `<mstts:express-as style="sad">your text</mstts:express-as>`).
Select the cheerful paragraph and apply cheerful the same way.
Available styles: cheerful, sad, excited, calm, whispering, angry, hopeful, friendly, terrified, shouting, unfriendly.
Tags must be lowercase. Generate — and if one paragraph lands wrong, use 'Regenerate by paragraph' to fix just that line.
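
If you prepare scripts in a text pipeline, the same tagging can be sketched in a few lines. `express` and `STYLES` are hypothetical names for illustration; the style set is copied from the list in this workflow, and a given voice may support fewer.

```python
from xml.sax.saxutils import escape

# Styles listed in this workflow; not every voice supports all of them.
STYLES = {"cheerful", "sad", "excited", "calm", "whispering", "angry",
          "hopeful", "friendly", "terrified", "shouting", "unfriendly"}


def express(text: str, style: str) -> str:
    """Wrap one paragraph in a lowercase express-as tag, escaping &, <, >."""
    style = style.lower()
    if style not in STYLES:
        raise ValueError(f"Unknown style: {style}")
    return f'<mstts:express-as style="{style}">{escape(text)}</mstts:express-as>'


scene = [("sad", "The house was empty & cold."),
         ("cheerful", "Then the dog came bounding in!")]
ssml = "\n".join(express(text, style) for style, text in scene)
```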

I want to test 10 voices to find my favorite for a project.

Open Single voice mode.
Open the voice picker. Click the play button (▶) next to each voice — those previews don't count toward your monthly quota.
Once you've narrowed to 2–3 candidates, paste a real paragraph from your project and Generate. Now you'll hear how each handles YOUR text at your target pace.
Tweak speed/pitch sliders to taste, then 'Save project' to reuse the whole setup later.

SSML quick reference

The 4 tags that cover 95% of use cases

SSML is the markup language inside the text input that controls pauses, emphasis, pitch, and emotion. It works in both Single voice and Dialogue — select text and right-click (or long-press) to insert tags, or type them directly. All tags lowercase only.

`<break time="1s"/>`: Insert a pause. Use 2s, 500ms, etc.

`<emphasis level="moderate">word</emphasis>`: Stress a word or phrase. Levels: reduced / moderate / strong.

`<prosody rate="slow" pitch="-2st">text</prosody>`: Tempo and pitch. Rates: x-slow / slow / medium / fast / x-fast. Pitch in semitones (+2st, -2st).

`<mstts:express-as style="cheerful">text</mstts:express-as>`: Emotional style. Options: cheerful, sad, excited, calm, whispering, angry, hopeful, friendly, terrified, shouting, unfriendly.

SSML deep reference

Everything else SSML can do

The cheatsheet above handles most projects. Below is the full surface FreeTTS supports — broken into categories so you can find what you need. All examples are copy-pastable.

Pauses & silence

Two ways to insert silence. `<break>` is the W3C standard. `<mstts:silence>` is an mstts extension that gives you finer placement control.

break (time)

<break time="2s"/>

Hard pause for the exact duration. Units: ms (milliseconds) or s (seconds). Cap is 10s; longer values are clamped.

break (strength)

<break strength="strong"/>

Semantic pause length. Options (shortest → longest): "none", "x-weak", "weak", "medium", "strong", "x-strong". Use strength when you want pauses that scale with the voice's natural cadence; use time when you need exact timing.

mstts:silence (leading)

<mstts:silence type="leading" value="500ms"/>

Adds silence at the very start of the audio. Options for `type`: "leading", "tailing", "sentenceboundary", "leading-exact", "tailing-exact", "sentenceboundary-exact". The -exact variants override the engine's built-in silences instead of adding to them.

mstts:silence (between sentences)

<mstts:silence type="sentenceboundary" value="800ms"/>

Adds 800ms between every sentence in the document. Great for slow audiobook pacing or training content. Place this once at the top of your SSML, not per sentence.
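
When pauses are generated by a script, it can help to mirror the 10 s clamp locally so the value you write is the value you get. A small sketch; `break_tag` is a hypothetical helper, not part of FreeTTS.

```python
def break_tag(seconds: float) -> str:
    """Build a <break> tag, clamped to the documented 0-10 s range."""
    clamped = min(max(seconds, 0.0), 10.0)
    return f'<break time="{round(clamped * 1000)}ms"/>'


short_pause = break_tag(1.5)
too_long = break_tag(25)  # clamped to the 10 s cap
```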

Prosody — rate, pitch, volume

All three attributes accept named keywords, relative percentages, or absolute values. Combine them in a single `<prosody>` tag.

prosody (rate, named)

<prosody rate="x-slow">slowed text</prosody>

Named rates: "x-slow" (0.5x), "slow" (0.7x), "medium" (1.0x), "fast" (1.3x), "x-fast" (1.5x). Easiest to read in scripts.

prosody (rate, percent)

<prosody rate="85%">precisely controlled</prosody>

Absolute percentage (50%-200%). Use when named rates don't hit the exact pacing you want. 85% is a popular audiobook setting.

prosody (rate, relative)

<prosody rate="+20%">slightly faster than ambient</prosody>

Relative shift from the surrounding rate. Useful when you want one paragraph to be a bit faster than the rest without committing to an absolute speed.

prosody (pitch, semitones)

<prosody pitch="-2st">lower pitch</prosody>

Pitch shift in semitones (st). Range roughly -12st to +12st before voices distort. -2st makes most voices sound a touch more serious; +2st adds energy. Also accepts "+2Hz", "x-low", "low", "medium", "high", "x-high", or percentages.

prosody (volume)

<prosody volume="loud">SHOUTING WITHOUT CAPS</prosody>

Named volumes: "silent", "x-soft", "soft", "medium", "loud", "x-loud". Or use decibels (e.g. `volume="+6dB"`). Lets you bake quiet/loud passages into a single output without an audio editor.

prosody (combined)

<prosody rate="slow" pitch="-1st" volume="soft">whispered confession</prosody>

All three attributes work in one tag. Combine for distinctive effects — slow + low + soft = intimate; fast + high + loud = panic.
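
For scripted rendering, a speed multiplier can be turned into a `rate` value with the named keywords and the 50%-200% range described above. This is a sketch only; `rate_attr` and `NAMED_RATES` are hypothetical names, and the keyword-to-multiplier mapping is taken from the named-rates entry in this section.

```python
# Named rates from this section, keyed by percent of normal speed.
NAMED_RATES = {50: "x-slow", 70: "slow", 100: "medium", 130: "fast", 150: "x-fast"}


def rate_attr(multiplier: float) -> str:
    """Return a named rate when one matches exactly, else an absolute percent."""
    percent = round(multiplier * 100)
    if not 50 <= percent <= 200:
        raise ValueError("rate must be between 0.5x and 2.0x")
    return NAMED_RATES.get(percent, f"{percent}%")


audiobook = rate_attr(0.85)
tag = f'<prosody rate="{audiobook}">precisely controlled</prosody>'
```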

Pronunciation control

Three ways to fix mispronunciations. Phoneme is the most precise; sub is the easiest; pronunciation dictionary (PRO) is permanent across all your generations.

sub (alias)

<sub alias="Doctor">Dr.</sub> Smith

Replaces the displayed text with the alias for speech only. "Dr." gets spoken as "Doctor". Useful for abbreviations the engine misreads (Mr, Mrs, etc.), medical/legal shorthand, or initialisms.

phoneme (IPA)

<phoneme alphabet="ipa" ph="kəˈmɛrəθ">Camarath</phoneme>

Force-pronounce a word using International Phonetic Alphabet symbols. Best for proper nouns, fantasy names, technical terms. The displayed text is still shown in transcripts; only the audio uses your phoneme spelling.

phoneme (SAPI)

<phoneme alphabet="sapi" ph="t aw m ax t ow">tomato</phoneme>

SAPI phonetic alphabet. Easier to read than IPA if you're not a linguist; only works with English voices. Each phoneme is space-separated.

phoneme (UPS)

<phoneme alphabet="x-microsoft-ups" ph="T1 OW0 M EY1 T OW0">tomato</phoneme>

Universal Phone Set. A cross-language phoneme system. Use the `x-microsoft-ups` alphabet name; the SSML standard requires vendor-defined alphabet names to carry the `x-` prefix. Includes optional stress markers (1 = primary, 2 = secondary, 0 = none).

lexicon (PRO)

<lexicon uri="https://your.cdn/pronunciations.xml"/>

Loads an external pronunciation dictionary (W3C PLS XML format). PRO/Creator users typically use the in-app Pronunciation Dictionary instead — same effect, no external hosting needed. Set once at the top of your SSML.

Say-as — interpret strings literally

Forces the engine to read text as a specific kind of data. Without `<say-as>`, the engine guesses ("1234" might be read as "one thousand two hundred thirty-four" or "one two three four" depending on context).

characters / spell-out

<say-as interpret-as="characters">NASA</say-as>

Reads each character individually: "N-A-S-A". Use for initialisms you want spelled letter by letter. "spell-out" is a synonym.

cardinal

<say-as interpret-as="cardinal">12345</say-as>

Reads numbers as cardinal numbers: "twelve thousand three hundred forty-five". Use when context might otherwise force digit-by-digit reading.

ordinal

The <say-as interpret-as="ordinal">3</say-as> rule

Reads as ordinal numbers: "third". Without this, "3" might be read as "three".

digits / number_digit

<say-as interpret-as="digits">2024</say-as>

Reads each digit separately: "two zero two four". Useful for years pronounced digit-by-digit, account IDs, etc.

fraction

<say-as interpret-as="fraction">1/2</say-as>

Reads as a fraction: "one-half" (English) or the locale equivalent. Works for both "1/2" and "1 1/2" formats.

date (mdy / dmy / ymd)

<say-as interpret-as="date" format="dmy">12-03-2026</say-as>

Reads a date with explicit format. Format strings: "mdy", "dmy", "ymd", "md", "dm", "ym", "my", "d", "m", "y", "yyyymmdd". Engine will say "the twelfth of March, two thousand twenty-six". Avoids the US-vs-rest-of-world confusion entirely.

time

<say-as interpret-as="time" format="hms24">15:30:00</say-as>

Reads a time. Formats: "hms12" (am/pm), "hms24" (24-hour), "ms" (minutes:seconds). The above reads as "fifteen thirty hours" / "three thirty PM" depending on locale.

telephone

<say-as interpret-as="telephone">+1-555-867-5309</say-as>

Reads phone numbers naturally with country code grouping. Strips dashes and spaces, reads digits one at a time in conventional groupings.

currency

<say-as interpret-as="currency" language="en-US">$42.50</say-as>

Reads currency with the unit name: "forty-two dollars and fifty cents". Optional `language` attribute helps when the currency symbol is ambiguous.

address

<say-as interpret-as="address">221B Baker St</say-as>

Reads street addresses with the right pacing — number before street, expanding common abbreviations (St → Street, Ave → Avenue).
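
When dates come from code rather than from typed text, emitting the format explicitly avoids the day/month ambiguity described in the date entry. A minimal sketch; `say_as_date` is a hypothetical helper.

```python
from datetime import date


def say_as_date(day: date) -> str:
    """Render a date as day-month-year and declare that order via format="dmy"."""
    return f'<say-as interpret-as="date" format="dmy">{day.strftime("%d-%m-%Y")}</say-as>'


tag = say_as_date(date(2026, 3, 12))
```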

Language switching mid-document

Use `<lang>` to read a foreign phrase in its native pronunciation without switching voices entirely. The voice has to support the target language for this to sound right.

lang

She said <lang xml:lang="fr-FR">bonjour mon ami</lang> with a smile.

Reads the wrapped text in the specified language using the current voice. Multilingual voices (look for "Multilingual" in the voice name) handle ~12 languages each — best for code-switching scripts. Standard voices may fall back to phonetic approximation.

voice (nested)

<voice name="en-US-AriaNeural">Hello.</voice> <voice name="es-ES-ElviraNeural">Hola.</voice>

Swap voices mid-document. Each `<voice>` block can be a completely different voice in a different language. Useful when you need an actual native speaker for the foreign passage instead of a multilingual voice's approximation. This is what Dialogue mode does for you automatically.

Expressive styles (mstts)

`<mstts:express-as>` is the emotional style tag. Not every voice supports every style. See the "All expressive styles" section below for the complete catalog. Combine with `styledegree` (0.01–2.0) and `role` (for certain Chinese voices).

express-as (basic)

<mstts:express-as style="hopeful">tomorrow will be different</mstts:express-as>

Wraps text in an emotional style. The voice must support the style — Aria, Jenny, Davis (US English) have the widest catalogs; British and other locales have fewer.

express-as (style degree)

<mstts:express-as style="excited" styledegree="2">she's here!</mstts:express-as>

Style intensity. 0.01 = barely-there hint of the emotion; 1.0 (default) = normal; 2.0 = exaggerated. Use higher degrees for dramatic moments, lower for subtle inflection.

express-as (role)

<mstts:express-as style="default" role="YoungAdultFemale">line of dialogue</mstts:express-as>

Roleplay attribute — makes the voice imitate a different speaker type. Options: "Boy", "Girl", "YoungAdultFemale", "YoungAdultMale", "OlderAdultFemale", "OlderAdultMale", "SeniorFemale", "SeniorMale". Currently only some Chinese voices (zh-CN-XiaomoNeural, zh-CN-XiaoxuanNeural, zh-CN-YunxiNeural, zh-CN-YunyeNeural) support this. Pairs well with Dialogue mode for character variety from a single voice.

Structure & markers

Optional but useful for long-form audiobooks, captions, and engines parsing your output.

p (paragraph)

<p>Paragraph one. Two sentences.</p> <p>Paragraph two.</p>

Explicit paragraph boundary. The engine adds a natural pause between `<p>` blocks (slightly longer than between sentences). Useful when your text has weird line breaks the engine would otherwise misinterpret.

s (sentence)

<s>This is a sentence.</s> <s>This is another.</s>

Explicit sentence boundary. Forces the engine to treat the wrapped text as a complete sentence even without terminal punctuation — useful for fragments like "Yes." that might otherwise blend into the next.

bookmark

And then <bookmark mark="chapter-2-start"/> she opened the door.

Inserts a named position marker. Doesn't affect audio but appears in the boundary metadata exposed via /api/v1/tts. Useful when you're building a player that needs to jump to specific scenes.

Background audio (mstts)

Mix a background audio track (music, ambience, white noise) under the synthesized speech. Audio file must be publicly accessible HTTPS.

mstts:backgroundaudio

<mstts:backgroundaudio src="https://example.com/music.mp3" volume="0.4" fadein="2000" fadeout="3000"/>

Plays the source audio under the entire speech track. `volume` is 0.0–1.0 (0.4 = 40% volume — keeps speech intelligible). `fadein`/`fadeout` in milliseconds. Place inside `<speak>` but outside `<voice>`. Only allowed once per document.

Heads up on SSML rules. Three things trip up almost everyone: (1) Tag names are lowercase: `<Break/>` won't work, only `<break/>`. (2) `&`, `<`, and `>` in your text must be escaped (`&amp;`, `&lt;`, `&gt;`) inside SSML. The Studio escapes them for you in plain-text mode, but not once you start adding tags. (3) DragonHD voices (the `:DragonHDLatestNeural` ones) reject `<mstts:express-as>` tags; they read emotion from context instead.

All expressive styles

Every `mstts:express-as` style FreeTTS supports

Not every voice supports every style. The Studio's style chip picker (in Single voice) shows what your selected voice supports. Aria, Jenny, and the Chinese voices Xiaoxiao/Yunxi have the widest catalogs. British and most localized voices typically support just cheerful + sad.

Emotion

cheerful: Bright, upbeat. Default pick for happy moments and positive announcements.

sad: Slower, lower pitch, downward inflection. Use for grief, regret, somber news.

angry: Harder consonants, elevated volume. Adversarial dialogue, frustration.

excited: Faster, higher energy than cheerful. Big-reveal moments, action scenes.

fearful: Trembling, slightly higher pitch, breathy. Suspense and horror.

terrified: Extreme of fearful: short bursts, rapid breaths, top of the register.

hopeful: Warm, slightly tentative, rising inflection. Bridging sad → cheerful.

disgruntled: Annoyed but restrained. Sarcasm, minor complaints.

embarrassed: Hushed, slightly halting. Apologies, awkward moments.

serious: Steady, even-paced, low expression. News commentary, formal narration.

calm: Smooth, lower energy than serious. Meditation, instructions, ASMR-adjacent.

Voice quality

whispering: Breathy, very low volume. Intimate scenes, secrets, ASMR. Best paired with prosody volume='soft'.

shouting: Maximum volume + clipped delivery. Battle scenes, distant calls. Use sparingly; listening fatigue is real.

gentle: Soft, mid-pitch, evenly paced. Children's storytelling, comforting.

lyrical: Slight musical inflection. Poetry, song-like delivery.

Conversational

friendly: Warm and approachable. Default for tutorials, onboarding, support content.

unfriendly: Cold, dismissive. Antagonist dialogue, hostile NPC.

empathetic: Soft, slow, validating tone. Customer-service apologies, sensitive subjects.

chat: Casual, mid-energy. Podcast-style discussion, informal updates.

assistant: Polite, helpful, neutral. Built for AI assistants and voice UIs.

customerservice: Professional friendly. Phone-system-style helpful answers.

Narration

narration-professional: Clean audiobook-style narration. The default 'just read it well' choice for long-form.

narration-relaxed: Looser pacing than professional. Personal essays, memoir.

documentary-narration: Authoritative, evenly paced. Educational video voiceovers, science explainers.

newscast: Generic newscaster tone. Use the more specific casual/formal variants if available on your voice.

newscast-casual: Approachable newscaster for feature segments and morning shows.

newscast-formal: Traditional newscaster gravitas for breaking news and formal reports.

poetry-reading: Slower pacing with deliberate emphasis on cadence. Verse, prose poetry.

advertisement-upbeat: Punchy, energetic commercial reading. Product launches, promos.

sports-commentary: Faster pace, dynamic stress. Sports calls, live event narration.

sports-commentary-excited: Sports-commentary cranked up, for game-winning moments.

Voice tiers

Standard vs Neural vs Multilingual vs HD vs DragonHD

FreeTTS exposes 5 voice tiers under one picker. Knowing which tier a voice belongs to matters because they have different SSML quirks and different ideal use cases.

Standard

Free

Legacy concatenative voices. Recognizable robotic edge. Mostly retired in favor of Neural — kept for niche use cases.

When to use: Almost never. The free Neural tier sounds dramatically better at the same cost.

SSML: Full SSML.

Neural

Free + PRO

Standard neural voices — 400+ across 75+ languages. Natural intonation, expressive on supported styles, fast generation. The workhorse tier.

When to use: Default choice for everything. Tutorials, podcasts, video voiceover, casual audiobooks.

SSML: Full SSML including `<mstts:express-as>` for supported styles.

Multilingual

PRO

Single voice that speaks 12+ languages with the same voice identity. Best for code-switching content — your French phrase doesn't suddenly become someone else.

When to use: Bilingual narration, language learning content, scripts with embedded foreign phrases.

SSML: Full SSML. Pairs well with `<lang xml:lang="...">` for mid-sentence language switches.

HD

PRO

High-definition voices. Higher audio bitrate and noticeably more natural prosody. Look for "HDLatest" in the voice name.

When to use: Audiobook narration, professional voiceover, anywhere quality matters more than generation speed.

SSML: Full SSML support.

DragonHD

PRO · Premium

Our newest-generation tier (released 2025). Reads context naturally and automatically expresses emotion from the text without explicit style tags. Look for ":DragonHDLatestNeural" suffix.

When to use: Highest-quality narration when you want emotion to come from the writing itself, not from markup.

SSML: Most SSML works EXCEPT `<mstts:express-as>` style tags (DragonHD rejects them and reads emotion from context instead). Use other tags like `<prosody>` and `<break>` normally.

Output formats

MP3 vs WAV vs OGG — which to pick

Format	Specs	Size	When to use
MP3	160 kbps mono · 24 kHz	~1.2 MB per minute	Default. Universal playback, smallest files, good quality. Lossy compression. Final-mix-then-export workflows lose a touch of quality each re-encode.
WAV	16-bit PCM · 24 kHz mono	~2.9 MB per minute	Professional editing in Audition, Pro Tools, Reaper. Mastering. Re-encoding to any other format without quality loss. Uncompressed. Ships at studio-friendly specs. Use this if the file goes into a DAW.
OGG	Opus · 24 kHz	~1.0 MB per minute	Web playback in audio elements where Safari support isn't a concern. Smallest files at equivalent quality to MP3. Open-source codec. Modern browsers all support it; older devices may not.
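
The size column follows from the bitrate, and the arithmetic is easy to reproduce when you need to estimate storage for a long project. A sketch with hypothetical helper names, assuming constant bitrate and 1 MB = 1,000,000 bytes:

```python
def pcm_kbps(sample_rate_hz: int, bit_depth: int, channels: int) -> float:
    """Bitrate of uncompressed PCM audio in kilobits per second."""
    return sample_rate_hz * bit_depth * channels / 1000


def mb_per_minute(kbps: float) -> float:
    """Megabytes per minute of audio at a constant bitrate."""
    bytes_per_second = kbps * 1000 / 8
    return bytes_per_second * 60 / 1_000_000


mp3_mb = mb_per_minute(160)                      # 160 kbps MP3
wav_mb = mb_per_minute(pcm_kbps(24_000, 16, 1))  # 16-bit, 24 kHz, mono WAV
```

These work out to 1.2 MB and 2.88 MB per minute, which matches the table. Container headers add a little on top, so treat the results as estimates.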

Voice recommendations

Best voices by use case

Tested-and-recommended picks for common projects. Preview each in the voice picker before committing — taste is individual.

Long-form audiobook (single narrator)

`en-US-Andrew:DragonHDLatestNeural`: Top pick. DragonHD reads emotion from context, which is exactly what you want for fiction.
`en-US-AvaMultilingualNeural`: Use for mixed-language books. Same voice identity across English, Spanish, French, etc.
`en-US-JennyNeural`: Solid default neural. Wide expressive style support if you want manual control.
`en-GB-LibbyNeural`: British narration for UK-set or period-set fiction.

News, current events, factual content

`en-US-AriaNeural` with `newscast-formal`: Authority + clarity. Used by many news automations.
`en-US-BrandonNeural`: Male newscaster cadence.
`en-US-DavisNeural`: Mid-energy, factual.

Podcast / conversational

`en-US-GuyNeural` with style='chat': Casual, mid-energy. Sounds like a real podcaster.
`en-US-JennyMultilingualNeural`: Approachable warm female voice. Multilingual flexibility.
`en-US-Christopher:DragonHDLatestNeural`: DragonHD male with natural inflection, no style hacking needed.

Educational / explainer

`en-US-AriaNeural` with `documentary-narration`: Standard for science explainers and tutorials.
`en-US-Emma:DragonHDLatestNeural`: DragonHD female. Clean, even-paced explainer voice.

Children's content / storytelling

`en-US-AnaNeural`: Younger-sounding voice. Designed for kid-friendly content.
`en-US-JennyNeural` with style='gentle': Warm storytelling tone.

Multi-voice dialogue scenes

Mix any two contrasting voices in Dialogue mode: Pick voices with distinctly different pitches and accents. A young female + an older male reads more clearly than two similar voices.
Add 'role' attribute for Chinese voices: zh-CN-XiaomoNeural, zh-CN-XiaoxuanNeural, etc. can switch between Boy/Girl/YoungAdult/OlderAdult roles for variety from one voice.

Gotchas

Common mistakes and fixes

Uppercase SSML tag names

✗ Wrong: `<Break time="1s"/>`

✓ Right: `<break time="1s"/>`

SSML is case-sensitive. Capitalized tags don't parse and get spoken aloud as text — or fail the whole chunk.
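
A quick lint pass can flag capitalized tag names before you generate. The sketch below is a simple regex check under the assumption that any `<` followed by a letter starts a tag; `uppercase_tags` is a hypothetical helper. Attribute values such as `role="YoungAdultFemale"` are left alone, since only tag names are matched.

```python
import re

# Captures the tag name right after "<" or "</".
TAG_NAME = re.compile(r"</?([A-Za-z][\w:.-]*)")


def uppercase_tags(ssml: str) -> list[str]:
    """Return every tag name that contains an uppercase letter."""
    return [name for name in TAG_NAME.findall(ssml) if name != name.lower()]


bad = uppercase_tags('Wait. <Break time="1s"/> Go. <break time="1s"/>')
```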

Unescaped `&` in text

✗ Wrong: Tom & Jerry decided to go.

✓ Right: Tom &amp; Jerry decided to go.

Inside SSML, `&` starts an XML entity. The fix is `&amp;` (a bare `&` is fine outside any SSML block in PlainText mode). The same goes for `<` (use `&lt;`) and `>` (use `&gt;`) inside SSML.

`<mstts:express-as>` on a DragonHD voice

✗ Wrong: `<mstts:express-as style="cheerful">…</mstts:express-as>` (on en-US-Andrew:DragonHDLatestNeural)

✓ Right: Just write expressive text. DragonHD reads emotion from context.

DragonHD voices explicitly reject `mstts:express-as` tags. Strip them out or switch to a Neural voice if you need explicit style control.
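
If you reuse one script across voice tiers, the stripping step can be automated. A minimal sketch, assuming DragonHD voices can be recognized by "DragonHD" in the voice name; `prepare_for_voice` is a hypothetical helper. It removes only the express-as wrappers and keeps the text and other tags intact.

```python
import re

# Matches opening and closing express-as tags, with any attributes.
EXPRESS_AS = re.compile(r"</?mstts:express-as\b[^>]*>")


def prepare_for_voice(ssml: str, voice: str) -> str:
    """Drop express-as wrappers when the target voice is a DragonHD voice."""
    if "DragonHD" in voice:
        return EXPRESS_AS.sub("", ssml)
    return ssml


source = '<mstts:express-as style="cheerful">Hi!</mstts:express-as> <break time="1s"/>'
dragon = prepare_for_voice(source, "en-US-Andrew:DragonHDLatestNeural")
neural = prepare_for_voice(source, "en-US-JennyNeural")
```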

Using a style the voice doesn't support

✗ Wrong: `<mstts:express-as style="poetry-reading">…</mstts:express-as>` (on en-US-GuyNeural)

✓ Right: Check the per-voice supported styles. Aria has the most; British voices typically only cheerful + sad.

Unsupported styles are silently ignored — the audio plays neutral instead of poetic. Use the Studio's chip picker to see which styles your selected voice supports.

Forgetting the mstts namespace

✗ Wrong: `<speak>` with no mstts namespace declaration (in raw SSML files outside the Studio)

✓ Right: `xmlns:mstts="http://www.w3.org/2001/mstts"` on `<speak>`

If you're writing raw SSML for the API, the `<speak>` root must declare the mstts namespace before any `mstts:` tag will parse. The Studio wraps your text automatically — this only matters for direct API users.
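
For direct API users, the sketch below builds a complete document and checks that it is well-formed XML before sending. `wrap_speak` is a hypothetical helper. The `version` attribute and the default SSML namespace follow the W3C SSML convention and are an assumption about what the API expects; the mstts namespace URI is the one given above.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"  # W3C SSML namespace
MSTTS_NS = "http://www.w3.org/2001/mstts"        # as given in this section


def wrap_speak(voice: str, body: str, lang: str = "en-US") -> str:
    """Wrap already-escaped SSML body text in <speak> and <voice>."""
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" '
        f'xmlns:mstts="{MSTTS_NS}" xml:lang="{lang}">'
        f"<voice name={quoteattr(voice)}>{body}</voice></speak>"
    )


doc = wrap_speak(
    "en-US-JennyNeural",
    '<mstts:express-as style="calm">Breathe in.</mstts:express-as>',
)
root = ET.fromstring(doc)  # raises ParseError if a prefix is undeclared
```

Parsing locally catches the missing-namespace mistake (and unescaped `&`) before a request is spent on it.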

Putting `mstts:backgroundaudio` inside `<voice>`

✗ Wrong: `<voice name="..."> <mstts:backgroundaudio .../> ... </voice>`

✓ Right: `<speak> <mstts:backgroundaudio .../> <voice name="...">...</voice> </speak>`

Background audio is per-document, not per-voice. Place it directly inside `<speak>`, before the `<voice>` block. Only one allowed per document.

Expecting `<audio src='...'/>` to work

✗ Wrong: `<audio src="https://example.com/clap.mp3"/>`

✓ Right: Use the Timeline to chain pre-generated clips, or `mstts:backgroundaudio` for a single underlay track.

Arbitrary inline `<audio src>` clips aren't supported by the synthesis path.

Pro tips

Power-user shortcuts

Voice previews are free. The play button next to each voice in the picker uses a separate quota-free endpoint. Audition 50 voices before committing to one — none of it counts toward your monthly chars.
HD voices cost the same as Neural in char usage, but synthesis is ~20% slower. For interactive Studio work that's invisible; on a long render it adds a little time. Usually still worth it.
Regenerate one line, not the whole thing. Hover a line (in Dialogue, or single-voice 'Regenerate by paragraph') and hit ↻ for a fresh take of just that line — it re-stitches automatically, and you can step between takes with ‹ ›.
Save your project (PRO). 'Save project' keeps a named, server-saved copy you can reopen on any device — pick up exactly where you left off.
For very long narration, render in chunks under your per-render cap and chain them in the Timeline. Set the pause after each, then merge to one continuous MP3.
Add to Timeline after every render. It's cheaper than regenerating later — if you might want to chain clips, save them as you go instead of hunting for them in History.
Pronunciation dictionary is permanent. PRO/Creator can add `word → IPA` mappings that auto-apply across all generations. Much cleaner than wrapping every instance of 'Camarath' in `<phoneme>` tags.
Free tier: 1,000 chars/generation. PRO: 10,000. Creator: 25,000. For longer pieces, render in chunks and chain them in the Timeline.

Hard caps

Plan limits at a glance

Limit	Free	PRO ($19/mo)	Creator ($39/mo)
Per-generation chars (Single)	1,000	10,000	25,000
Monthly chars (total)	5,000	1,000,000	5,000,000
HD voices	Preview only	✓	✓
DragonHD voices	Preview only	✓	✓
Multilingual voices	Preview only	✓	✓
Dialogue mode	—	✓	✓
Timeline editor	—	✓	✓
Audiobook batch (Creator dashboard)	—	—	✓
Voice cloning	—	—	1 clone · 100K cloned chars/mo
Pronunciation dictionary	—	✓	✓
Generation history	—	30 days	30 days
Commercial license	—	✓	✓
WAV / OGG export	MP3 only	✓	✓
SRT subtitle export	✓	✓	✓
Free API key	—	✓	✓
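
To see how a project fits a plan, divide its length by the per-render cap and compare it with the monthly allowance. A sketch using the numbers from the table; `PLANS` and `plan_budget` are hypothetical names, and the render count is a lower bound because real chunks break at paragraph boundaries and regenerated takes are not counted.

```python
import math

# Character limits copied from the plan table.
PLANS = {
    "free": {"per_render": 1_000, "monthly": 5_000},
    "pro": {"per_render": 10_000, "monthly": 1_000_000},
    "creator": {"per_render": 25_000, "monthly": 5_000_000},
}


def plan_budget(total_chars: int, plan: str) -> dict:
    """Minimum number of renders and whether the text fits the monthly quota."""
    limits = PLANS[plan]
    return {
        "min_renders": math.ceil(total_chars / limits["per_render"]),
        "fits_monthly": total_chars <= limits["monthly"],
    }


chapter_chars = 90_000  # example length, not a measured figure
on_pro = plan_budget(chapter_chars, "pro")
on_free = plan_budget(chapter_chars, "free")
```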

Pronunciation

Fixing words the voice mispronounces

Three escalation paths — pick the smallest one that works for your case.

Quickest: rewrite phonetically

If a word is mispronounced once or twice, rewrite the spelling so the engine reads it correctly. Camarath → kuh-MARE-uth. Ugly in the transcript but fixes one-offs without any SSML.

One-shot fix: `<sub alias="...">`

Replace what the engine sees with what you want it to say.

<sub alias="Doctor">Dr.</sub> Watson said hello.

Use this for abbreviations and initialisms the engine misreads. The transcript still shows Dr.; only the audio uses Doctor.

Precise: `<phoneme alphabet="ipa">`

Force exact pronunciation with International Phonetic Alphabet symbols.

<phoneme alphabet="ipa" ph="kəˈmɛrəθ">Camarath</phoneme>

The ph attribute holds the IPA. The displayed text is preserved; only the audio uses your spelling. If you don't know IPA, the dashboard's Pronunciation Dictionary has a "Hear it" button that lets you audition different IPA strings until one sounds right.

Permanent: Pronunciation Dictionary (PRO)

Add Camarath → kəˈmɛrəθ once in the dashboard (Studio settings → Pronunciation). Every future generation across all modes uses it. Much cleaner than wrapping every instance in <phoneme>. Up to 500 mappings per account.
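
If the dictionary isn't an option for a given script, wrapping every occurrence by hand can be scripted instead. A minimal sketch; `apply_phoneme` is a hypothetical helper that matches whole words only and should be run on plain text that has no `<phoneme>` tags yet, otherwise words would be wrapped twice.

```python
import re
from xml.sax.saxutils import quoteattr


def apply_phoneme(text: str, word: str, ipa: str) -> str:
    """Wrap each whole-word occurrence of `word` in an IPA <phoneme> tag."""
    pattern = re.compile(rf"\b{re.escape(word)}\b")
    tag = f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>{word}</phoneme>'
    # A function replacement avoids backslash processing in the tag text.
    return pattern.sub(lambda _match: tag, text)


line = apply_phoneme("Camarath fell. Camarath rose.", "Camarath", "kəˈmɛrəθ")
```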

IPA cheat sheet for common English sounds

Sound	IPA	Example word
"a" in cat	`æ`	c`æ`t
"a" in father	`ɑ`	f`ɑ`ther
"e" in bed	`ɛ`	b`ɛ`d
"i" in machine	`i`	mach`i`ne
"i" in bit	`ɪ`	b`ɪ`t
"o" in note	`oʊ`	n`oʊ`t
"u" in moon	`u`	m`u`n
schwa (the most common vowel)	`ə`	`ə`bout
"th" in thin	`θ`	`θ`in
"th" in this	`ð`	`ð`is
"sh" in shoe	`ʃ`	`ʃ`oo
"zh" in measure	`ʒ`	mea`ʒ`ure
Primary stress (placed before stressed syllable)	`ˈ`	`ˈ`camera

Voice cloning

Creator-tier voice cloning — quick start

Train a voice that sounds like you (or any voice you have rights to record) from 60 seconds of audio. Once trained, your custom voice appears in the picker like any other voice — works in Single voice and Dialogue, and its takes chain in the Timeline.

What you need

60 seconds of clean audio of one person speaking naturally
No background music, no echo, no overlapping voices
USB mic or phone with a quiet room is enough — studio gear unnecessary
Variety: read a paragraph or two with a mix of question, statement, exclamation
WAV or MP3 file, 16 kHz or higher sample rate
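
For WAV samples, length and sample rate can be checked locally before uploading. This sketch uses Python's standard `wave` module; `check_sample` is a hypothetical helper and treats 60 seconds as a minimum, which is an assumption. It cannot judge noise, echo, or variety, and it does not read MP3 files.

```python
import wave


def check_sample(source, min_seconds: float = 60.0, min_rate_hz: int = 16_000) -> list[str]:
    """Return problems found in a WAV sample; an empty list means it looks fine.

    `source` is a file path or a binary file-like object.
    """
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        seconds = wav.getnframes() / rate
    problems = []
    if rate < min_rate_hz:
        problems.append(f"sample rate {rate} Hz is below {min_rate_hz} Hz")
    if seconds < min_seconds:
        problems.append(f"only {seconds:.1f} s of audio, need {min_seconds:.0f} s")
    return problems
```

Call it as `check_sample("my_voice.wav")` (the file name is a placeholder) and fix anything it reports before uploading.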

How to clone

Open Dashboard → Voice Cloning.
Click "New voice" and give it a memorable name.
Upload your 60-second audio clip. Or record live in the browser.
Wait ~5 minutes. Status flips to "Ready".
Open Studio → your cloned voice shows in the picker under a "My voices" section.

Quality tips

Sample variety beats length. 60 seconds of varied prose beats 5 minutes of monotone reading.
Match the use case. If you'll narrate audiobooks, train on a paragraph being narrated. If you'll use it for dialogue, train on conversational delivery.
Avoid plosives. Pop filter on the mic, or position it slightly off-axis (5-10 degrees) so "p" and "b" bursts don't spike.
Clean room. Hard reverb (bathroom, empty kitchen) makes the clone sound boxy. Soft furnishings, closet with hanging clothes, or under a blanket beats a fancy mic in a bad room.

Limits & quotas

1 active clone per Creator account. Delete your existing clone to make room for a new one.
100,000 cloned characters/month as a dedicated quota (separate from your regular 5M-character monthly budget).
Delete an unused clone any time to free the slot.
Cloned voices count toward the same per-render char cap as any other voice (free 1,000 · PRO 10,000 · Creator 25,000).
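
If a script is longer than your per-render cap, one way to prepare it is to split on paragraph breaks so each piece fits in a single render. The sketch below is only an illustration: `split_for_renders` is a hypothetical helper, and it counts characters with plain `len()`, which may not match how Studio counts them (for example, whether SSML tags count).

```python
# Per-render caps as listed above.
CAPS = {"free": 1_000, "pro": 10_000, "creator": 25_000}


def split_for_renders(script: str, plan: str) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into chunks under the cap."""
    cap = CAPS[plan]
    chunks: list[str] = []
    current = ""
    for para in script.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        if len(para) > cap:
            raise ValueError(f"one paragraph is {len(para)} chars, over the {cap} cap")
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= cap:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Each chunk then becomes one render you can add to the Timeline.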

Ethical note

Only clone voices you own (yourself, voice actors who've consented in writing, public-domain recordings older than ~95 years). FreeTTS audits for likely-unauthorized clones and removes them. Cloning a celebrity, a podcast host, or any non-consenting voice violates the terms of service.

FAQ

Common questions about Studio

Which mode should I start in?

Single voice. It has the gentlest learning curve — type text, pick a voice, hit Generate. Most users only need Single voice to find their preferred voice and pacing. Move on to Dialogue when you need multiple characters, and add renders to the Timeline when you want to chain several clips into one file. Per-render caps: free 1,000 · PRO 10,000 · Creator 25,000 characters.

Does Single voice go through my monthly character budget?

Yes. Every Generate in Single voice counts. Voice previews (the small play button next to a voice in the picker) do NOT count — those use a separate quota-free endpoint. Test as many voices as you want via Preview without burning chars.

Dialogue made one MP3 with all speakers. Can I get the per-speaker files separately?

Not directly. Dialogue muxes the conversation server-side and returns one merged file. If you need separate files per speaker, generate each speaker's lines in Single voice (switching voices between them), then add each one to the Timeline for chaining. It's slower but gives you full control over the individual audio files.

What does the Timeline do?

The Timeline is an overlay that assembles audio you already generated. After any Single voice or Dialogue render, click 'Add to Timeline' below the player. Reorder clips, set a pause after each (0–10s), preview any clip, and merge into one continuous MP3. Great for multi-scene projects: intro narration → dialogue scene 1 → narrator break → dialogue scene 2 → outro.

Can I fix one bad line without re-rendering everything?

Yes — that's regenerate-one-line. In Dialogue (and single-voice 'Regenerate by paragraph' mode), hover any line and hit ↻ to render just that line as a fresh take; the rest of the track is untouched and it re-stitches automatically. Step between takes with ‹ › — it's non-destructive, so a worse take is one tap to undo.

Can I use SSML markup in Single voice too?

Yes — Single voice and Dialogue both accept SSML. The most useful tags are <break time="1s"/>, <emphasis level="strong">, <prosody rate="slow">, and <mstts:express-as style="cheerful">. Select text and right-click (or long-press on touch) to insert them, or type them directly. All tags are lowercase — <Break/> with a capital B won't work.
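
If you hand-type tags, a quick local check for stray capitals can save a failed render. This is a heuristic sketch, not Studio's own validator. The function name `uppercase_tags` is made up, and the regex only looks at tag names, not attributes.

```python
import re

# Matches the name right after "<" or "</", e.g. "break" or "mstts:express-as".
TAG_NAME = re.compile(r"</?([A-Za-z][\w:.-]*)")


def uppercase_tags(ssml: str) -> list[str]:
    """Return tag names that contain any uppercase letter."""
    return [name for name in TAG_NAME.findall(ssml) if name != name.lower()]


sample = 'Wait.<Break time="1s"/> <emphasis level="strong">Now</emphasis>'
print(uppercase_tags(sample))
```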

What's the difference between HD voices and DragonHD voices?

HD voices (like en-US-Aria:HDLatestNeural) are higher-bitrate versions of standard neural voices. Same expressive control via mstts:express-as tags, just better audio quality. DragonHD voices (like en-US-Andrew:DragonHDLatestNeural) are our newest-generation tier and automatically read emotion from the text itself. You write "She gasped in horror" and DragonHD delivers it dramatically without you needing the terrified style. The catch: DragonHD ignores mstts:express-as tags entirely, so manual style control is gone. Use HD when you want explicit control. Use DragonHD when you want natural delivery from natural writing.

Why does my style tag get ignored?

Five common reasons. First, your voice doesn't support that style — only certain voices have specific styles (the Studio chip picker shows what your selected voice supports). Second, you're on a DragonHD voice, which reads emotion from context and ignores explicit style tags entirely. Third, the wrapped span is too short — emotions need at least 3 words to be reliably audible because the voice needs time to transition into and out of the style; single-word emotion wraps often render as neutral. Wrap a longer phrase, or bump the intensity slider higher. Fourth, the wrap crosses a sentence boundary — every period, exclamation, or question mark resets the voice's prosody, so a single emotion wrap that spans two sentences usually only renders the style on one of them. Wrap each sentence separately for the most reliable result. Fifth, you used the style outside an mstts:express-as wrapper. The Studio's emotion picker wraps the SSML envelope for you, so this only happens if you're writing raw SSML for the API.
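
For the fourth reason, wrapping each sentence separately is easy to script if you're generating raw SSML. A minimal sketch, assuming a naive sentence split on `.`, `!`, or `?` followed by whitespace (abbreviations like "Dr." will trip it up). `wrap_sentences` is a hypothetical name.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def wrap_sentences(text: str, style: str) -> str:
    """Give every sentence its own mstts:express-as wrapper."""
    # Naive split: break after ., ! or ? when whitespace follows.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    wrapped = [
        f"<mstts:express-as style={quoteattr(style)}>{escape(s)}</mstts:express-as>"
        for s in sentences
        if s
    ]
    return " ".join(wrapped)


print(wrap_sentences("She froze. Then she ran!", "terrified"))
```

Sentences shorter than three words may still render as neutral, per the third reason above.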

Can I use SSML in the homepage box (not Studio)?

Yes. The homepage /api/tts endpoint and the browser extension both build a full SSML envelope from your text. Drop in <break/>, <emphasis>, <prosody>, even mstts:express-as — they all work on the homepage if the underlying voice supports them. The 1,000-char free-tier per-generation cap still applies though.

What's styledegree and when should I change it?

styledegree controls how intense the emotional style is. Default is 1.0. Range is 0.01 (barely-there hint) to 2.0 (cranked up). For most narration the default works fine. Bump to 1.5 or 2.0 for dramatic moments (a battle cry, a grief explosion). Drop to 0.5 for subtle inflection (a hint of sadness under a brave face). Add it as an attribute: <mstts:express-as style="sad" styledegree="0.5">.
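
If you build these tags in code, it's worth clamping the value to the documented 0.01 to 2.0 range before sending it. A small sketch, with the helper name `express_as` invented for illustration:

```python
from xml.sax.saxutils import escape, quoteattr


def express_as(text: str, style: str, degree: float = 1.0) -> str:
    """Build an mstts:express-as tag with styledegree clamped to 0.01..2.0."""
    degree = min(2.0, max(0.01, degree))
    # The :g format drops trailing zeros, so 0.5 stays "0.5" and 2.0 becomes "2".
    return (
        f"<mstts:express-as style={quoteattr(style)} "
        f'styledegree="{degree:g}">{escape(text)}</mstts:express-as>'
    )


print(express_as("I'm fine.", "sad", 0.5))
```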

How do I make a pause longer than 10 seconds?

The `<break time>` tag caps at 10 seconds. For longer silences, stack multiple breaks: <break time="10s"/><break time="10s"/><break time="5s"/> gives you 25 seconds. Or use mstts:silence at the top of your document to set a global between-sentence silence if you just want spacious pacing throughout.
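
The stacking is simple arithmetic: as many full 10-second breaks as fit, plus one for the remainder. A sketch in Python (`long_pause` is a made-up helper, and it only handles whole seconds):

```python
def long_pause(seconds: int, cap: int = 10) -> str:
    """Return stacked <break> tags adding up to `seconds`, none over `cap`."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    full, rest = divmod(seconds, cap)
    parts = [f'<break time="{cap}s"/>'] * full
    if rest:
        parts.append(f'<break time="{rest}s"/>')
    return "".join(parts)


print(long_pause(25))
# prints <break time="10s"/><break time="10s"/><break time="5s"/>
```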

Does Studio support voice cloning?

Yes, on the Creator plan. You record 60 seconds of clean audio of yourself talking naturally (no background noise, normal pace), upload it from /dashboard?section=voice-clone, and the system trains a personal voice in about 5 minutes. Once cloned, your voice appears in the picker like any other voice. Creator includes 1 voice clone and 100,000 cloned characters per month. Quality scales with the recording — invest in 60 seconds of really clean audio and the clone will sound impressively close.

What happens if my SSML is invalid?

Studio returns a clear error in the UI telling you what's wrong — usually with the line position — so you can fix it and re-render. A quick sanity check: keep tags lowercase and balanced (every <emphasis> has a matching </emphasis>), and note that the right-click menu inserts well-formed tags for you, so you rarely have to hand-write them.
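
If you write SSML outside Studio, you can catch unbalanced or mis-cased tags locally before rendering. The sketch below only checks XML well-formedness with Python's standard library. It is not Studio's validator and knows nothing about which tags or styles a voice supports. The wrapper element and its placeholder namespace URI are assumptions made so that `mstts:`-prefixed tags parse.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper: binds the mstts prefix so prefixed tags parse.
# The URI is a placeholder, not necessarily the one Studio uses.
WRAPPER = '<speak xmlns:mstts="urn:example:mstts">\n{body}\n</speak>'


def check_ssml(fragment: str):
    """Return None if the fragment is well-formed, else (line, column)."""
    try:
        ET.fromstring(WRAPPER.format(body=fragment))
    except ET.ParseError as err:
        line, column = err.position
        # The wrapper adds one line above the fragment, so shift by one.
        return (line - 1, column)
    return None


print(check_ssml("Fine line.\n<emphasis>Hi</Emphasis>"))
```

An unclosed tag is only detected at the end of the wrapper, so the reported line can fall just past the end of your text. An unescaped `&` is caught as well.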

Can I generate audio in two languages in one file?

Yes, three ways. Easiest: pick a Multilingual voice (look for "Multilingual" in the voice name) which can speak ~12 languages with one identity. Most precise: use <lang xml:lang="fr-FR">phrase</lang> inside any voice to switch languages mid-sentence for that span. Most flexible: use Dialogue mode and assign each language to a different voice — you get separate native speakers for each language.

Why does the same voice sound slightly different on different generations?

Neural voices have a small amount of natural variation — the same text rendered twice won't be byte-identical even with the same voice and settings. Variation is normally subtle (different breath placement, slight rhythm differences). Larger differences usually come from context — the voice reads a question differently from a statement, an exclamation differently from a declaration. This is a feature, not a bug, and is part of what makes neural voices feel human.

What's the Pronunciation Dictionary?

A PRO/Creator feature that stores custom word → pronunciation mappings that auto-apply to ALL your generations. Add 'Camarath' → 'kəˈmɛrəθ' once, and every time you write Camarath in any Studio mode it gets pronounced correctly. Much cleaner than wrapping every instance in <phoneme>. Find it in the dashboard sidebar under Studio settings. Uses International Phonetic Alphabet (IPA) — there's a 'Hear it' button to audition before saving.

Can I set per-segment voices in Dialogue mode without writing SSML?

Yes — that's exactly what Dialogue is for. Write your script as `Speaker: line of dialogue` (one per row). Assign each speaker a voice in the picker. Dialogue mode auto-generates the multi-voice SSML for you. Per-line style overrides are also supported via the small style dropdown next to each line. You only need to drop into raw SSML if you want effects beyond voice + style (custom pauses, prosody, etc.).
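
If you draft scripts elsewhere, a quick format check catches rows Dialogue might not read as `Speaker: line`. This is a sketch of one plausible parsing rule (split on the first colon). How Studio itself parses rows isn't documented here, and `parse_dialogue` is a hypothetical name.

```python
def parse_dialogue(script: str) -> list[tuple[str, str]]:
    """Parse 'Speaker: line' rows into (speaker, line) pairs."""
    turns = []
    for number, row in enumerate(script.splitlines(), start=1):
        row = row.strip()
        if not row:
            continue  # skip blank rows
        # Split on the first colon only, so times like 5:30 stay in the line.
        speaker, sep, line = row.partition(":")
        if not sep or not speaker.strip() or not line.strip():
            raise ValueError(f"row {number}: expected 'Speaker: line'")
        turns.append((speaker.strip(), line.strip()))
    return turns


script = "Anna: Hi there.\nBen: It's 5:30 already?"
for speaker, line in parse_dialogue(script):
    print(speaker, "->", line)
```

The same pairs can be grouped by speaker if you want to render each character's lines separately in Single voice.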

What's the audio quality difference between MP3 / WAV / OGG?

All three are generated from the same 24 kHz neural synthesis source. MP3 (default, 160 kbps) is universal and small. WAV is uncompressed PCM — same source quality, ~2.5x file size, ideal for re-editing in a DAW. OGG/Opus is the most modern, smallest at equivalent quality, but older browsers and devices don't support it. Pick MP3 unless you have a specific reason to use the others.
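
The "~2.5x" figure follows from the bitrates. The sketch below works it through, assuming the WAV is 16-bit mono PCM at 24 kHz. The bit depth and channel count are assumptions, since only the sample rate is stated above.

```python
def pcm_kbps(sample_rate_hz: int = 24_000, bits: int = 16, channels: int = 1) -> float:
    """Bitrate of uncompressed PCM in kilobits per second."""
    return sample_rate_hz * bits * channels / 1000


def size_mb(kbps: float, seconds: float) -> float:
    """Approximate audio payload size in megabytes (headers ignored)."""
    return kbps * 1000 / 8 * seconds / 1_000_000


wav_kbps = pcm_kbps()   # 384.0 kbps under the assumptions above
mp3_kbps = 160          # default MP3 bitrate stated above

print(f"WAV/MP3 ratio: {wav_kbps / mp3_kbps:.1f}x")
print(f"1 minute of MP3: {size_mb(mp3_kbps, 60):.2f} MB")
print(f"1 minute of WAV: {size_mb(wav_kbps, 60):.2f} MB")
```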
