Four modes. One Studio. Most users only need Single. The rest unlock when your project gets longer, has multiple characters, or needs scene-by-scene control.
The four modes
Single
When to use: One block of text. Voice testing. Quick clips. Tweaking pitch / rate / style per generation.
The default Studio tab. Paste up to 25k chars, pick any voice, optionally tweak style (cheerful, sad, whispering, etc.), speed, and pitch. Each click of Generate produces one MP3 saved to your history. Good for previews, short narration, social media voiceovers, and assembling longer projects with Timeline.
Output: One MP3 per click. Add it to Timeline to chain with other clips.
Dialogue
When to use: Conversations between characters. Multi-speaker scenes. Anything with line-by-line voice changes.
Write your script in the format `Speaker: line`, one line per row. Assign each speaker a different voice. Hit Generate and Dialogue muxes the conversation server-side — you get back ONE merged MP3 with all speakers, pauses between lines (configurable per-line), and proper turn-taking. Saves you from stitching 30 separate MP3s by hand.
Output: One merged MP3 with all speakers. Add it to Timeline to combine with other scenes.
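Before pasting a long script into Dialogue, it can help to check locally that every row follows the `Speaker: line` format. The sketch below is a minimal, hypothetical pre-flight check; `parse_dialogue` and `speakers_in_order` are illustrative names, not part of FreeTTS, and the Studio's own parser may treat edge cases differently.

```python
from typing import List, Tuple


def parse_dialogue(script: str) -> List[Tuple[str, str]]:
    """Split a 'Speaker: line' script into (speaker, text) pairs."""
    turns = []
    for raw in script.splitlines():
        row = raw.strip()
        if not row:
            continue  # ignore blank rows
        # Split on the first colon only, so colons inside the line survive.
        speaker, sep, text = row.partition(":")
        if not sep or not speaker.strip() or not text.strip():
            raise ValueError(f"Expected 'Speaker: line', got: {row!r}")
        turns.append((speaker.strip(), text.strip()))
    return turns


def speakers_in_order(turns: List[Tuple[str, str]]) -> List[str]:
    """List each distinct speaker once, in order of first appearance."""
    seen: List[str] = []
    for speaker, _ in turns:
        if speaker not in seen:
            seen.append(speaker)
    return seen
```

The speaker list tells you how many voices you need to assign before hitting Generate.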
Timeline
When to use: Chaining multiple Single or Dialogue outputs into one continuous file with custom pauses between clips.
Timeline holds clips you already generated. After each Single or Dialogue render, click 'Add to Timeline' below the audio player. Drag clips to reorder, set pause-after for each (0–10s), preview individual clips, and merge into one continuous MP3 when ready. Useful for multi-scene projects: intro narration → dialogue scene 1 → narrator break → dialogue scene 2 → outro.
Output: One continuous MP3 with all clips and pauses baked in.
Audiobook
When to use: Full novels. Long chapters. Course modules. Anything past Single's 25k-char cap that doesn't need per-line voice swapping.
Paste your entire chapter or novel into the Script box, give it a title, pick a voice and output format. The job runs server-side on Azure's batch synthesis API (typically half the audio length in wall-clock time — 1 hour of audio = ~30 min job). When done, you get a ZIP of per-chunk MP3s + SRT subtitles. Auto-detects SSML markup, so you can mix in <break/>, <prosody>, <emphasis>, and <mstts:express-as style="..."> tags per paragraph.
Output: ZIP of MP3s + SRTs. Available for download from the Recent batches panel.
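For planning, the two rules of thumb in this guide (job time is roughly half the audio length, and chunks are capped at about 10,000 characters) can be turned into quick estimates. This is a rough sketch only: the function names are made up for illustration, and the real chunker may split at different boundaries, so treat the numbers as ballpark figures.

```python
import math

APPROX_CHUNK_CHARS = 10_000  # approximate per-chunk cap quoted in this guide


def estimate_chunks(total_chars: int) -> int:
    """Rough number of MP3 chunks to expect in the ZIP."""
    if total_chars <= 0:
        return 0
    return math.ceil(total_chars / APPROX_CHUNK_CHARS)


def estimate_job_minutes(audio_minutes: float) -> float:
    """Rough wall-clock time: about half the audio's playback length."""
    return audio_minutes / 2
```

For example, a 200,000-character book works out to about 20 chunks, and 60 minutes of audio to about a 30-minute job.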
At a glance
| Need | Use | Why |
|---|---|---|
| Test a voice on real text | Single | Per-click control, full style/pitch/rate sliders. |
| One narrator, 50+ pages | Audiobook | Single's 25K cap is the wall. Audiobook does up to 2M. |
| Multiple characters speaking | Dialogue | Server-side mux into one MP3 with proper pauses. |
| Chain Single clips with pauses | Timeline | Editor-style ordering + merge. |
| Chain Dialogue scenes with intros | Dialogue → Timeline | Render scenes in Dialogue, chain in Timeline. |
| Per-paragraph emotion (sad / cheerful / whispering) | Audiobook (SSML) | Wrap paragraphs in <mstts:express-as style="...">. |
| Custom pauses between paragraphs | Audiobook (SSML) | Use <break time="1.5s"/>. |
| Programmatic generation from scripts | Developer API | See /developers. |
Common workflows
SSML quick reference
SSML is the markup language inside the text input that controls pauses, emphasis, pitch, and emotion. It works in Single, Dialogue, and Audiobook. All tags lowercase only.
| Tag | What it does |
|---|---|
| `<break time="1s"/>` | Insert a pause. Use 2s, 500ms, etc. |
| `<emphasis level="moderate">word</emphasis>` | Stress a word or phrase. Levels: reduced / moderate / strong. |
| `<prosody rate="slow" pitch="-2st">text</prosody>` | Tempo and pitch. Rates: x-slow / slow / medium / fast / x-fast. Pitch in semitones (+2st, -2st). |
| `<mstts:express-as style="cheerful">text</mstts:express-as>` | Emotional style. Options: cheerful, sad, excited, calm, whispering, angry, hopeful, friendly, terrified, shouting, unfriendly. |
SSML deep reference
The cheatsheet above handles most projects. Below is the full surface FreeTTS supports — broken into categories so you can find what you need. All examples are copy-pastable.
Two ways to insert silence. `<break>` is the W3C standard. `<mstts:silence>` is Azure-only but gives you finer placement control.
All three attributes accept named keywords, relative percentages, or absolute values. Combine them in a single `<prosody>` tag.
Three ways to fix mispronunciations. Phoneme is the most precise; sub is the easiest; pronunciation dictionary (PRO) is permanent across all your generations.
Forces the engine to read text as a specific kind of data. Without `<say-as>`, the engine guesses ("1234" might be read as "one thousand two hundred thirty-four" or "one two three four" depending on context).
Use `<lang>` to read a foreign phrase in its native pronunciation without switching voices entirely. The voice has to support the target language for this to sound right.
`<mstts:express-as>` is Azure's emotional style tag. Not every voice supports every style — see the "All expressive styles" section below for the complete catalog. Combine with `styledegree` (0.01–2.0) and `role` (for certain Chinese voices).
Optional but useful for long-form audiobooks, captions, and engines parsing your output.
Mix a background audio track (music, ambience, white noise) under the synthesized speech. Audio file must be publicly accessible HTTPS.
(1) SSML tags are case-sensitive: `<Break/>` won't work, only `<break/>`. (2) `&`, `<`, and `>` in your text must be escaped (`&amp;`, `&lt;`, `&gt;`) when inside SSML. The Studio escapes them for you in plain-text mode but not once you start adding tags. (3) DragonHD voices (the `:DragonHDLatestNeural` ones) reject `<mstts:express-as>` tags; they read emotion from context instead.
All expressive styles
`mstts:express-as` styles FreeTTS supports
Not every voice supports every style. The Studio's style chip picker (in Single mode) shows what your selected voice supports. Aria, Jenny, and the Chinese voices Xiaoxiao/Yunxi have the widest catalogs. British and most localized voices typically support just cheerful + sad.
| Style | Description |
|---|---|
| cheerful | Bright, upbeat. Default pick for happy moments and positive announcements. |
| sad | Slower, lower pitch, downward inflection. Use for grief, regret, somber news. |
| angry | Harder consonants, elevated volume. Adversarial dialogue, frustration. |
| excited | Faster, higher energy than cheerful. Big-reveal moments, action scenes. |
| fearful | Trembling, slightly higher pitch, breathy. Suspense and horror. |
| terrified | Extreme of fearful: short bursts, rapid breaths, top of the register. |
| hopeful | Warm, slightly tentative, rising inflection. Bridging sad → cheerful. |
| disgruntled | Annoyed but restrained. Sarcasm, minor complaints. |
| embarrassed | Hushed, slightly halting. Apologies, awkward moments. |
| serious | Steady, even-paced, low expression. News commentary, formal narration. |
| calm | Smooth, lower energy than serious. Meditation, instructions, ASMR adjacent. |
| whispering | Breathy, very low volume. Intimate scenes, secrets, ASMR. Best paired with prosody volume='soft'. |
| shouting | Maximum volume + clipped delivery. Battle scenes, distant calls. Use sparingly; listening fatigue is real. |
| gentle | Soft, mid-pitch, evenly paced. Children's storytelling, comforting. |
| lyrical | Slight musical inflection. Poetry, song-like delivery. |
| friendly | Warm and approachable. Default for tutorials, onboarding, support content. |
| unfriendly | Cold, dismissive. Antagonist dialogue, hostile NPC. |
| empathetic | Soft, slow, validating tone. Customer-service apologies, sensitive subjects. |
| chat | Casual, mid-energy. Podcast-style discussion, informal updates. |
| assistant | Polite, helpful, neutral. Built for AI assistants and voice UIs. |
| customerservice | Professional friendly. Phone-system-style helpful answers. |
| narration-professional | Clean audiobook-style narration. The default 'just read it well' choice for long-form. |
| narration-relaxed | Looser pacing than professional. Personal essays, memoir. |
| documentary-narration | Authoritative, evenly paced. Educational video voiceovers, science explainers. |
| newscast | Generic newscaster tone. Use the more specific casual/formal variants if available on your voice. |
| newscast-casual | Approachable newscaster. Feature segments, morning shows. |
| newscast-formal | Traditional newscaster gravitas. Breaking news, formal reports. |
| poetry-reading | Slower pacing with deliberate emphasis on cadence. Verse, prose poetry. |
| advertisement-upbeat | Punchy, energetic commercial reading. Product launches, promos. |
| sports-commentary | Faster pace, dynamic stress. Sports calls, live event narration. |
| sports-commentary-excited | Sports-commentary cranked up. Game-winning moments. |
Voice tiers
FreeTTS exposes 5 voice tiers under one picker. Knowing which tier a voice belongs to matters because they have different SSML quirks and different ideal use cases.
Legacy concatenative voices. Recognizable robotic edge. Mostly retired in favor of Neural — kept for niche use cases.
When to use: Almost never. The free Neural tier sounds dramatically better at the same cost.
SSML: Full SSML.
Standard neural voices — 400+ across 75+ languages. Natural intonation, expressive on supported styles, fast generation. The workhorse tier.
When to use: Default choice for everything. Tutorials, podcasts, video voiceover, casual audiobooks.
SSML: Full SSML including `<mstts:express-as>` for supported styles.
Single voice that speaks 12+ languages with the same voice identity. Best for code-switching content — your French phrase doesn't suddenly become someone else.
When to use: Bilingual narration, language learning content, scripts with embedded foreign phrases.
SSML: Full SSML. Pairs well with `<lang xml:lang="...">` for mid-sentence language switches.
High-definition voices. Higher audio bitrate and noticeably more natural prosody. Look for "HDLatest" in the voice name.
When to use: Audiobook narration, professional voiceover, anywhere quality matters more than generation speed.
SSML: Full SSML support.
Newest-generation Microsoft voices (released 2025). Reads context naturally — automatically expresses emotion from the text without explicit style tags. Look for ":DragonHDLatestNeural" suffix.
When to use: Highest-quality narration when you want emotion to come from the writing itself, not from markup.
SSML: Most SSML works except `<mstts:express-as>` style tags (DragonHD rejects them and reads emotion from context instead). Other tags such as `<prosody>` and `<break>` work normally.
Output formats
| Format | Specs | Size | When to use |
|---|---|---|---|
| MP3 | 160 kbps mono · 24 kHz | ~1.2 MB per minute | Default. Universal playback, smallest files, good quality. Lossy compression. Final-mix-then-export workflows lose a touch of quality each re-encode. |
| WAV | 16-bit PCM · 24 kHz mono | ~2.9 MB per minute | Professional editing in Audition, Pro Tools, Reaper. Mastering. Re-encoding to any other format without quality loss. Uncompressed. Ships at studio-friendly specs. Use this if the file goes into a DAW. |
| OGG | Opus · 24 kHz | ~1.0 MB per minute | Web playback in `<audio>` elements when Safari support is not a concern. Smallest files at equivalent quality to MP3. Open-source codec. Modern browsers all support it; older devices may not. |
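The MP3 and WAV sizes in the table follow directly from the stated specs. The arithmetic below is a small sketch (the helper names are illustrative, and 1 MB is taken as 1,000,000 bytes); it ignores container overhead such as headers and ID3 tags, so real files will be slightly larger.

```python
def mb_per_minute_from_bitrate(kbps: float) -> float:
    """Size of one minute of constant-bitrate audio, in MB (10^6 bytes)."""
    bytes_per_second = kbps * 1000 / 8  # kilobits per second -> bytes per second
    return bytes_per_second * 60 / 1_000_000


def mb_per_minute_pcm(sample_rate_hz: int, bit_depth: int, channels: int = 1) -> float:
    """Size of one minute of uncompressed PCM audio, in MB (10^6 bytes)."""
    bytes_per_second = sample_rate_hz * (bit_depth / 8) * channels
    return bytes_per_second * 60 / 1_000_000
```

With the table's values, 160 kbps MP3 gives 1.2 MB per minute, and 16-bit 24 kHz mono PCM gives 2.88 MB per minute, which matches the rounded ~2.9 MB figure.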
Voice recommendations
Tested-and-recommended picks for common projects. Preview each in the voice picker before committing — taste is individual.
| Pick | Notes |
|---|---|
| en-US-Andrew:DragonHDLatestNeural | Top pick. DragonHD reads emotion from context, which is exactly what you want for fiction. |
| en-US-AvaMultilingualNeural | Use for mixed-language books. Same voice identity across English, Spanish, French, etc. |
| en-US-JennyNeural | Solid default neural. Wide expressive style support if you want manual control. |
| en-GB-LibbyNeural | British narration for UK-set or period-set fiction. |
| en-US-AriaNeural with newscast-formal | Authority + clarity. Used by many news automations. |
| en-US-BrandonNeural | Male newscaster cadence. |
| en-US-DavisNeural | Mid-energy, factual. |
| en-US-GuyNeural with style='chat' | Casual, mid-energy. Sounds like a real podcaster. |
| en-US-JennyMultilingualNeural | Approachable warm female voice. Multilingual flexibility. |
| en-US-Christopher:DragonHDLatestNeural | DragonHD male. Natural inflection without style hacking. |
| en-US-AriaNeural with documentary-narration | Standard for science explainers and tutorials. |
| en-US-Emma:DragonHDLatestNeural | DragonHD female. Clean, even-paced explainer voice. |
| en-US-AnaNeural | Younger-sounding voice. Designed for kid-friendly content. |
| en-US-JennyNeural with style='gentle' | Warm storytelling tone. |
| Mix any two contrasting voices in Dialogue mode | Pick voices with distinctly different pitches and accents. A young female + an older male reads more clearly than two similar voices. |
| Add 'role' attribute for Chinese voices | zh-CN-XiaomoNeural, zh-CN-XiaoxuanNeural, etc. can switch between Boy/Girl/YoungAdult/OlderAdult roles for variety from one voice. |
Gotchas
| Wrong | Right | Why |
|---|---|---|
| `<Break time="1s"/>` | `<break time="1s"/>` | SSML is case-sensitive. Capitalized tags don't parse and get spoken aloud as text, or fail the whole chunk. |
| `Tom & Jerry decided to go.` | `Tom &amp; Jerry decided to go.` | Inside SSML, `&` starts an XML entity. The fix is `&amp;` (or a plain `&` outside any SSML block in PlainText mode). Same goes for `<` (use `&lt;`) and `>` (use `&gt;`) inside SSML. |
| `<mstts:express-as style="cheerful">…</mstts:express-as>` (on en-US-Andrew:DragonHDLatestNeural) | Just write expressive text. DragonHD reads emotion from context. | DragonHD voices explicitly reject `mstts:express-as` tags. Strip them out or switch to a Neural voice if you need explicit style control. |
| `<mstts:express-as style="poetry-reading">…</mstts:express-as>` (on en-US-GuyNeural) | Check the per-voice supported styles. Aria has the most; British voices typically only cheerful + sad. | Unsupported styles are silently ignored, so the audio plays neutral instead of poetic. Use the Studio's chip picker to see which styles your selected voice supports. |
| Missing mstts namespace (in raw SSML files outside the Studio) | `xmlns:mstts="http://www.w3.org/2001/mstts"` on `<speak>` | If you're writing raw SSML for the API, the `<speak>` root must declare the mstts namespace before any `mstts:` tag will parse. Studio's Audiobook tool wraps your text automatically, so this only matters for direct API users. |
| `<voice name="..."> <mstts:backgroundaudio .../> ... </voice>` | `<speak> <mstts:backgroundaudio .../> <voice name="...">...</voice> </speak>` | Background audio is per-document, not per-voice. Place it directly inside `<speak>`, before the `<voice>` block. Only one allowed per document. |
| `<audio src="https://example.com/clap.mp3"/>` | Audiobook batch does not support inline audio injection. Use Timeline mode to chain pre-generated clips, or `mstts:backgroundaudio` for a single underlay track. | Azure's real-time API supports inline `<audio>` but the batch synthesis path we use for Audiobook mode does not. Plan accordingly for long-form projects. |
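Two of the gotchas above (unescaped `&`, `<`, `>` and a missing mstts namespace) can be caught locally before submitting raw SSML. The following is a minimal sketch using only the Python standard library. The helper names are hypothetical, the mstts namespace value is the one quoted in the table above, and a successful parse only shows the XML is well-formed, not that the speech service will accept it.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

SYNTHESIS_NS = "http://www.w3.org/2001/10/synthesis"  # W3C SSML namespace
MSTTS_NS = "http://www.w3.org/2001/mstts"  # value as quoted in the gotchas table


def escape_text(text: str) -> str:
    """Escape &, < and > so plain text is safe to place inside SSML."""
    return escape(text)


def wrap_speak(body: str, voice: str, lang: str = "en-US") -> str:
    """Wrap an SSML body in a <speak> root that declares the mstts namespace."""
    return (
        f'<speak version="1.0" xmlns="{SYNTHESIS_NS}" '
        f'xmlns:mstts="{MSTTS_NS}" xml:lang="{lang}">'
        f'<voice name="{voice}">{body}</voice></speak>'
    )


def is_well_formed(ssml: str) -> bool:
    """Return True if the document parses as XML."""
    try:
        ET.fromstring(ssml)
        return True
    except ET.ParseError:
        return False
```

Escape the text first, then add tags around it; escaping a string that already contains SSML tags would escape the tags too.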
Pro tips
Hard caps
| Limit | Free | PRO ($19/mo) | Creator ($39/mo) |
|---|---|---|---|
| Per-generation chars (Single) | 1,000 | 10,000 | 25,000 |
| Monthly chars (total) | 5,000 | 1,000,000 | 5,000,000 |
| Audiobook batch (per job) | — | 2,000,000 | 2,000,000 |
| Concurrent batch jobs | — | 3 | 3 |
| HD voices | Preview only | ✓ | ✓ |
| DragonHD voices | Preview only | ✓ | ✓ |
| Multilingual voices | Preview only | ✓ | ✓ |
| Dialogue mode | — | ✓ | ✓ |
| Timeline editor | — | ✓ | ✓ |
| Audiobook batch | — | — | ✓ |
| Voice cloning | — | — | 1 clone · 100K cloned chars/mo |
| Pronunciation dictionary | — | ✓ | ✓ |
| Generation history | — | 30 days | 30 days |
| Commercial license | — | ✓ | ✓ |
| WAV / OGG export | MP3 only | ✓ | ✓ |
| SRT subtitle export | ✓ | ✓ | ✓ |
| Free API key | — | ✓ | ✓ |
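If you prepare text outside the Studio, a quick length check against the per-generation caps in the table can save a failed Generate click. This is an illustrative sketch: the plan keys and function names are made up here, and it assumes the cap is applied to the raw character count, which may not match how FreeTTS counts characters (for example, whether SSML tags are included).

```python
# Per-generation character caps for Single mode, copied from the table above.
SINGLE_CAPS = {"free": 1_000, "pro": 10_000, "creator": 25_000}


def fits_single(text: str, plan: str) -> bool:
    """True if the text is within the plan's Single per-generation cap."""
    return len(text) <= SINGLE_CAPS[plan]


def chars_over_cap(text: str, plan: str) -> int:
    """How many characters need trimming (0 if the text already fits)."""
    return max(0, len(text) - SINGLE_CAPS[plan])
```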
Pronunciation
Three escalation paths — pick the smallest one that works for your case.
If a word is mispronounced once or twice, rewrite the spelling so the engine reads it correctly. Camarath → kuh-MARE-uth. Ugly in the transcript but fixes one-offs without any SSML.
`<sub alias="...">`: Replace what the engine sees with what you want it to say.
`<sub alias="Doctor">Dr.</sub> Watson said hello.`
Use this for abbreviations and initialisms the engine misreads. The transcript still shows Dr.; only the audio uses Doctor.
`<phoneme alphabet="ipa">`: Force exact pronunciation with International Phonetic Alphabet symbols.
`<phoneme alphabet="ipa" ph="kəˈmɛrəθ">Camarath</phoneme>`
The `ph` attribute holds the IPA. The displayed text is preserved; only the audio uses the pronunciation you supply. If you don't know IPA, the dashboard's Pronunciation Dictionary has a "Hear it" button that lets you audition different IPA strings until one sounds right.
Add Camarath → kəˈmɛrəθ once in the dashboard (Studio settings → Pronunciation). Every future generation across all modes uses it. Much cleaner than wrapping every instance in <phoneme>. Up to 500 mappings per account.
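If you work without the dashboard dictionary (for instance, when preparing raw SSML yourself), a similar effect can be approximated by wrapping each mapped word in `<phoneme>` before submitting. The sketch below is an assumption-laden illustration, not how the Pronunciation Dictionary is implemented; `apply_phonemes` is a hypothetical helper, it matches whole words case-sensitively, and it expects text that contains no other SSML tags yet.

```python
import re
from typing import Dict


def apply_phonemes(text: str, mapping: Dict[str, str]) -> str:
    """Wrap every whole-word match of a mapped word in an IPA <phoneme> tag."""
    if not mapping:
        return text
    # Longest words first so a longer entry wins over a shorter prefix.
    words = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(w) for w in words) + r")\b")

    def wrap(match: "re.Match[str]") -> str:
        word = match.group(1)
        return f'<phoneme alphabet="ipa" ph="{mapping[word]}">{word}</phoneme>'

    return pattern.sub(wrap, text)
```

Usage: `apply_phonemes("Camarath fell.", {"Camarath": "kəˈmɛrəθ"})` wraps the name and leaves the rest of the sentence untouched.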
| Sound | IPA | Example word |
|---|---|---|
| "a" in cat | æ | cæt |
| "a" in father | ɑ | fɑther |
| "e" in bed | ɛ | bɛd |
| "i" in machine | i | mishin |
| "i" in bit | ɪ | bɪt |
| "o" in note | oʊ | noʊt |
| "u" in moon | u | mun |
| schwa (the most common vowel) | ə | əbout |
| "th" in thin | θ | θin |
| "th" in this | ð | ðis |
| "sh" in shoe | ʃ | ʃoo |
| "zh" in measure | ʒ | meaʒure |
| Primary stress (placed before stressed syllable) | ˈ | ˈcamera |
Voice cloning
Train a voice that sounds like you (or any voice you have rights to record) from 60 seconds of audio. Once trained, your custom voice appears in the picker like any other voice — works in Single, Dialogue, Timeline, and Audiobook.
Only clone voices you own (yourself, voice actors who've consented in writing, public-domain recordings older than ~95 years). FreeTTS audits for likely-unauthorized clones and removes them. Cloning a celebrity, a podcast host, or any non-consenting voice violates the terms of service.
FAQ
Single. It has the gentlest learning curve — type text, pick a voice, hit Generate. Most users only need Single to find their preferred voice and pacing. Move on to Dialogue when you need multiple characters, Timeline when you want to chain clips, and Audiobook when your project is too long for Single's 25,000-char cap.
Yes. Every Generate click in Single counts. Voice previews (the small play button next to a voice in the picker) do NOT count — those use a separate quota-free endpoint. Test as many voices as you want via Preview without burning chars.
Not directly. Dialogue muxes the conversation server-side and returns one merged file. If you need separate files per speaker, generate each speaker's lines in Single mode (switching voices between them), then add each one to Timeline for chaining. It's slower but gives you full control over the individual audio files.
Timeline assembles audio you already generated. Audiobook generates new audio from text. Use Timeline when you want to merge specific clips (e.g., chain three dialogue scenes plus a narrator intro). Use Audiobook when you want to give it a full chapter and get back a finished ZIP. They serve different jobs — Timeline is editor-style, Audiobook is batch-pipeline-style.
Audiobook splits long text into chunks (Azure's batch API caps each chunk at ~10,000 chars). The ZIP contains one MP3 per chunk plus SRT subtitles. If you want a single merged file, you can either run the chunks through Timeline (add them all + merge) or use an external tool like ffmpeg / Audacity.
Yes — Single and Dialogue both accept SSML. The most useful tags are <break time="1s"/>, <emphasis level="strong">, <prosody rate="slow">, and <mstts:express-as style="cheerful">. Audiobook auto-detects SSML too (as of May 2026). All tags are lowercase — <Break/> with a capital B won't work.
Each Audiobook submission writes a single row to your history with the title and total chars, not one row per internal chunk. The History table's Source column will label it 'Audiobook' so you can tell it apart from Single (web), Dialogue (studio), and API entries.
Azure's batch service typically completes in roughly half the audio's playback length. A 1-hour audiobook takes ~30 minutes. Watch the dashboard 'Recent batches' panel for live progress. The job runs server-side, so you can close the tab — you'll see the finished link on next visit.
HD voices (like en-US-Aria:HDLatestNeural) are higher-bitrate versions of standard neural voices — same expressive control via mstts:express-as, just better audio quality. DragonHD voices (like en-US-Andrew:DragonHDLatestNeural) are a newer-generation model from Microsoft that automatically reads emotion from the text itself. You write "She gasped in horror" and DragonHD delivers it dramatically without you needing the terrified style. The catch: DragonHD rejects mstts:express-as tags entirely, so manual style control is gone. Use HD when you want explicit control; use DragonHD when you want natural delivery from natural writing.
Three common reasons. First, your voice doesn't support that style — only certain voices have specific styles (the Studio chip picker shows what your selected voice supports). Second, you're on a DragonHD voice, which rejects express-as tags entirely. Third, you used the style outside an mstts:express-as wrapper. Audiobook auto-wraps the SSML envelope for you, so this only happens if you're writing raw SSML for the API.
Yes. The homepage /api/tts endpoint and the browser extension both build a full SSML envelope from your text. Drop in <break/>, <emphasis>, <prosody>, even mstts:express-as — they all work on the homepage if the underlying voice supports them. The 1,000-char free-tier per-generation cap still applies though.
styledegree controls how intense the emotional style is. Default is 1.0. Range is 0.01 (barely-there hint) to 2.0 (cranked up). For most narration the default works fine. Bump to 1.5 or 2.0 for dramatic moments (a battle cry, a grief explosion). Drop to 0.5 for subtle inflection (a hint of sadness under a brave face). Add it as an attribute: <mstts:express-as style="sad" styledegree="0.5">.
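When generating SSML from a script, it is easy to pass a `styledegree` outside the 0.01 to 2.0 range described above. A small, hypothetical helper can validate the value before building the tag. It does not check whether the chosen voice supports the style, and it assumes the text has already been XML-escaped.

```python
def express_as(text: str, style: str, styledegree: float = 1.0) -> str:
    """Build an mstts:express-as element, validating the styledegree range."""
    if not 0.01 <= styledegree <= 2.0:
        raise ValueError("styledegree must be between 0.01 and 2.0")
    # ':g' drops trailing zeros, so 0.5 stays "0.5" and 2.0 becomes "2".
    return (
        f'<mstts:express-as style="{style}" styledegree="{styledegree:g}">'
        f"{text}</mstts:express-as>"
    )
```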
Azure caps `<break time>` at 10 seconds per tag. For longer silences, stack multiple breaks: <break time="10s"/><break time="10s"/><break time="5s"/> gives you 25 seconds. Or use mstts:silence at the top of your document to set a global between-sentence silence if you just want spacious pacing throughout.
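The stacking trick above can be automated. This sketch splits a requested silence into consecutive `<break>` tags of at most 10 seconds each, using the per-tag cap stated above as the default; `stacked_breaks` is an illustrative name, and durations are emitted in milliseconds to avoid fractional-second formatting.

```python
def stacked_breaks(total_seconds: float, cap_seconds: float = 10.0) -> str:
    """Return consecutive <break/> tags that add up to total_seconds."""
    ms_left = round(total_seconds * 1000)
    cap_ms = round(cap_seconds * 1000)
    tags = []
    while ms_left > 0:
        step = min(ms_left, cap_ms)  # never exceed the per-tag cap
        tags.append(f'<break time="{step}ms"/>')
        ms_left -= step
    return "".join(tags)
```

Calling `stacked_breaks(25)` yields three tags of 10000ms, 10000ms, and 5000ms, the same 25 seconds as the hand-written example.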
Yes, on the Creator plan. You record 60 seconds of clean audio of yourself talking naturally (no background noise, normal pace), upload it from /dashboard?section=voice-clone, and the system trains a personal voice in about 5 minutes. Once cloned, your voice appears in the picker like any other voice. Creator includes 1 voice clone and 100,000 cloned characters per month. Quality scales with the recording — invest in 60 seconds of really clean audio and the clone will sound impressively close.
Different paths handle invalid SSML differently. Real-time TTS (Single, Dialogue) returns a clear error in the UI telling you what's wrong — usually with the line position. Audiobook batch is less forgiving — if one chunk has invalid SSML, that whole chunk fails and is skipped (the rest succeed). Test SSML in Single mode first if you're adding it to a long batch — much faster feedback loop.
Yes, three ways. Easiest: pick a Multilingual voice (look for "Multilingual" in the voice name) which can speak ~12 languages with one identity. Most precise: use <lang xml:lang="fr-FR">phrase</lang> inside any voice to switch languages mid-sentence for that span. Most flexible: use Dialogue mode and assign each language to a different voice — you get separate native speakers for each language.
Neural voices have a small amount of natural variation — the same text rendered twice won't be byte-identical even with the same voice and settings. Variation is normally subtle (different breath placement, slight rhythm differences). Larger differences usually come from context — the voice reads a question differently from a statement, an exclamation differently from a declaration. This is a feature, not a bug, and is part of what makes neural voices feel human.
A PRO/Creator feature that stores custom word → pronunciation mappings that auto-apply to ALL your generations. Add 'Camarath' → 'kəˈmɛrəθ' once, and every time you write Camarath in any Studio mode it gets pronounced correctly. Much cleaner than wrapping every instance in <phoneme>. Find it in the dashboard sidebar under Studio settings. Uses International Phonetic Alphabet (IPA) — there's a 'Hear it' button to audition before saving.
Azure's batch synthesis caps each chunk at ~10,000 chars to keep individual jobs fast. A 200K-char audiobook becomes ~20 chunks, each delivered as a separate MP3 in the ZIP along with SRT subtitles. To merge into a single MP3: extract the ZIP, then either (a) use Studio's Timeline mode to add each chunk and merge, or (b) use ffmpeg locally: `ffmpeg -f concat -i list.txt -c copy output.mp3` where list.txt contains "file 'chunk1.mp3'" lines. The ZIP keeps things flexible — chapter-level files are easier to edit and reorganize than one giant MP3.
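Writing `list.txt` by hand gets tedious with 20 or more chunks. The sketch below builds its contents from a list of filenames, sorting them so that `chunk10.mp3` comes after `chunk2.mp3`. The chunk filenames are hypothetical (the actual names inside the ZIP may differ), and the helper names are illustrative.

```python
import re
from typing import List


def natural_key(name: str):
    """Sort key that orders embedded numbers numerically (2 before 10)."""
    return [
        (0, int(part), "") if part.isdigit() else (1, 0, part)
        for part in re.split(r"(\d+)", name)
    ]


def build_concat_list(filenames: List[str]) -> str:
    """Return the text of an ffmpeg concat list, one file per line."""
    lines = []
    for name in sorted(filenames, key=natural_key):
        # Inside single quotes, ffmpeg expects a literal quote written as '\''
        safe = name.replace("'", "'\\''")
        lines.append(f"file '{safe}'")
    return "\n".join(lines) + "\n"
```

Write the returned string to `list.txt` in the folder holding the extracted MP3s, then run the ffmpeg command shown above from that folder.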
Yes — that's exactly what Dialogue is for. Write your script as `Speaker: line of dialogue` (one per row). Assign each speaker a voice in the picker. Dialogue mode auto-generates the multi-voice SSML for you. Per-line style overrides are also supported via the small style dropdown next to each line. You only need to drop into raw SSML if you want effects beyond voice + style (custom pauses, prosody, etc.).
All three are generated from the same 24 kHz neural synthesis source. MP3 (default, 160 kbps) is universal and small. WAV is uncompressed PCM — same source quality, ~2.5x file size, ideal for re-editing in a DAW. OGG/Opus is the most modern, smallest at equivalent quality, but older browsers and devices don't support it. Pick MP3 unless you have a specific reason to use the others.