Most TTS sounds robotic not because the technology is bad, but because the input is written for eyes, not ears. This guide covers voice selection, SSML tags, punctuation tricks, speed settings, and the mistakes that give AI voice away every time.
Free tier: 5,000 chars/month. SSML available on PRO ($19/mo) and Creator ($39/mo).
What you are fighting
TTS sounds robotic when the input text lacks natural rhythm, when the wrong voice is chosen for the content type, or when punctuation does not match how the words would be spoken aloud. All three are fixable without advanced tools.
Every sentence the same length. It reads like a list, not speech. Natural talking alternates short sentences with longer ones.
Text without commas or paragraph breaks runs together. The voice sounds rushed and breathless.
A cheerful voice reading a medical guide. A monotone voice reading a children's story. The mismatch undermines every word.
"15.2%" becomes "fifteen point two percent" sometimes. Or "one five point two percent sign." Depends on the engine.
Brand names, people's names, and technical terms trip up every TTS engine. The engine guesses from spelling and often guesses wrong.
Standard voices are built from concatenated speech segments. No amount of SSML fixes the fundamental mechanical quality.
Too slow: sounds like the voice does not understand what it is reading. Too fast: words blur together. Both feel unnatural.
The techniques, step by step
Apply these in order. The first three give you most of the improvement without any SSML. Steps four through seven add SSML control. Steps eight through ten are production workflow.
Neural voices are categorically different from standard TTS. Standard voices use concatenated speech segments and sound robotic regardless of what you do with speed or SSML. Neural voices model the entire prosodic pattern of speech at once. If your output sounds robotic and you have not checked whether you are using a neural voice, check that first.
Beyond neural vs standard, the voice character matters. A warm female voice at 0.9x carries emotional content like meditation guides and therapy worksheets. A crisp neutral male voice at 1.2x carries information-dense material like research summaries and legal briefs. A faster, energetic voice carries short-form content for social media. Mismatching voice character to content type is the most common mistake beginners make.
FreeTTS has a voice gallery where you can preview all 400+ voices at different speeds before generating. Spend two minutes there before you commit to a voice.
The TTS engine does not know what you mean. It reads what you wrote. Punctuation is the only tool you have for controlling rhythm without SSML markup. A period creates a short full stop. A comma creates a brief pause within a thought. A question mark changes the inflection at the end of a sentence. Three periods (...) create a longer dramatic pause in many engines.
Write the way you want it to sound, not the way you would write an essay. Break long sentences into two. Add a comma where you want a breath. End with a question when you want the voice to rise. Read your text out loud before you generate. If it sounds wrong when you say it, it will sound wrong when the engine says it.
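The advice to break long sentences in two can be checked mechanically before you generate. The sketch below is one rough way to do that: it counts words per sentence and flags anything over a threshold. The function names and the 20-word threshold are assumptions for illustration, not part of any TTS tool, and the sentence splitter is deliberately naive.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive split on ., ! or ? followed by whitespace (abbreviations will confuse it)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentence_lengths(text: str) -> list[int]:
    """Word count of each sentence, in order."""
    return [len(s.split()) for s in split_sentences(text)]


def long_sentences(text: str, max_words: int = 20) -> list[str]:
    """Sentences longer than max_words (the threshold is an assumed starting point)."""
    return [s for s in split_sentences(text) if len(s.split()) > max_words]


if __name__ == "__main__":
    draft = "Short one. This is a longer sentence here. Why?"
    print(sentence_lengths(draft))
```

If every number in the list is about the same, the script probably has the uniform rhythm described earlier. Reading it aloud is still the better test.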
Speed affects naturalness more than any other single setting. Too slow and the voice sounds like it is reading cautiously from a script it does not understand. Too fast and words blur together. The natural speaking rate for conversational English is 130 to 150 words per minute, which is roughly 1.0x on most TTS engines.
But 1.0x is not always right. Long-form educational content sounds more approachable at 0.9x. Short-form punchy content for video sounds more alive at 1.1x to 1.3x. News and podcast content settles at around 1.1x. Children's content works at around 0.85x. The speed that sounds most natural is the speed your target audience would expect from a human presenter of that same content.
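To see what a speed setting does to running time, a back-of-the-envelope estimate helps. The sketch below assumes 1.0x corresponds to 140 words per minute (the midpoint of the 130 to 150 range above) and that the speed multiplier scales the rate linearly. Both are simplifying assumptions, and real engines will differ by voice.

```python
def estimated_seconds(text: str, speed: float = 1.0, base_wpm: float = 140.0) -> float:
    """Rough audio length in seconds.

    base_wpm is an assumed rate at 1.0x; speed is the engine's multiplier.
    """
    if speed <= 0 or base_wpm <= 0:
        raise ValueError("speed and base_wpm must be positive")
    words = len(text.split())
    return words / (base_wpm * speed) * 60.0


if __name__ == "__main__":
    script = " ".join(["word"] * 280)
    print(round(estimated_seconds(script, speed=1.0)))
    print(round(estimated_seconds(script, speed=0.9)))
```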
Silence is what makes speech sound human. Humans pause at the end of a thought, before a major point, and after an important word. TTS without pause tags can run sentences together and lose the breathing rhythm that makes spoken language comfortable to listen to.
Use short breaks (100-300ms) after commas and between clauses when you want a breath but not a full stop. Use medium breaks (400-600ms) between paragraphs or topics. Use long breaks (600-900ms) for dramatic effect before a key reveal or after a strong statement.
<speak> Welcome. <break time="400ms"/> Today we cover the three biggest mistakes. <break time="500ms"/> And how to fix all of them in under ten minutes. </speak>
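If you build scripts programmatically, the break tags can be generated rather than typed. This is a minimal sketch using only the Python standard library. The `BREAK_MS` values are assumed picks from inside the short, medium, and long ranges above, and `join_with_breaks` is a hypothetical helper name.

```python
from xml.sax.saxutils import escape

# Assumed values chosen from within the ranges described above.
BREAK_MS = {"short": 200, "medium": 500, "long": 750}


def join_with_breaks(parts: list[str], size: str = "medium") -> str:
    """Join text chunks with an SSML break between each pair and wrap in <speak>."""
    pause = f'<break time="{BREAK_MS[size]}ms"/>'
    # escape() protects characters like & and < that would break the XML.
    body = f" {pause} ".join(escape(p) for p in parts)
    return f"<speak>{body}</speak>"


if __name__ == "__main__":
    print(join_with_breaks(["Welcome.", "Today we cover the three biggest mistakes."]))
```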
Emphasis changes which word in a sentence carries the most weight. In natural speech, a skilled narrator emphasizes the word that changes the meaning. TTS without emphasis marks treats every word equally and sounds flat.
Use emphasis sparingly. One to three emphasized words per paragraph is usually right. Overusing it makes everything sound exclamation-marked. The strongest use cases are contrast (not X, but Y), key terms the listener needs to remember, and warnings or important caveats.
<speak> The answer is <emphasis level="moderate">not</emphasis> the settings. It is the voice you chose <emphasis level="strong">before</emphasis> you adjusted anything. </speak>
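A small helper can wrap a chosen word in an emphasis tag so the markup stays consistent across a script. The sketch below is illustrative only: `emphasize` is a hypothetical name, it wraps just the first whole-word match, and the accepted levels are the `strong`, `moderate`, and `reduced` values from the SSML specification. Whether a given engine honours each level is something to test.

```python
import re
from xml.sax.saxutils import escape

ALLOWED_LEVELS = {"strong", "moderate", "reduced"}


def emphasize(text: str, word: str, level: str = "moderate") -> str:
    """Wrap the first whole-word occurrence of word in an SSML emphasis tag."""
    if level not in ALLOWED_LEVELS:
        raise ValueError(f"unsupported emphasis level: {level}")
    escaped = escape(text)
    pattern = re.compile(rf"\b{re.escape(escape(word))}\b")
    # count=1 keeps emphasis sparing: only the first match is wrapped.
    return pattern.sub(
        lambda m: f'<emphasis level="{level}">{m.group(0)}</emphasis>',
        escaped,
        count=1,
    )


if __name__ == "__main__":
    print(emphasize("The answer is not the settings.", "not"))
```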
Prosody controls pitch (how high or low the voice sounds), rate (how fast a specific phrase is read), and volume. The key to using it well is applying it locally to a word or phrase, not globally to the entire script. Global rate changes just make everything uniformly fast or slow. Local rate changes create natural variation.
For technical terms or unusual phrases, slow the rate to 80-90% for that phrase only. For exciting announcements, raise pitch slightly by +5% or +10%. For serious warnings, lower pitch by -5% and lower volume slightly for emphasis by contrast.
<speak> Next we get to the <prosody rate="85%">most complex part</prosody>. Take your time with this one. <prosody pitch="+8%" rate="110%">And here is the exciting bit:</prosody> it works automatically. </speak>
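Because prosody works best when applied to a single phrase, it is convenient to wrap phrases one at a time. The sketch below shows one way to do that. `prosody` is a hypothetical helper, and the rate and pitch strings are passed through unchanged, so they must already be values your engine accepts.

```python
from typing import Optional
from xml.sax.saxutils import escape, quoteattr


def prosody(phrase: str, rate: Optional[str] = None, pitch: Optional[str] = None) -> str:
    """Wrap one phrase in a prosody tag; return it unwrapped if no attribute is given."""
    attrs = ""
    if pitch:
        attrs += f" pitch={quoteattr(pitch)}"
    if rate:
        attrs += f" rate={quoteattr(rate)}"
    if not attrs:
        return escape(phrase)
    return f"<prosody{attrs}>{escape(phrase)}</prosody>"


if __name__ == "__main__":
    line = "Next we get to the " + prosody("most complex part", rate="85%") + "."
    print(f"<speak>{line}</speak>")
```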
Brand names, people's names, technical jargon, and foreign words are where every TTS engine stumbles. The engine guesses pronunciation from spelling, and when the spelling does not match the sound (which is extremely common in English), you get something wrong.
SSML phoneme tags let you specify the exact pronunciation using IPA (International Phonetic Alphabet) notation. You do not need to know IPA fluently. You just need to know the target pronunciation and look up the IPA characters for it. There are free online IPA generators for English words.
<speak> Welcome to <phoneme alphabet="ipa" ph="friː.tiː.es">FreeTTS</phoneme>. Today we discuss <phoneme alphabet="ipa" ph="prɒsədi">prosody</phoneme> and how it changes the listener experience. </speak>
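Once you have looked up the IPA for your tricky terms, you can keep them in a small lexicon and apply it to every script. The following is a sketch under stated assumptions: `apply_phonemes` and the lexicon are hypothetical, matching is case-sensitive and whole-word only, and terms that start or end with punctuation will not match.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def apply_phonemes(text: str, lexicon: dict[str, str]) -> str:
    """Wrap each lexicon term found in text in an SSML phoneme tag (IPA alphabet)."""
    if not lexicon:
        return escape(text)
    # Longest terms first so a longer term wins over a shorter one it contains.
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")
    out, pos = [], 0
    # Single pass over the original text, so inserted tags are never re-scanned.
    for match in pattern.finditer(text):
        out.append(escape(text[pos:match.start()]))
        term = match.group(1)
        out.append(
            f'<phoneme alphabet="ipa" ph={quoteattr(lexicon[term])}>{escape(term)}</phoneme>'
        )
        pos = match.end()
    out.append(escape(text[pos:]))
    return "".join(out)


if __name__ == "__main__":
    print(apply_phonemes("Today we discuss prosody.", {"prosody": "prɒsədi"}))
```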
Numbers are a consistent pain point. Write the way you want the engine to speak: write "fifteen point two percent" not "15.2%". Write "two thousand and twenty six" not "2026" if you want it to sound that way. Write "doctor" not "Dr." if you want the full word.
The same applies to acronyms. "AI" reads as two letters A-I in most engines. "NASA" reads as a word. If you want an acronym spoken as individual letters, add periods between them: "A.I." tells the engine to say each letter. If you want an acronym spoken as a word, spell it out phonetically.
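Rewriting numbers and abbreviations by hand gets tedious on long scripts, so a small pre-processing pass can help. The sketch below handles only the cases mentioned above: percentages under 100, "Dr.", and letter-by-letter acronyms. All names are hypothetical, and a production script would need a fuller number-to-words routine.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def int_to_words(n: int) -> str:
    """Spell out an integer from 0 to 99 (the only range this sketch covers)."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])


def _percent_to_words(match: re.Match) -> str:
    words = int_to_words(int(match.group(1)))
    if match.group(2):
        # Decimal digits are read one at a time: "15.25" -> "point two five".
        words += " point " + " ".join(ONES[int(d)] for d in match.group(2))
    return words + " percent"


def normalize(text: str) -> str:
    """Spell out small percentages and expand 'Dr.'; anything else is left alone."""
    text = re.sub(r"(?<![\d.])(\d{1,2})(?:\.(\d+))?%", _percent_to_words, text)
    return text.replace("Dr.", "doctor")


def spell_out_acronym(acronym: str) -> str:
    """'AI' -> 'A.I.' so the engine is nudged to read each letter."""
    return ".".join(acronym) + "."


if __name__ == "__main__":
    print(normalize("Dr. Lee reported 15.2% growth."))
    print(spell_out_acronym("AI"))
```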
Scripts over 500 words need structural breathing points. These are moments in the text where a human reader would take a genuine breath and reset attention. In written text these happen at paragraph breaks. In spoken text they happen at transitions between ideas.
Mark your breathing sections with a blank line in the script and a medium-length break tag before and after. For very long content (1,000+ words), add a 1-second break at the major section transitions. It sounds like the narrator gathering themselves before a new topic. This is exactly what a podcast host sounds like, and it is the simplest thing that separates professionally produced audio from amateur TTS output.
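The blank-line convention described above maps naturally onto a small converter. In this sketch, blank lines mark paragraph breaks and a line containing only `---` marks a major section transition. That `---` marker is an assumption made for illustration, as are the function name and the default 500 ms and 1,000 ms durations.

```python
import re
from xml.sax.saxutils import escape


def script_to_ssml(script: str, paragraph_ms: int = 500, section_ms: int = 1000) -> str:
    """Turn a plain-text script into SSML with breaks at paragraph and section boundaries."""
    paragraph_break = f'\n<break time="{paragraph_ms}ms"/>\n'
    section_break = f'\n<break time="{section_ms}ms"/>\n'
    # A line holding only '---' is the (assumed) section marker.
    sections = re.split(r"^[ \t]*---[ \t]*$", script, flags=re.MULTILINE)
    rendered = []
    for section in sections:
        if not section.strip():
            continue
        # Blank lines separate paragraphs; inner whitespace is collapsed.
        paragraphs = [" ".join(p.split()) for p in re.split(r"\n\s*\n", section.strip())]
        rendered.append(paragraph_break.join(escape(p) for p in paragraphs if p))
    return "<speak>\n" + section_break.join(rendered) + "\n</speak>"


if __name__ == "__main__":
    print(script_to_ssml("Intro one.\n\nIntro two.\n---\nNext topic."))
```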
Generate the first 200 words and listen before you generate the full 2,000-word article. The first paragraph tells you whether the voice, speed, and structure are working. Listening to 30 seconds of output is a much faster feedback loop than generating the whole script and discovering the voice is wrong on word 1,800.
Also test the tricky parts specifically: the brand names, the acronyms, the numbers, the unusual words. Generate those phrases in isolation and listen. Fix the pronunciation before you build the whole piece around a voice that says your brand name wrong every time.
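To generate only the opening of a long script for a test listen, you can cut it at a sentence boundary near the 200-word mark. The helper below is a rough sketch: `preview` is a hypothetical name, the sentence split is naive, and the first sentence is always kept even if it exceeds the limit.

```python
import re


def preview(text: str, max_words: int = 200) -> str:
    """Return the leading whole sentences of text, staying within max_words where possible."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chosen, count = [], 0
    for sentence in sentences:
        words = len(sentence.split())
        # Always keep the first sentence so the preview is never empty for non-empty text.
        if chosen and count + words > max_words:
            break
        chosen.append(sentence)
        count += words
    return " ".join(chosen)


if __name__ == "__main__":
    article = "One two three. Four five six. Seven eight."
    print(preview(article, max_words=6))
```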
Speed by content type
These are starting points. Your audience, the density of the information, and your specific voice choice all affect the right speed. Test the first 60 seconds of your content at the recommended speed, then adjust up or down by 0.05x increments.
| Content type | Recommended speed | Notes |
|---|---|---|
| Meditation / grounding script | 0.85x | Slow creates space for the listener to follow |
| Children's educational content | 0.85x to 0.9x | Clarity over pace for young listeners |
| Therapy worksheets | 0.9x to 1.0x | Warm and unhurried for emotional content |
| Standard narration | 1.0x to 1.1x | Baseline conversational speech rate |
| Podcast / educational audio | 1.05x to 1.15x | Slightly faster feels produced, not generated |
| News / commentary | 1.1x to 1.3x | Authoritative pace for dense information |
| Short-form video voiceover | 1.2x to 1.4x | Punchy and keeps attention |
| Familiar content re-listen | 1.5x to 2.0x | Comprehension holds when you already know the material |
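The table and the 0.05x adjustment rule can be turned into a list of speeds to audition. The sketch below copies the ranges from the table. The dictionary keys and the `candidate_speeds` helper are hypothetical names chosen for illustration.

```python
# Ranges copied from the table above; the keys are illustrative labels.
SPEED_RANGES = {
    "meditation": (0.85, 0.85),
    "children": (0.85, 0.90),
    "therapy": (0.90, 1.00),
    "narration": (1.00, 1.10),
    "podcast": (1.05, 1.15),
    "news": (1.10, 1.30),
    "short_form_video": (1.20, 1.40),
    "relisten": (1.50, 2.00),
}


def candidate_speeds(content_type: str, step: float = 0.05) -> list[float]:
    """Speeds to test for a content type, from low to high in step increments."""
    low, high = SPEED_RANGES[content_type]
    count = round((high - low) / step)
    # round(..., 2) removes floating-point noise such as 1.1500000000000001.
    return [round(low + i * step, 2) for i in range(count + 1)]


if __name__ == "__main__":
    print(candidate_speeds("podcast"))
```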
Side by side
The same information, written two ways. One sounds human. One sounds like a report was pasted into a TTS engine.
Last updated May 2026. SSML specification references: W3C Speech Synthesis Markup Language 1.1. Neural TTS quality research: Microsoft Azure Cognitive Services voice documentation. Speaking rate research: Tauroza and Allison (1990).