Everything you need to know about turning text into natural sounding audio. The history, the technology, the voices, and the languages. All in one place.
Skip the reading and go straight to converting. 400+ voices, 75+ languages, free MP3 download. Takes about 10 seconds.
Open FreeTTS →
Text to speech didn't start with Siri or Alexa. It didn't even start with computers. The idea of making machines talk has been around for over 250 years. And the journey from mechanical bellows to neural networks that sound like actual humans? It's wilder than most people realize.
Hungarian inventor Wolfgang von Kempelen built a mechanical device using bellows, reeds, and a rubber mouth to simulate human speech. It could produce vowels and some consonants. Creepy? Absolutely. Groundbreaking? Also absolutely.
Bell Labs demonstrated the Voder at the 1939 World's Fair. An operator used a keyboard and a foot pedal to control electronic circuits that produced speech sounds. It took months of training to operate and sounded like a robot having an existential crisis. But it proved electronic speech was possible.
In 1968, Noriko Umeda in Japan created one of the first rule based computer TTS systems. That same year, HAL 9000 in "2001: A Space Odyssey" gave people a very specific idea of what computer speech should sound like. Reality was much less terrifying. And much less clear.
Ray Kurzweil built a machine that could scan printed text and read it aloud for blind users. It was the size of a washing machine and cost $50,000. But it worked, and it genuinely changed lives. Stevie Wonder was one of the first customers.
The original Mac shipped with MacinTalk, built in text to speech. "Hello, I'm Macintosh" was the first time most people heard a personal computer talk. The voice quality was terrible by today's standards, but in 1984 it felt like science fiction.
Concatenative synthesis was the dominant approach for a decade. Record a human saying thousands of syllables, then stitch those recordings together to form words and sentences. Better than pure synthesis, but the joins between segments were audible. Every word sounded slightly disconnected from the next.
This is the one that changed everything. Google DeepMind published WaveNet, a neural network that generates raw audio waveforms sample by sample. The quality jump was enormous. For the first time, machine generated speech sounded genuinely human.
Google's Tacotron and Microsoft's FastSpeech solved the speed problem. Neural TTS could now run in real time. This is when neural voices started appearing in consumer products. Google Assistant, Alexa, and Siri all upgraded their voice engines during this period.
Today's neural voices handle context, emotional nuance, multilingual code switching, and natural prosody at a level that's genuinely difficult to distinguish from human speech in blind tests. The technology that cost millions to develop a decade ago is now available for free. Which is exactly why FreeTTS exists.
Not all TTS voices are created equal. In fact, they're created in totally different ways, and understanding the differences helps you pick the right one for what you're trying to do.
Concatenative voices are the old guard. They're built by recording a person saying predetermined words and syllables, then stitching those recordings together. They work, but they sound like a GPS from 2008. You know the type. Every word lands with the same flat energy regardless of context.
Neural voices are the current standard. They're trained on thousands of hours of human speech using deep learning, and they predict pitch, rhythm, and emphasis dynamically for each sentence. These are the voices that make people do a double take because they genuinely sound human. FreeTTS uses exclusively neural voices.
Multilingual voices use a single voice model that speaks multiple languages naturally. Switch from English to Spanish to French without changing the voice. The accent and pronunciation adjust automatically. Useful for content creators targeting international audiences.
Neural voices trained in specific speaking styles. Newscast, conversational, cheerful, empathetic, whispering, shouting. Same base voice, different delivery. American English on FreeTTS has the most style variety.
A synthetic copy of a specific person's voice created from audio samples. Not the same as standard TTS. Voice cloning is personalized and raises serious ethical questions about consent and deepfakes. FreeTTS doesn't offer cloning. We use pre trained, licensed neural voices instead.
Within the same language, you get different regional flavors. English alone has American, British, Australian, Indian, South African, Irish, and more. Spanish has European and Latin American variants. Portuguese has Brazilian and European options.
When you generate speech, the output needs to go somewhere. Different formats serve different purposes, and picking the wrong one can mean unnecessary headaches down the line.
| Format | Quality | File Size | Best For |
|---|---|---|---|
| MP3 | Good (lossy compression) | Small (~1MB per minute) | Universal playback. YouTube, podcasts, websites, presentations. Plays on everything. |
| WAV | Lossless (uncompressed) | Large (~10MB per minute) | Professional audio editing. Use when you need to process, mix, or layer audio without quality loss. |
| OGG | Good (lossy, open source) | Small (~0.8MB per minute) | Web applications, game audio, open source projects. Not universally supported on Apple devices. |
| FLAC | Lossless (compressed) | Medium (~5MB per minute) | Archival and high fidelity playback. Same quality as WAV but half the file size. |
| SRT | Text (subtitles) | Tiny (~2KB per minute) | Video captions, accessibility. Pairs with audio for synchronized subtitles. |
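The per minute sizes in the table follow from simple bitrate arithmetic. Here's a minimal sketch, assuming a 128 kbps MP3 and a CD quality WAV (44.1 kHz, 16 bit, stereo). Those settings are assumptions picked for illustration, not figures stated by any particular tool.

```python
def mb_per_minute(bitrate_kbps: float) -> float:
    """Megabytes needed for one minute of audio at a constant bitrate."""
    bits_per_minute = bitrate_kbps * 1000 * 60
    return bits_per_minute / 8 / 1_000_000


def pcm_bitrate_kbps(sample_rate_hz: int, bit_depth: int, channels: int) -> float:
    """Bitrate of uncompressed PCM audio, as stored in a typical WAV file."""
    return sample_rate_hz * bit_depth * channels / 1000


# Assumed settings: 128 kbps MP3, CD quality stereo WAV.
mp3_mb = mb_per_minute(128)
wav_mb = mb_per_minute(pcm_bitrate_kbps(44_100, 16, 2))

print(f"MP3: {mp3_mb:.2f} MB per minute")
print(f"WAV: {wav_mb:.2f} MB per minute")
```

Mono or lower sample rate speech will come out smaller, so treat the table as a ballpark for music grade settings.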
What FreeTTS outputs: Every generation gives you MP3 audio plus an SRT subtitle file. MP3 plays on literally everything. SRT drops straight into any video editor for instant captions. Two files, one click, zero compatibility headaches.
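If you've never opened an SRT file, it's just numbered text blocks with start and end timestamps. A rough sketch of how one could be built by hand is below. The `build_srt` helper and the cue timings are hypothetical, made up for illustration, and say nothing about how FreeTTS generates its own files.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time offset as HH:MM:SS,mmm (SRT uses a comma before the milliseconds)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def build_srt(cues):
    """Turn (start_seconds, end_seconds, text) tuples into SRT file content."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    # A blank line separates consecutive cues.
    return "\n".join(blocks)


# Hypothetical cues, timings invented for the example.
print(build_srt([(0.0, 2.5, "Hello"), (2.5, 5.0, "World")]))
```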
The voice is only half the equation. The other half is what you feed it. Here's how to write text that TTS handles beautifully.
Commas create micro pauses. Periods create full stops. Ellipses create dramatic pauses. Question marks change intonation. Neural voices read punctuation, not just words. Use it deliberately.
Formal academic writing sounds weird when read aloud. Short sentences mixed with longer ones. Fragments sometimes. Questions followed by answers. Conversational flow beats perfect grammar every time.
"Dr." might be read as "Doctor" or "Drive" depending on context. "$5M" might get mangled. When in doubt, spell it out: "five million dollars." Remove ambiguity before the AI has to guess.
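If you generate a lot of audio, you can do the spelling out automatically before the text ever reaches the voice. This is a minimal sketch, not a full text normalizer. The `REPLACEMENTS` table and the `spell_out` function are hypothetical, and the sketch leaves the digit as is ("5 million dollars") rather than converting it to a word.

```python
import re

# Hypothetical expansion table. Fill it with the abbreviations in YOUR text.
# The replacement is blind: "Dr." always becomes "Doctor", even in "Oak Dr.",
# so only list abbreviations that are unambiguous in your content.
REPLACEMENTS = {
    "Dr.": "Doctor",
    "e.g.": "for example",
    "approx.": "approximately",
}

# Matches amounts such as $5M, $2.5B, or $300K.
MONEY = re.compile(r"\$(\d+(?:\.\d+)?)([KMB])\b")
SCALE = {"K": "thousand", "M": "million", "B": "billion"}


def spell_out(text: str) -> str:
    """Expand money shorthand and known abbreviations before sending text to TTS."""
    text = MONEY.sub(lambda m: f"{m.group(1)} {SCALE[m.group(2)]} dollars", text)
    for short, full in REPLACEMENTS.items():
        text = text.replace(short, full)
    return text


print(spell_out("Dr. Lee raised $5M."))
```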
A cheerful voice reading a eulogy? Awkward. Spend a minute picking a voice that fits the tone. It makes more difference than you'd expect.
Accessibility content works best at 0.75x. Video narration at 1x. Quick explainers at 1.25x. Audiobook style content at 0.9x. There's no universal best speed.
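Speed also changes how long your audio runs, which matters if you're fitting narration to a video. Here's a rough estimate using the speeds above. The 150 words per minute baseline is an assumption (a common ballpark for narration), and real voices will vary.

```python
# Speed multipliers taken from the guidance above.
SPEED_BY_CONTENT = {
    "accessibility": 0.75,
    "audiobook": 0.9,
    "narration": 1.0,
    "explainer": 1.25,
}

BASE_WORDS_PER_MINUTE = 150  # assumed pace at 1x, not a measured value


def estimated_seconds(text: str, content_type: str) -> float:
    """Rough audio length for a script at the speed suggested for its content type."""
    words = len(text.split())
    words_per_minute = BASE_WORDS_PER_MINUTE * SPEED_BY_CONTENT[content_type]
    return words / words_per_minute * 60


script = "word " * 150  # stand-in for a 150 word script
for kind in SPEED_BY_CONTENT:
    print(f"{kind}: about {estimated_seconds(script, kind):.0f} seconds")
```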
The free tier supports 1,000 characters per generation (PRO goes up to 10,000). Splitting by paragraph or section gives you more control. You can use different speeds for different parts.
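Splitting by hand gets old fast with long scripts. One way to automate it is sketched below: break on blank lines first, then on sentence endings, and pack sentences into chunks that stay under the limit. The `chunk_text` function is hypothetical, and it assumes the limit counts characters the way Python's `len` does, which may not match how a given service counts them.

```python
import re


def chunk_text(text: str, limit: int = 1000) -> list[str]:
    """Split text into chunks of at most `limit` characters.

    Chunks never span a paragraph break, and sentences are kept whole
    unless a single sentence is longer than the limit.
    """
    chunks = []
    current = ""
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph.strip()):
            if not sentence:
                continue
            # A single oversized sentence gets hard split at the limit.
            while len(sentence) > limit:
                if current:
                    chunks.append(current)
                    current = ""
                chunks.append(sentence[:limit])
                sentence = sentence[limit:]
            candidate = f"{current} {sentence}".strip()
            if len(candidate) <= limit:
                current = candidate
            else:
                chunks.append(current)
                current = sentence
        # Flush at the end of each paragraph so chunks align with paragraphs.
        if current:
            chunks.append(current)
            current = ""
    return chunks


sample = "First sentence. Second sentence.\n\nNew paragraph here."
print(chunk_text(sample, limit=20))
```

Because every chunk ends on a sentence or paragraph boundary, you can give each one its own speed or voice without cutting a thought in half.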
Preview every single time. Your eyes will miss problems your ears catch instantly. Weird emphasis on the wrong word. An abbreviation read letter by letter. Three seconds of listening saves hours.
The first voice you try isn't always the best one. Generate the same paragraph with three or four different voices. You'll be surprised how different the same text sounds.
If you're building an app, plugin, or service that needs voice output, here's an honest comparison of the major APIs.
| Provider | Voices | Languages | Pricing (per 1M chars) | Free Tier |
|---|---|---|---|---|
| Microsoft Azure | 400+ | 140+ | $16 (neural) | 500K chars/month free |
| Google Cloud TTS | 220+ | 40+ | $16 (WaveNet) | 1M chars/month free (standard) |
| Amazon Polly | 60+ | 30+ | $16 (neural) | 5M chars/month for 12 months |
| OpenAI TTS | 6 | ~57 | $15 (tts-1) / $30 (tts-1-hd) | None |
| ElevenLabs | Unlimited (cloning) | 29+ | ~$18 (estimated) | 10K chars/month |
For developers who are still in the prototyping phase: use FreeTTS to test voices, languages, and user flows before committing to a paid API.
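To turn the table into a monthly budget, multiply your expected character volume by the per million price. A minimal sketch using the list prices from the table above follows. Prices and free tiers change, and a free allowance may only apply to certain voice types, so treat the `free_chars` argument as a simplification and check each provider's current pricing page.

```python
# Prices per 1M characters in USD, copied from the comparison table above.
PRICE_PER_MILLION_USD = {
    "Microsoft Azure (neural)": 16.0,
    "Google Cloud TTS (WaveNet)": 16.0,
    "Amazon Polly (neural)": 16.0,
    "OpenAI tts-1": 15.0,
    "OpenAI tts-1-hd": 30.0,
    "ElevenLabs (estimated)": 18.0,
}


def monthly_cost(chars_per_month: int, price_per_million: float, free_chars: int = 0) -> float:
    """Estimated monthly bill in USD after subtracting any free allowance."""
    billable = max(0, chars_per_month - free_chars)
    return billable / 1_000_000 * price_per_million


volume = 2_000_000  # hypothetical monthly volume
for provider, price in PRICE_PER_MILLION_USD.items():
    print(f"{provider}: ${monthly_cost(volume, price):.2f}")
```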
400+ neural AI voices. 75+ languages. Free MP3 and SRT downloads. 1,000 chars per generation, 5,000 chars/day on the free tier. Just go.