Free online speech-to-text
Free speech to text online — convert audio & voice to text
FreeTTS Speech to Text is a free online speech-to-text tool that converts audio, voice, and video into accurate, editable text. Press record and watch words print onto the page the instant you speak, or upload a file and get a refined, word-timestamped transcript in seconds. There’s no sign-up and no credit card — the free plan covers 90 minutes a month, and you can copy your text or export TXT, SRT, and VTT subtitles with no watermark. It runs in your browser, transcribes in 75+ languages with auto-detect, and when you’re done you can send the transcript straight to FreeTTS Studio to re-voice it.
FreeTTS transcribes up to 90 minutes of audio per month for free, in sessions of up to 15 minutes each, with no account and no credit card required. TXT, SRT, and VTT exports are free, with no watermark on any output.
FreeTTS uses a two-pass engine: an instant in-browser preview while you speak, then a high-accuracy cloud speech-recognition pass on stop that adds punctuation, casing, and word-level timestamps. Interim words stay in your browser and are never uploaded.
FreeTTS supports more than 75 languages and dialects with automatic language detection. Paid plans add speaker labels: PRO is $19.99/mo for 20 hours per month and Creator is $39.99/mo for 100 hours per month.
What is FreeTTS Speech to Text?
FreeTTS Speech to Text is the audio-to-text converter built into FreeTTS, the free web-based text-to-speech and transcription platform at freetts.org. It turns spoken audio into written text two ways: live, as you dictate into your microphone, and from a recording you upload. The result is a clean, punctuated transcript you can edit inline, search, copy, or export.
The thing that separates it from a basic dictation box is the timing. Every word in your transcript carries a precise start and end time, so the text is linked to the audio. Click a word and the recording seeks to that exact moment. Export those timings as SRT or VTT and you have broadcast-ready subtitles. That makes FreeTTS equally useful as a voice-to-text notepad, an MP3-to-text converter, and a subtitle generator — without juggling three different tools.
Most browser dictation tools stop at a stream of lowercase words with no punctuation and no way to fix them. FreeTTS goes the other way: the transcript is a real document. It’s contenteditable, so you can correct a misheard name or split a run-on sentence in place, and the edit sticks when you export. Nothing installs, nothing has to be configured, and there’s no queue — the audio you record or drop in is the audio that gets transcribed, in the same tab, while you watch.
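To make the word-level timing concrete, here is a minimal Python sketch of a time-linked transcript. The `Word` class and the three helpers are hypothetical names chosen for illustration, not FreeTTS's actual data model. The only assumption is that each word carries a start and end time in seconds.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the audio
    end: float


def seek_time_for(words: list[Word], index: int) -> float:
    """Clicking word `index` seeks playback to that word's start time."""
    return words[index].start


def word_at(words: list[Word], t: float) -> int:
    """Index of the latest word that has started by time t (words sorted by start)."""
    starts = [w.start for w in words]
    return max(bisect_right(starts, t) - 1, 0)


def correct(words: list[Word], index: int, new_text: str) -> None:
    """An inline edit changes the text only; the timing stays attached."""
    words[index].text = new_text


# Illustrative data: the second word was misheard and is fixed in place.
words = [
    Word("Hello", 0.00, 0.42),
    Word("Jon", 0.48, 0.80),
    Word("welcome", 1.10, 1.65),
]
correct(words, 1, "John")
```

Because the correction touches only `text`, the corrected word still seeks to the same moment and still lands in the right caption at export time.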
How to transcribe audio to text in three steps
Record or upload
Press record to dictate live, or drop in an audio or video file — MP3, WAV, M4A, MP4, and more. The in-browser engine starts printing interim words immediately, so you see it working from the first syllable.
Pick a language (or don’t)
Leave it on auto-detect and FreeTTS identifies the spoken language for you, or pin one of 75+ languages from the picker on the console. Smart punctuation is on by default in every language.
Export or re-voice
On stop, the accurate pass replaces the preview with the canonical transcript. Copy it, export TXT/SRT/VTT, or send it to FreeTTS Studio to re-synthesize it in a different voice or language.
Transcribe in 75+ languages with auto-detect
FreeTTS transcribes in more than 75 languages and regional variants, spanning every major language family you’re likely to record. From the Germanic group there’s English, German, Dutch, Swedish, Norwegian, and Danish; from the Romance group, Spanish, French, Italian, Portuguese, and Romanian; Slavic covers Russian, Ukrainian, Polish, and Czech; and the set reaches well beyond Europe into Arabic, Hebrew, Turkish, Hindi, Bengali, Tamil, Thai, Vietnamese, Indonesian, Japanese, Korean, and both Mandarin and Cantonese Chinese.
Accents are handled at the variant level, not just the language level. English alone is split into US, UK, Australian, Indian, and Irish recognition models; Spanish distinguishes Spain from Mexico and the wider Latin-American region; French separates France from Canada; and Portuguese separates Brazil from Portugal. Picking the right variant from the console matters: an Indian-English speaker transcribes far more accurately against the Indian-English model than against the US default. If you don’t know which language or variant a clip is in, or if speakers switch mid-recording, leave it on auto-detect and FreeTTS identifies the spoken language from the audio itself.
Every language gets the same treatment: correct casing, native punctuation (including right-to-left scripts like Arabic and Hebrew, and the spacing rules of CJK), and word-level timestamps. That means a Japanese interview or a Spanish lecture exports clean SRT captions exactly the way an English podcast does — no transliteration step, no extra setup.
How accurate is it, and how does it work?
Accuracy comes from a two-pass design, and it’s worth understanding why there are two passes instead of one. The first pass is the live preview: a speech-recognition engine running inside your browser paints interim words with near-zero latency and zero upload. It’s optimized for speed, not perfection — it guesses early, revises as more of a word arrives, and skips punctuation. That’s exactly what you want while you’re still talking, because it confirms the mic is working and lets you follow along, but it’s a draft.
The second pass runs the moment you press stop. The captured audio is sent once to a high-accuracy speech-recognition engine that re-transcribes the whole clip with the full context of the recording in front of it, so it can disambiguate homophones from surrounding words, restore capitalization and sentence punctuation, and align every word to a start and end time in the audio. When it returns, a gold sweep crosses the transcript and the canonical result replaces the rough preview in place. Because the second pass sees the entire clip rather than a moving window, it is consistently more accurate than the live draft it overwrites.
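The draft-then-replace behaviour of the two passes can be sketched in a few lines of Python. This is an illustration of the idea only: `LiveTranscript` and its method names are hypothetical, and the two recognition engines are represented simply by whoever calls `on_interim` and `on_final`.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LiveTranscript:
    """Hypothetical holder for what the page shows during and after recording."""
    interim: list[str] = field(default_factory=list)  # fast, unpunctuated guesses
    final: Optional[str] = None                        # set once, after stop

    def on_interim(self, words: list[str]) -> None:
        # The preview may revise earlier guesses, so each update replaces
        # the whole draft instead of appending to it.
        self.interim = list(words)

    def on_final(self, text: str) -> None:
        # The second pass saw the entire clip; its result overwrites the draft.
        self.final = text
        self.interim = []

    def display(self) -> str:
        if self.final is not None:
            return self.final
        return " ".join(self.interim)


t = LiveTranscript()
t.on_interim(["their", "going"])             # early guess
t.on_interim(["they're", "going", "home"])   # revised as more audio arrives
draft = t.display()
t.on_final("They're going home.")            # punctuated, cased result
```

The design point is that the draft is disposable: nothing from the preview has to be merged into the final text, which is why the second pass can simply overwrite it.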
Like every speech-to-text system, accuracy is highest on clear audio in a supported language and drops on noisy recordings, heavy crosstalk, distant or echoey microphones, and very strong accents — which is the single biggest reason to pin the correct language variant before you record. We’d rather tell you that than promise a magic percentage. The upside of an editable, time-linked transcript is that fixing the rare slip takes a couple of seconds: click the word that looks wrong, the audio seeks to that exact moment so you can hear what was actually said, then type the correction inline.
File formats & exports — SRT vs VTT vs TXT
A transcript is only useful in the format the next tool expects, so FreeTTS exports three, plus copy-to-clipboard — all free, on every plan, with no watermark. Knowing which one to pick saves a round trip:
TXT — the plain transcript
Just the words, with punctuation and paragraph breaks, no timing. This is what you want for show notes, an article draft, meeting minutes, a blog post, or anything you’ll paste into a doc. It’s the most portable export and opens in any editor.
SRT — subtitles for video
SubRip captions: numbered cues with start/end timecodes (HH:MM:SS,mmm) and the line of text under each. SRT is the format YouTube, Vimeo, Premiere, and most video editors accept directly, so it’s the default for captioning a finished video.
VTT — subtitles for the web
WebVTT is the caption format for the HTML5 <track> element, which is what a browser reads for an embedded <video> element. It’s close to SRT but uses a dot before milliseconds and supports styling cues, so reach for VTT when you’re shipping captions on your own site.
The practical rule: TXT when you need words, SRT when you’re uploading to a video platform, VTT when the player is a web page. Both subtitle formats are generated from the same word-level timestamps, so the captions are tightly synced to the audio rather than guessed from sentence length. And because the transcript is editable before you export, any correction you make is baked into the subtitle file too.
Free vs PRO vs Creator
The free plan is a real tool, not a teaser. Here’s exactly what each tier gets, so you can pick without guessing.
| Feature | Free | PRO — $19.99/mo | Creator — $39.99/mo |
|---|---|---|---|
| Monthly transcription | 90 minutes | 20 hours | 100 hours |
| Max session length | 15 minutes | Long-form | Long-form |
| Live in-browser preview | Yes | Yes | Yes |
| Accurate pass + timestamps | Yes | Yes | Yes |
| Export TXT / SRT / VTT / copy | Yes | Yes | Yes |
| Speaker labels (diarization) | — | Yes | Yes |
| Account required | No | Yes | Yes |
If you transcribe the odd voice memo or a short clip for captions, Free is plenty. If you run interviews, meetings, or podcasts and need to know who said what, PRO adds speaker diarization and 20 hours a month. If transcription is a daily part of your workflow, Creator’s 100 hours is the volume tier. No cancellation fees, no hoops.
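If you would rather check the arithmetic than eyeball the table, this small Python sketch picks the lowest tier that covers a given monthly volume. The quotas and prices come from the table above; the `PLANS` list and `pick_plan` function are hypothetical helpers for illustration, not part of any FreeTTS API.

```python
PLANS = [
    # (name, monthly minutes, price in USD per month, speaker labels)
    ("Free", 90, 0.00, False),
    ("PRO", 20 * 60, 19.99, True),
    ("Creator", 100 * 60, 39.99, True),
]


def pick_plan(minutes_per_month, need_speaker_labels=False):
    """Return the first (cheapest) plan that covers the usage, or None."""
    for name, quota, _price, labels in PLANS:
        if minutes_per_month <= quota and (labels or not need_speaker_labels):
            return name
    return None  # more than 100 hours a month is outside the listed tiers


light_use = pick_plan(60)                               # a few short clips
interviews = pick_plan(60, need_speaker_labels=True)    # same volume, needs diarization
daily_use = pick_plan(25 * 60)                          # 25 hours a month
```

Note that speaker labels, not volume, are what push a light user from Free to PRO.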
Who uses it, and for what
Creators & podcasters
Drop in a finished episode and get a transcript in one pass: export SRT to caption the YouTube cut, copy TXT into the show notes, and skim the text to pull the three quotable lines for the audiogram. A 40-minute episode that used to be an evening of scrubbing becomes a few minutes. With speaker labels on PRO, two-host banter and guest interviews stay attributed line by line, so the transcript reads like a script instead of a wall.
Students & researchers
Record a lecture or a one-on-one interview and walk out with a searchable transcript. When you’re writing the paper, search the text for the term you half-remember, then click the word to jump the audio back to that moment and confirm the exact wording before you quote it. The timestamps double as citations — “[12:04]” points anyone reviewing your work straight to the source.
Journalists
Turn an interview recording into text fast enough to file on deadline, then verify any line by clicking it to seek the audio — no rewinding by ear. The live preview never leaves your browser, and uploaded audio isn’t used to train models, which matters when a source spoke on condition you’d protect the tape.
Accessibility & compliance
Generate accurate SRT/VTT captions for video, courses, and recorded meetings so content is usable by deaf and hard-of-hearing audiences. Captions and transcripts are also what WCAG and the ADA expect for time-based media, so the same export that widens your audience also closes a compliance gap — and the text feeds search engines that can’t hear audio.
The common thread is that the work doesn’t end at the transcript. Because the text is editable and time-linked, every one of these jobs — captioning, quoting, fact-checking, publishing — happens against the same document instead of three exported copies that drift out of sync.
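The search-then-cite workflow described for students and researchers can be sketched as follows. This is a minimal Python illustration, assuming words are available as `(text, start, end)` tuples with times in seconds; `cite` and `find_term` are hypothetical names.

```python
def cite(seconds: float) -> str:
    """Format a time offset as a [MM:SS] citation, e.g. 724.3 -> [12:04]."""
    minutes, secs = divmod(int(seconds), 60)
    return f"[{minutes:02d}:{secs:02d}]"


def find_term(words, term):
    """Return a citation for every word matching term, ignoring case and punctuation."""
    needle = term.lower()
    return [
        cite(start)
        for text, start, _end in words
        if text.lower().strip(".,!?;:\"'") == needle
    ]


# Illustrative lecture fragment with two mentions of the same term.
lecture = [
    ("Entropy", 724.3, 724.9),
    ("is", 725.0, 725.1),
    ("entropy.", 900.0, 900.5),
]
hits = find_term(lecture, "entropy")
```

For recordings longer than an hour this sketch keeps counting minutes past 59 (for example `[72:10]`); switching to an `HH:MM:SS` form is a one-line change if you prefer it.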
Your audio stays private
The live preview runs entirely in your browser — interim words are never uploaded anywhere. Audio is only sent for the accurate pass when you press stop, and the microphone stream is released the moment recording ends. Transcripts aren’t sold, and your audio is not used to train models. For dictation you never want to leave the device at all, the in-browser preview alone gives you usable text without a single upload.
FreeTTS vs Otter, Descript, Rev & Whisper
We own FreeTTS, so take the bias as read — but here’s an honest read on where each fits.
| Tool | Free minutes | Sign-up | In browser | Captions (SRT/VTT) | Best for |
|---|---|---|---|---|---|
| FreeTTS (our pick) | 90 min/mo | Not for free use | Yes | Free | Quick free audio/voice/MP3 to text + captions, then re-voicing |
| Otter.ai | Limited monthly minutes | Required | Yes | Limited | Live meeting notes & team collaboration |
| Descript | Trial only | Required | App download | Yes (paid) | Editing audio/video by editing the transcript |
| Rev | Mostly paid | Required | Yes | Yes | Human-grade accuracy when you’ll pay per minute |
| Whisper (OpenAI) | Free, self-hosted | N/A (you run it) | Local setup | DIY | Developers comfortable running a model locally |
In more detail: Whisper is an excellent open model, but “free” there means installing Python, downloading multi-gigabyte weights, and handling subtitle export yourself. That is great for developers and a non-starter if you just have a recording and a deadline. Otter is built around live meeting notes and team collaboration, so its real strength is the workspace, not one-off file conversion. Descript shines when transcription is a step inside editing audio or video (you edit the media by editing the words), but that power comes as a desktop app and a paid plan. Rev leans on paid, near-human accuracy, billed per minute, which is the right call when a court transcript has to be better than 98% right.
FreeTTS sits in the gap all four leave open: the tool you open when you just want to convert speech, an MP3, or a video to text right now (free, in the browser, no sign-up, no install) and walk away with editable text plus synced subtitles. And it’s the only one of the five that also turns text back into speech, which is the next section.
The speech-to-text ↔ text-to-speech loop
FreeTTS is unusual in closing the loop both ways. Speech-to-text turns a recording into editable text; text-to-speech turns that text back into clean audio — and because both live on the same platform, you can go full circle without exporting between apps. Transcribe a recording here, fix the wording, then send the transcript to FreeTTS Studio and re-synthesize it in any of 300+ HD voices.
That sounds abstract until you have a use for it. A few that come up constantly: clean up a recording that was mumbled or noisy by transcribing it, correcting the text, and re-voicing it in a crisp studio voice; localize a podcast by transcribing the English, translating the text, and generating the same script in a native voice for another market; or swap a narrator entirely without re-recording a single line. It’s the same engine behind our free text-to-speech tools and the PDF to Audiobook converter, so speech in becomes polished speech out in one place.
Ready to transcribe?
No sign-up, no credit card. Record or upload, and export your text in seconds.