Word Error Rate (WER)

Word Error Rate (WER) is a common metric of the performance of a speech recognition or machine translation system.

Understanding Word Error Rate (WER)

Word Error Rate (WER) is the industry-standard metric for measuring the accuracy of Automatic Speech Recognition (ASR) systems. It provides a numerical value that represents how much the machine-generated transcript deviates from a perfect "ground truth" transcript created by a human. The lower the WER, the more accurate the transcription system is.

How is WER Calculated?

The calculation of WER is based on the Levenshtein distance computed over words rather than characters: the minimum number of word-level edits needed to turn the reference transcript into the AI transcript. Three types of errors are counted:

Substitutions (S): When a word is replaced by a different word (e.g., "hear" instead of "here").
Deletions (D): When a word is missing from the AI transcript.
Insertions (I): When an extra word is added to the AI transcript that does not appear in the reference transcript.
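
The three error types can be counted with a word-level edit-distance table followed by a walk back through it. The sketch below is a minimal illustration, not the method any particular transcription service uses: the function name count_errors and the example sentences are hypothetical, and it assumes both transcripts are already normalized (same casing, no punctuation) and split on whitespace.

```python
def count_errors(reference: str, hypothesis: str) -> dict:
    """Count substitutions (S), deletions (D), and insertions (I) at word level."""
    ref = reference.split()
    hyp = hypothesis.split()

    # dist[i][j] = minimum number of edits needed to turn the first i
    # reference words into the first j hypothesis words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dist[i][0] = i  # every reference word deleted
    for j in range(1, len(hyp) + 1):
        dist[0][j] = j  # every hypothesis word inserted

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # correct word or substitution
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
            )

    # Walk back from the bottom-right corner and label each edit.
    counts = {"S": 0, "D": 0, "I": 0}
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and ref[i - 1] == hyp[j - 1]
                and dist[i][j] == dist[i - 1][j - 1]):
            i, j = i - 1, j - 1  # correct word, no error counted
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            counts["S"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            counts["D"] += 1
            i -= 1
        else:
            counts["I"] += 1
            j -= 1
    return counts


reference = "please turn on the kitchen lights"
hypothesis = "please turn the kitchen light now"
print(count_errors(reference, hypothesis))  # {'S': 1, 'D': 1, 'I': 1}
```

Here "on" is missing from the AI transcript (one deletion), and the single word "lights" is aligned against the two words "light now" (one substitution and one insertion). When several alignments share the same minimum distance, the split between S, D, and I can differ between tools even though the total stays the same.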

The formula is: WER = (S + D + I) / N, where N is the total number of words in the reference transcript. The result is usually reported as a percentage.
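
As a quick arithmetic check of the formula, the sketch below plugs error counts into it. The helper name word_error_rate and the sample counts are made up for illustration.

```python
def word_error_rate(substitutions: int, deletions: int,
                    insertions: int, reference_words: int) -> float:
    """Return WER = (S + D + I) / N as a fraction (multiply by 100 for percent)."""
    if reference_words <= 0:
        raise ValueError("The reference transcript must contain at least one word.")
    return (substitutions + deletions + insertions) / reference_words


# A 6-word reference with 1 substitution, 1 deletion, and 1 insertion.
print(f"{word_error_rate(1, 1, 1, 6):.1%}")  # 50.0%

# Insertions are not bounded by N, so WER can exceed 100%.
print(f"{word_error_rate(0, 0, 5, 4):.1%}")  # 125.0%
```

The second call shows why WER is an error rate rather than a true percentage of wrong words: a transcript with many extra words can score above 100%.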

What is a Good WER?

A "good" WER depends on the context. In ideal conditions (clear audio, a single speaker, no background noise), modern AI models can achieve a WER below 5%, which is comparable to human performance. In noisy environments or with heavy accents, the WER might rise to 15-20%. For most business and educational purposes, a WER below 10% is considered excellent and usable without significant editing.

Why WER Matters for You

When choosing a transcription service like Libraryminds, understanding WER helps you evaluate the quality you'll receive. High WER means you'll spend more time correcting errors, while low WER allows you to immediately use the transcript for search, summarization, or study. We continuously benchmark our AI providers to ensure we are delivering the lowest possible WER to our users across different languages and audio qualities.

Frequently Asked Questions

Is 0% WER possible?

In practice, a 0% WER is extremely rare, because even human transcribers disagree on the exact wording of fast speech. A WER of 3-5% is generally treated as the 'human parity' threshold.

Does a high WER mean the transcript is useless?

Not necessarily. Even with a 15% WER, the transcript is often perfectly adequate for keyword search and getting the general gist of a conversation.

How can I improve the WER of my recordings?

Using a high-quality microphone, reducing background noise, and speaking clearly are the most effective ways to lower the error rate.
