Advanced Transcription Parameters
Audio Note is built on OpenAI Whisper, so it supports many of Whisper's original parameters and lets you fine-tune transcription results to your specific needs. Drawing on the official documentation, GitHub repositories, and community discussions, this document explains what each parameter means, how it affects transcription quality, and when adjusting it can help.
Suppress Non-Speech Tokens
Suppresses the output of non-speech tokens, such as markers for silence, background noise, and music.
When the audio contains songs, background music, or long silent stretches, enabling this option can reduce the spurious non-speech output the model sometimes generates during those parts.
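For reference, here is a minimal sketch with the open-source openai-whisper Python package, where the equivalent knob is suppress_tokens; Audio Note's toggle presumably maps to the same mechanism, and the file name is just a placeholder:

```python
import whisper

model = whisper.load_model("base")

# suppress_tokens="-1" (the default) suppresses Whisper's built-in set of
# non-speech symbols such as music notes and bracketed sound effects;
# pass an empty list ([]) to allow them through instead.
result = model.transcribe("song.mp3", suppress_tokens="-1")
print(result["text"])
```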
Prompt
Provides context to guide the model's output, such as names, place names, and technical terms.
The prompt is fed to the decoder as initial tokens; the model picks up the prompt's style and vocabulary and tends to generate output in a similar manner.
A useful application of prompts is to improve recognition of unknown words (e.g., names and technical terms). You can provide special nouns from the audio as prompts, and the model will automatically learn and use these nouns during transcription.
Note that prompts here work differently from ChatGPT prompts: a Whisper prompt is simply initial text given to the model as stylistic context, not an instruction for it to follow. For example, if your prompt is "Please output in markdown format", the Whisper model will not comply; it imitates the style of the prompt rather than obeying any instructions it contains.
For more detailed usage of prompts, refer to the Whisper Prompting Guide.
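For example, here is a minimal sketch with the open-source openai-whisper Python package, which exposes this setting as initial_prompt (the file name, names, and terms are placeholders):

```python
import whisper

model = whisper.load_model("base")

# Listing uncommon names and technical terms in the prompt makes the model
# more likely to spell them correctly in the transcript.
result = model.transcribe(
    "meeting.mp3",
    initial_prompt="Attendees: Zhang Wei, Dr. Okonkwo. Topics: Kubernetes, LoRA, RAG.",
)
print(result["text"])
```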
Decoding Strategy
In speech recognition systems, the decoding strategy is the method used to turn the model's output probability distributions into a final text sequence. Whisper is a Transformer-based speech recognition model: at each step its decoder produces a probability distribution over possible next tokens, and the decoding strategy determines how the output text is selected from those distributions.
Audio Note supports two decoding strategies:
- Greedy Decoding: At each decoding step, selects the token with the highest probability as output, then continues predicting the next token based on this.
- Beam Search: At each decoding step, maintains multiple candidate paths (determined by beam width), and selects the path with the highest cumulative probability as final output.
Both strategies have their pros and cons. Greedy decoding is simple and fast but may generate incorrect text sequences, while beam search can produce more accurate text sequences at higher computational cost.
How to choose:
- Speed priority: Choose greedy decoding if you need fast text generation.
- Accuracy priority: Choose beam search if you need more accurate text sequences.
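As a sketch of how the two strategies are selected in the open-source openai-whisper Python package (the file name is a placeholder; Audio Note's setting presumably maps to the same options):

```python
import whisper

model = whisper.load_model("base")

# Greedy decoding (the default): at each step, take the single most likely token.
fast = model.transcribe("lecture.mp3", temperature=0.0)

# Beam search: keep 5 candidate sequences per step and return the most probable
# one overall. Slower, but usually more accurate; applies only when temperature is 0.
accurate = model.transcribe("lecture.mp3", temperature=0.0, beam_size=5)
```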
Maximum Context
Controls how much previously transcribed text (in tokens) is carried over as context when decoding each new segment.
Adjustment impact:
- Smaller values reduce repeated text and speed up transcription, but may make the output less coherent, which is most noticeable on long or complex audio.
- Larger values improve coherence and accuracy but increase memory and computational requirements.
In practice, reducing the maximum context on long recordings is a common way to cut down repetition, while shorter recordings can usually afford a larger context for better accuracy. Reducing the maximum context (e.g., to around 64 tokens) may also help reduce hallucinations and repetitive text.
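The open-source openai-whisper Python package does not expose a numeric maximum-context setting, but its condition_on_previous_text option controls the same mechanism at its extreme; a rough sketch (the file name is a placeholder):

```python
import whisper

model = whisper.load_model("base")

# condition_on_previous_text=False is the limiting case of a tiny maximum
# context: no previously transcribed text is fed back in, which curbs
# repetition loops and hallucinations at some cost to cross-segment coherence.
result = model.transcribe("podcast.mp3", condition_on_previous_text=False)
print(result["text"])
```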
No-Speech Threshold
A segment is treated as non-speech and skipped when the model's estimated no-speech (silence) probability exceeds this threshold and its average log probability is also low. Default value is 0.6.
Adjustment impact:
- Lower threshold: Silence detection triggers more easily, so more segments are skipped as noise; useful for audio with long silences or heavy background noise that you want filtered out.
- Higher threshold: Fewer segments are skipped and more content is transcribed, including possible background noise; suitable for relatively clean audio or when you want to keep everything.
Specific thresholds should be experimentally adjusted based on audio quality and application scenarios for optimal results.
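A minimal sketch with the open-source openai-whisper Python package, whose no_speech_threshold parameter this setting presumably maps to (the file name is a placeholder):

```python
import whisper

model = whisper.load_model("base")

# Default is 0.6: a segment is dropped as silence when its no-speech
# probability exceeds this value and its average log probability is low.
# Lower it to filter silence and noise more aggressively.
result = model.transcribe("interview.mp3", no_speech_threshold=0.6)
print(result["text"])
```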
Length Limit
Limits the length of each output paragraph, which helps keep individual paragraphs from becoming too long.
Paragraphs exceeding this length will be truncated. Enable it as needed.
Entropy Threshold
Entropy is a concept in information theory, representing the uncertainty of a random variable.
Here it is applied to the model's output: Whisper computes the entropy of the most recently decoded tokens as a quality check. Highly repetitive output (for example, the same phrase looping over and over) has very low entropy and usually indicates a failed decode. Simply put, the entropy threshold controls how readily the model rejects a candidate transcription and retries it, typically at a higher temperature.
Adjustment impact:
- High entropy threshold: A stricter check; more candidate outputs are flagged as repetitive failures and re-decoded, which helps suppress repetition and hallucination but slows transcription and can occasionally reject valid output.
- Low entropy threshold: A more lenient check; transcription is faster and less output is rejected, but repetitive or low-quality text is more likely to slip through.
If the output contains looping or repeated phrases, try raising this value slightly; if valid speech seems to be dropped or transcription becomes noticeably slower, lower it.
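As a purely conceptual illustration (this is not Audio Note's internal code, and it splits on whitespace rather than using Whisper's tokenizer), the entropy of a decoded sequence drops sharply when the output starts looping:

```python
import math
from collections import Counter

def sequence_entropy(tokens):
    """Shannon entropy (in nats) of the token distribution in a sequence."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

varied = "the quick brown fox jumps over the lazy dog".split()
looping = "thank you thank you thank you thank you".split()

print(sequence_entropy(varied))   # ~2.04: diverse output, passes the check
print(sequence_entropy(looping))  # ~0.69: repetitive, likely a failed decode
```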
Log Probability Threshold
Sets the minimum average log probability a transcribed segment must reach; segments that fall below it are treated as unreliable.
During transcription, the model assigns each decoded token a log probability. The threshold is compared against the average over a segment: segments scoring below it are considered low-confidence and may be re-decoded or skipped, which directly affects output quality.
Adjustment impact:
- High log probability threshold: Demands high confidence from the model, reducing errors but potentially dropping low-confidence passages and leaving the output incomplete.
- Low log probability threshold: Accepts lower-confidence output, producing a more complete transcription but potentially introducing more errors.
In high-stakes scenarios such as legal or medical transcription, this value can be raised somewhat, but not too far, or passages whose confidence falls short of the threshold may be filtered out entirely.
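For reference, a sketch with the open-source openai-whisper Python package, assuming this setting maps to its logprob_threshold parameter (default -1.0; the file name is a placeholder):

```python
import whisper

model = whisper.load_model("base")

# Raising the threshold from the default -1.0 toward 0 demands higher average
# confidence per segment; lowering it keeps more low-confidence output.
result = model.transcribe("deposition.mp3", logprob_threshold=-0.5)
print(result["text"])
```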
Temperature
The temperature parameter controls the randomness of decoding: higher temperatures increase randomness, while lower temperatures make the output more conservative and deterministic.
Adjustment impact:
- Low temperature (e.g., 0.0) makes output more deterministic, suitable for standard transcription.
- High temperature (e.g., 1.0) introduces more randomness, suitable for tasks requiring diversity.
There is also a temperature increment setting, which controls how much the temperature is raised each time the decoder falls back and retries after a failed decode (for example, when the entropy or log probability checks described above are not met).
If the model generates random or incoherent text during non-speech parts, lowering the temperature can improve output reliability.
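A sketch with the open-source openai-whisper Python package, which expresses the same idea as an explicit list of fallback temperatures rather than a single increment (the file name is a placeholder):

```python
import whisper

model = whisper.load_model("base")

# A single value fixes the temperature for the whole run.
deterministic = model.transcribe("notes.wav", temperature=0.0)

# A tuple enables fallback: decoding starts at 0.0 and is retried at the next
# temperature only when the quality checks (e.g., log probability) fail.
with_fallback = model.transcribe(
    "notes.wav", temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)
)
```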