📚 Documentation
Last updated: 2026-02-08

Concepts

TODO (screenshot replacement): Model settings page (App 2.0). Include: model family tabs (Whisper/Realtime), download status, the default model selector, and the GPU engine entry. Suggested filename: settings-models-v2-en.png

Scope

Audio Note model capabilities are organized into three groups:

  • Official Whisper models
  • Community models
  • Realtime models

Model selection controls recognition quality and speed; it does not affect app workflows such as link downloads or watch scheduling.

Use Cases

  • Low-end devices: realtime models or Whisper Tiny/Base
  • Balanced quality and speed: Whisper Small/Medium
  • Maximum quality: Whisper Large-v3 / Large-v3-Turbo
  • English-only throughput: .en and Distil English variants
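The mapping above can be sketched as a small helper. This is an illustrative sketch only, not part of the Audio Note app; the profile names and lowercase model identifiers are assumptions for the example.

```python
def suggest_model(profile: str, english_only: bool = False) -> str:
    """Suggest a model tier for a workload profile (illustrative names)."""
    presets = {
        "low_end": "tiny",          # or a realtime (Sherpa) model
        "balanced": "small",        # Small/Medium territory
        "max_quality": "large-v3",  # or large-v3-turbo
    }
    if profile not in presets:
        raise ValueError(f"unknown profile: {profile}")
    model = presets[profile]
    # .en variants trade multilingual coverage for English-only speed/quality
    if english_only and model.startswith(("tiny", "base", "small", "medium")):
        model += ".en"
    return model
```

For example, `suggest_model("balanced", english_only=True)` returns `"small.en"`, matching the English-only throughput recommendation above.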

Steps

  1. Open Settings > Transcription.
  2. Choose a model family for your workload.
  3. Download models to a stable local path.
  4. Run a baseline sample in your real scenario.
  5. Tune model size or advanced parameters based on quality/latency results.
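For steps 4-5, a useful baseline metric is the real-time factor (RTF): processing time divided by audio duration, where values below 1.0 mean faster than real time. A minimal sketch, assuming you can time your own transcription call and know the clip length (the `transcribe` call in the comment is hypothetical):

```python
import time

def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; < 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds

# Example usage with a hypothetical transcribe() call:
#   start = time.perf_counter()
#   transcribe("meeting.wav")  # your engine's call goes here
#   rtf = real_time_factor(time.perf_counter() - start, audio_seconds=600.0)
```

Comparing RTF across model sizes on the same sample makes the quality/latency trade-off in step 5 concrete.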

Official Whisper models (current built-ins)

  • Tiny / Tiny English
  • Base / Base English
  • Small / Small English
  • Medium / Medium English
  • Large-v2
  • Large-v3
  • Large-v3-Turbo

Community models (current built-ins)

  • Distil Small English
  • Distil Medium English
  • Distil Large V2 English
  • Distil Large V3

Realtime models (current built-ins)

  • Sherpa ncnn: Chinese-English / Chinese / English / French
  • Sherpa ONNX: Chinese-English / Chinese / English / French / Russian / Korean / Japanese

Built-in model lists may change between app versions.

Term Explanations

  • Tiny / Base / Small / Medium / Large: model size tiers; larger tiers usually improve quality but increase latency/cost.
  • Turbo: an optimized variant balancing speed and quality for larger models.
  • .en variants: English-optimized models with better speed/quality tradeoff for English-only workloads.
  • Community models: community-trained variants; validate with your own samples before production use.

See also: Model Usage Recommendations.

Real Scenario: From “Works Once” to “Works Every Day”

A common rollout mistake is selecting a large model for everyone on day one, then hitting long wait times and low adoption. A more practical path is:

  1. Start with Small/Medium as a team baseline.
  2. Use Large-v3/Turbo only for quality-critical outputs.
  3. Document model presets so teammates can reproduce results.

This keeps daily throughput stable while still preserving a high-accuracy lane for important assets.
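One lightweight way to document the baseline + escalation presets from step 3 is a small shared config file. A sketch with hypothetical preset names; only the model identifiers mirror the built-in list above:

```python
# Hypothetical team presets (e.g. kept in a shared repo); the preset
# names and notes are illustrative, not an Audio Note feature.
PRESETS = {
    "daily": {
        "model": "small",
        "note": "team baseline: stable throughput for everyday notes",
    },
    "quality": {
        "model": "large-v3-turbo",
        "note": "escalation lane for quality-critical outputs",
    },
}

def preset(name: str) -> dict:
    """Look up a shared preset so teammates reproduce the same settings."""
    try:
        return PRESETS[name]
    except KeyError:
        raise ValueError(f"unknown preset {name!r}; use one of {sorted(PRESETS)}")
```

Checking such a file into version control gives teammates one place to look up (and reproduce) the agreed defaults.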

FAQ

Q: Are realtime models always faster than Whisper?
A: Usually, yes, in low-latency scenarios, but actual speed still depends on hardware and runtime settings.

Q: Are community models always more accurate?
A: Not always. They are often optimized for specific languages/domains and must be validated with your data.

Q: Is larger always better?
A: Larger models often improve quality but significantly increase compute and memory cost.

Common Mistakes

  • Mistake 1: Choosing by model size only.
    Fix: choose by workflow first (realtime vs offline accuracy), then tune size.
  • Mistake 2: Applying one heavy model to every machine.
    Fix: keep a baseline + escalation strategy to balance reliability and quality.
  • Mistake 3: Skipping real-sample validation.
    Fix: run A/B checks with your own audio before changing team defaults.
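For the A/B checks above, a simple word error rate (WER) against a reference transcript is enough to compare two models on your own samples. A self-contained sketch of the standard word-level edit distance (an illustrative helper, not an Audio Note feature):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must not be empty")
    # Dynamic programming over the (len(ref)+1) x (len(hyp)+1) grid.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution / match
        prev = cur
    return prev[-1] / len(ref)
```

Run each candidate model on the same clips, compute WER against a hand-checked transcript, and only then change the team default.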

Limitations

  • Status: Stable (non-Beta), with potential model-list updates in future releases.
  • Availability depends on app version and subscription entitlements.
  • Platform-specific acceleration differs: CUDA/Vulkan on Windows, CoreML on macOS.
  • Large models require substantial disk, memory, and often GPU resources.

Copyright © 2026. Made by AudioNote. All rights reserved.