GPU Transcription
A GPU is the best device for running large models and typically delivers the best experience across AI tasks.
Although Audio Note can transcribe on the CPU, the speed may be unsatisfactory. GPU acceleration is therefore strongly recommended, as it significantly improves processing speed and efficiency.
Audio Note's GPU acceleration can be achieved through three engines:
- CUDA (Windows)
- Vulkan (Windows)
- CoreML (Mac)
This document introduces these three engines, the GPU types they support, and how to choose the appropriate engine for your hardware and needs.
GPU Engine Introduction
CUDA Engine
CUDA (Compute Unified Device Architecture) is a proprietary parallel computing platform and application programming interface (API) developed by NVIDIA. It allows software to use NVIDIA GPUs for general-purpose computing (GPGPU), providing direct access to GPU hardware to accelerate compute-intensive tasks.
Features:
- Designed specifically for NVIDIA GPUs with optimized performance.
- Supports multiple programming languages such as C, C++, Fortran, Python, and Julia.
- Provides rich libraries and tools, suitable for users requiring high-performance computing.
Ideal for users who need high-performance GPU acceleration, especially those using NVIDIA GPUs.
Requirements:
- Manufacturer: Only supports NVIDIA GPUs.
- Model requirements: All NVIDIA GPUs starting from the G8x series, including GeForce, Quadro, and Tesla series.
- VRAM requirements: Specific VRAM needs depend on the Whisper model used. Larger models require more VRAM, with at least 8GB recommended for the largest Whisper models.
Using the CUDA engine requires downloading CUDA-related runtime libraries. If not present locally, Audio Note will prompt to download them when switching to the CUDA engine.
Vulkan Engine
Vulkan is a high-performance, low-level compute API designed to provide direct access to GPU hardware, supporting efficient parallel processing. It is a cross-platform API that can run on various operating systems and hardware platforms.
Features:
- Supports hardware from multiple GPU manufacturers, including NVIDIA, AMD, Intel, Samsung, and Qualcomm.
- Platform-independent with strong compatibility, suitable for deployment in different hardware environments.
- Provides fine-grained control over GPUs, ideal for users requiring flexibility and broad compatibility.
Use cases:
- Suitable for users without NVIDIA GPUs but with other graphics accelerators, such as AMD GPUs.
- Ideal for applications that need to run on multiple hardware platforms.
Requirements:
- Manufacturer: Supports GPUs from multiple manufacturers, including NVIDIA, AMD, Intel, Samsung, and Qualcomm.
- Model: Any device supporting Vulkan API 1.2+ with sufficient RAM and compatible Vulkan drivers.
- VRAM requirements: Specific VRAM needs depend on the Whisper model used. Larger models require more VRAM, with at least 8GB recommended for the largest Whisper models.
CoreML Engine
CoreML is a machine learning framework developed by Apple, specifically designed for macOS, iOS, watchOS, and tvOS platforms. It allows developers to integrate machine learning models into Apple devices, leveraging the device's CPU, GPU, and Neural Engine for acceleration.
Features:
- Designed specifically for Apple devices, fully utilizing Apple hardware performance.
- Supports various machine learning models, including Whisper transcription models.
- Provides efficient model loading and inference, suitable for running on Apple devices.
- Intel-based Macs are supported, but performance is modest.
- All Apple Silicon (M-series) chips are supported.
CoreML is only available on macOS and is intended for users who want to leverage Apple hardware acceleration.
Requirements:
- Manufacturer: Only supports Apple devices.
- Model: Supports Apple devices running macOS, including MacBook, iMac, Mac mini, etc.
- Hardware requirements: Requires macOS 10.13 or later and a GPU supporting Metal.
How to Choose the Right GPU Engine
On Windows, choose between the CUDA and Vulkan engines based on your GPU; on Mac, the CoreML engine is used by default.
When both CUDA and Vulkan engines are available, the CUDA engine is recommended due to its superior performance.
- For users with NVIDIA GPUs, the CUDA engine is recommended.
- For users without NVIDIA GPUs, the Vulkan engine is recommended.
- For users without a dedicated GPU, the Vulkan engine is recommended (if integrated graphics are available, enabling GPU usage will attempt to run transcription on them).
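The selection rules above can be summarized in a short sketch. Note that `choose_engine` is a hypothetical helper written for illustration, not part of Audio Note's API; the app performs this selection through its settings UI.

```python
def choose_engine(platform: str, has_nvidia_gpu: bool) -> str:
    """Pick a transcription engine following the rules above.

    Hypothetical helper for illustration only.
    """
    if platform == "mac":
        return "CoreML"   # macOS always uses CoreML
    if has_nvidia_gpu:
        return "CUDA"     # preferred when available, for its superior performance
    # AMD/Intel discrete GPUs and integrated graphics fall back to Vulkan
    return "Vulkan"
```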
Enabling Flash Attention
Flash Attention is an optimization technique designed to improve the computational efficiency and speed of the attention mechanism in Transformer models (including Whisper models). It achieves this by reducing memory usage and accelerating attention calculations, particularly significant on GPUs. Specifically, Flash Attention breaks down attention calculations into smaller chunks and performs them in the GPU's fast memory (SRAM), reducing access to main memory (HBM) and improving efficiency.
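The chunked computation described above can be illustrated with a minimal NumPy sketch: attention is processed one key/value block at a time while maintaining a running maximum and normalizer (the "online softmax" trick), so the result matches the full-matrix version without ever materializing the complete attention matrix. This is a sketch of the idea only, not Audio Note's or Flash Attention's actual implementation.

```python
import numpy as np

def naive_attention(q, k, v):
    # Standard attention: materializes the full score matrix at once
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def chunked_attention(q, k, v, block=4):
    # Processes keys/values in blocks with an online softmax,
    # never holding the full attention matrix in memory
    d = q.shape[-1]
    out = np.zeros((q.shape[0], v.shape[-1]))
    m = np.full((q.shape[0], 1), -np.inf)  # running row max
    l = np.zeros((q.shape[0], 1))          # running normalizer
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = q @ kb.T / np.sqrt(d)
        m_new = np.maximum(m, s.max(axis=-1, keepdims=True))
        p = np.exp(s - m_new)
        scale = np.exp(m - m_new)          # rescale previous partial results
        l = l * scale + p.sum(axis=-1, keepdims=True)
        out = out * scale + p @ vb
        m = m_new
    return out / l
```

Both functions return identical results; the chunked variant only touches one block of keys and values at a time, which is what lets Flash Attention keep the working set in fast GPU memory.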
When using GPU for transcription, you can enable Flash Attention to improve model efficiency.
Flash Attention usage conditions:
- Hardware dependency: Flash Attention is typically associated with GPUs, especially CUDA-capable environments. It relies on a GPU's parallel processing capabilities, so it may not be usable without a GPU.
- Software support: Requires model and library support. For example, in Whisper models, specific libraries like Flash Attention 2.0 may need to be installed to enable it.
- Applicability: Flash Attention mainly applies to optimized Transformer models. It may not be available for unadapted models or environments.
Tests show that enabling Flash Attention can slightly improve transcription speed, but the specific improvement depends on the model, hardware, and environment.
Contact us