A tryst with Claude Code to get whisper.cpp to run on an iGPU

Aaron — Wed, 27 May 2026 12:10:07 GMT

I have a lot of meeting recordings that need transcribing. Cloud services work, but sending audio to a third party felt unnecessary when my laptop has a perfectly capable Intel Iris Xe GPU sitting mostly idle. So I built a local transcription pipeline using whisper.cpp with OpenVINO acceleration — and wrapped it in a small Flask web app so I don't have to touch the terminal every time.

This is a writeup of how it works, what broke along the way, and a few implementation decisions worth documenting.

Why whisper.cpp + OpenVINO

whisper.cpp is a C++ port of OpenAI's Whisper model. It's fast, runs entirely offline, and supports OpenVINO as a backend for the encoder, which means the encoder pass can run on Intel's iGPU when you pass the -oved flag (for example, -oved GPU). On a 7-minute recording, transcription took around 7 minutes on CPU and under 3 on the iGPU. Not earth-shattering, but meaningful for batch work.

The catch is that getting OpenVINO to actually see the GPU takes a bit of setup.

Getting OpenVINO to see the GPU

Three things needed fixing before ov.Core().available_devices would report anything beyond CPU:

1. Level Zero registry. The Intel GPU runtime depends on Level Zero being properly registered with the system. If it's missing or misconfigured, OpenVINO silently falls back to CPU with no error.

2. LD_LIBRARY_PATH must be a conda env variable, not a shell variable. This one is subtle. Setting it in .bashrc or before running a command doesn't carry through conda run. It has to live inside the environment itself:

conda env config vars set LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu -n openvino

3. libstdcxx-ng upgrade. The default version bundled with conda was too old for the OpenVINO shared libraries. Upgrading it inside the env fixed the import errors.

The Transcription Script

With the environment working, transcribe.sh handles the full pipeline: detect available devices, extract audio from whatever file you hand it, run whisper.

Device detection runs at startup and falls back to CPU if the GPU isn't available:

AVAILABLE_DEVICES=$(conda run -n openvino python -c \
    "import openvino as ov; print(','.join(ov.Core().available_devices))" \
    2>/dev/null | grep -v WARNING | grep -v overwriting | grep -E "^[A-Z]")

if echo "$AVAILABLE_DEVICES" | grep -q "GPU"; then
    OVED_FLAG="-oved GPU"
else
    OVED_FLAG="-oved CPU"
fi

Audio extraction via ffmpeg handles both audio and video input — the -vn flag strips the video track and the -ar 16000 -ac 1 flags resample to the 16kHz mono WAV that Whisper expects:

ffmpeg -i "$INPUT_FILE" -vn -ar 16000 -ac 1 -c:a pcm_s16le "$WAV_FILE" -y
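If the same extraction step were driven from Python (for instance from the Flask side) rather than from transcribe.sh, it might look roughly like the sketch below. The function names build_ffmpeg_cmd and extract_audio are hypothetical and not taken from the repository; only the ffmpeg flags come from the command above.

```python
import subprocess


def build_ffmpeg_cmd(input_file, wav_file):
    """Build the ffmpeg argument list for 16 kHz mono 16-bit PCM output.

    Hypothetical helper: the flags mirror the shell command in the post.
    """
    return [
        "ffmpeg",
        "-y",                 # overwrite the output file if it exists
        "-i", str(input_file),
        "-vn",                # drop any video stream
        "-ar", "16000",       # resample to 16 kHz
        "-ac", "1",           # downmix to mono
        "-c:a", "pcm_s16le",  # 16-bit little-endian PCM
        str(wav_file),
    ]


def extract_audio(input_file, wav_file):
    """Run ffmpeg and raise CalledProcessError if it exits non-zero."""
    subprocess.run(build_ffmpeg_cmd(input_file, wav_file), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.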

Then whisper runs through conda:

conda run --no-capture-output -n openvino "$WHISPER_PATH" \
    -f "$WAV_FILE" \
    -m "$MODEL_PATH" \
    -l "$LANGUAGE" \
    -t 12 \
    $OVED_FLAG \
    -otxt \
    -of "${OUTPUT_DIR}/${BASENAME}"

The `set -e` trap

One early gotcha: the script uses set -e, which exits immediately on any non-zero return code. The grep -E '^[A-Z]' at the end of the device detection pipeline returns exit code 1 when there are no matches — which happens if conda produces no output at all. That silently killed the whole script before whisper ever ran.

The fix was being explicit about redirecting conda's noisy stderr and accepting that the grep might match nothing without treating it as a fatal error.
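To make the "no match is not an error" idea concrete, here is a small Python rendering of the same filtering logic. It is only an illustration under my own assumptions, not the fix used in transcribe.sh: parse_devices and oved_flag are hypothetical names, and unlike the shell pipeline this version stops at the first matching line.

```python
def parse_devices(raw_output):
    """Pull the comma-separated device list out of noisy `conda run` output.

    Returns an empty list when nothing matches, instead of failing.
    """
    for line in raw_output.splitlines():
        line = line.rstrip()
        # Same filters as the grep chain: drop warnings and overwrite notices.
        if not line or "WARNING" in line or "overwriting" in line:
            continue
        # Equivalent of grep -E "^[A-Z]": keep lines starting with A-Z.
        if "A" <= line[0] <= "Z":
            return [name for name in line.split(",") if name]
    return []


def oved_flag(devices):
    """Choose the -oved arguments, falling back to CPU when no GPU is listed."""
    has_gpu = any(d == "GPU" or d.startswith("GPU.") for d in devices)
    return ["-oved", "GPU" if has_gpu else "CPU"]
```

An empty result simply flows into the CPU branch, which is the behaviour the shell script needed once the failing grep stopped aborting it.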

The Flask Web App

Running bash transcribe.sh recording.m4a from the terminal works fine, but it gets tedious when you're doing it repeatedly. I wanted a browser interface: drop a file, pick a language and model, watch the output stream in, cancel if needed.

The app has five routes:

GET / — renders the UI with available models and local files pre-populated
POST /api/upload — accepts files up to 2 GB
POST /api/transcribe — spawns a background thread and returns a job_id
GET /api/jobs//stream — SSE endpoint that streams log output line by line
POST /api/jobs//cancel — kills the job
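The transcribe and stream routes share a simple pattern: a background thread pushes log lines into a queue, and the SSE endpoint drains that queue. The sketch below shows that pattern with the standard library only. It is not the app's actual code; JOBS, start_job, stream_job, and format_sse are names I made up for illustration, and the "end" event name is an assumption.

```python
import queue
import threading
import uuid

JOBS = {}  # job_id -> queue of log lines (hypothetical in-memory registry)


def format_sse(data, event=None):
    """Encode one Server-Sent Events message (terminated by a blank line)."""
    lines = []
    if event:
        lines.append(f"event: {event}")
    for part in data.splitlines() or [""]:
        lines.append(f"data: {part}")
    return "\n".join(lines) + "\n\n"


def start_job(work):
    """Run work() in a background thread; work must return an iterable of lines."""
    job_id = uuid.uuid4().hex
    log = queue.Queue()
    JOBS[job_id] = log

    def run():
        try:
            for line in work():
                log.put(line)
        finally:
            log.put(None)  # sentinel: tells the stream the job has finished

    threading.Thread(target=run, daemon=True).start()
    return job_id


def stream_job(job_id):
    """Generator yielding SSE messages until the job finishes."""
    log = JOBS[job_id]
    while True:
        line = log.get()
        if line is None:
            yield format_sse("done", event="end")
            return
        yield format_sse(line)
```

In a Flask view, a generator like stream_job would typically be wrapped in a Response with mimetype "text/event-stream".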

Finding conda from inside Flask

The first runtime problem: Flask's dev server doesn't inherit the shell environment, so conda wasn't in PATH. Hardcoding /home/aaron/anaconda3/bin/conda would work on my machine but nowhere else. Instead, a small helper searches common install locations and also respects a CONDA_EXE environment variable override:

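import os
import shutil

# Fallback locations checked when conda is not on PATH.
_CONDA_SEARCH_PATHS = [
    "/home/aaron/anaconda3/bin/conda",
    "/home/aaron/miniconda3/bin/conda",
    "/opt/conda/bin/conda",
    "/usr/local/anaconda3/bin/conda",
]

def find_conda() -> str:
    # Prefer whatever conda is on PATH, if any.
    found = shutil.which("conda")
    if found:
        return found
    # Otherwise try the known install locations in order.
    for p in _CONDA_SEARCH_PATHS:
        if os.path.isfile(p):
            return p
    raise RuntimeError("conda executable not found. Set CONDA_EXE to override.")

# An explicit CONDA_EXE always wins over the search.
CONDA = os.environ.get("CONDA_EXE") or find_conda()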

Real progress from Whisper's output

Whisper prints timestamps as it processes audio: [00:01.000 --> 00:04.000] Some transcribed text. The backend parses these timestamps and divides by the total audio duration to compute a real completion percentage, which gets pushed to the client over SSE. So the progress bar actually moves in proportion to how much has been transcribed, rather than pulsing indefinitely.
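A minimal sketch of that parsing step is below. The names progress_percent and timestamp_to_seconds are hypothetical, and the regex is written to accept both the mm:ss.mmm form shown above and an hh:mm:ss.mmm form, on the assumption that longer recordings may print an hours field.

```python
import re

# Captures the end timestamp of a segment line, e.g. "--> 00:04.000]".
_END_STAMP = re.compile(r"-->\s*((?:\d+:)?\d{1,2}:\d{2}\.\d{3})\]")


def timestamp_to_seconds(stamp):
    """Convert "mm:ss.mmm" or "hh:mm:ss.mmm" to seconds."""
    parts = stamp.split(":")
    seconds = float(parts[-1])
    minutes = int(parts[-2])
    hours = int(parts[-3]) if len(parts) == 3 else 0
    return hours * 3600 + minutes * 60 + seconds


def progress_percent(line, total_seconds):
    """Return completion in percent, or None if the line has no timestamp."""
    match = _END_STAMP.search(line)
    if match is None or total_seconds <= 0:
        return None
    done = timestamp_to_seconds(match.group(1))
    # Clamp: the last segment can end slightly past the reported duration.
    return min(100.0, 100.0 * done / total_seconds)
```

Lines without a timestamp (model loading messages, warnings) return None, so the caller can forward them to the log without touching the progress bar.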

Model browser

Downloading models normally means running models/download-ggml-model.sh tiny in the terminal. I added a model browser panel in the UI: it lists all 30 available models with their sizes, marks which ones are already downloaded, and lets you download any of them with progress streaming over SSE. Smaller quantized models like small-q5_1 (182 MB) are a reasonable tradeoff against the full small (466 MB) if storage is a concern.
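Marking which models are already on disk can be as simple as checking for the expected file names. The sketch below assumes the models live in one directory and follow the ggml-<name>.bin naming that the download script produces; model_status is a hypothetical helper, not the app's actual function.

```python
from pathlib import Path


def model_status(models_dir, names):
    """Map each model name to True if ggml-<name>.bin exists in models_dir."""
    root = Path(models_dir)
    return {name: (root / f"ggml-{name}.bin").is_file() for name in names}
```

The UI can then render a "downloaded" badge for every name that maps to True and a download button for the rest.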

Cancellation and process groups

The stop button was the most interesting bug. Clicking it would show "Stopping…" in the UI, but whisper kept running and the output kept streaming. The issue: conda run is a wrapper that spawns whisper-cli as a child process. Sending SIGTERM to the conda run process leaves the child alive and running.

The fix is to put the subprocess in its own session when spawning it, then kill the entire process group on cancel:

proc = subprocess.Popen(
    cmd,
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    start_new_session=True,
)

# on cancel:
os.killpg(os.getpgid(proc.pid), signal.SIGTERM)

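start_new_session=True makes Python call setsid() in the child before it execs, so conda run becomes the leader of a new session and a new process group, and every process it spawns inherits that group. os.killpg sends the signal to every process in the group at once, so whisper-cli receives SIGTERM along with the wrapper instead of being orphaned.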
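For completeness, here is a self-contained sketch of the spawn and cancel pair that can be tried on any POSIX system. The function names and the SIGKILL escalation after a grace period are my own additions and are not described in the app; the core start_new_session and os.killpg calls are the ones shown above.

```python
import os
import signal
import subprocess


def spawn_in_new_session(cmd):
    """Start cmd as the leader of its own session and process group."""
    return subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        start_new_session=True,
    )


def cancel(proc, grace_seconds=5.0):
    """SIGTERM the whole process group, escalating to SIGKILL after a timeout."""
    try:
        pgid = os.getpgid(proc.pid)
        os.killpg(pgid, signal.SIGTERM)
    except ProcessLookupError:
        return  # the process (and its group) is already gone
    try:
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        # Assumption: a hard kill is acceptable if SIGTERM is ignored.
        os.killpg(pgid, signal.SIGKILL)
        proc.wait()
```

A quick way to see the effect is to spawn a wrapper shell with a long-running child, for example spawn_in_new_session(["sh", "-c", "sleep 30 & wait"]), and then call cancel on it: both the shell and the sleep receive the signal.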

The Result

A self-hosted transcription app that runs entirely on local hardware, uses the integrated GPU when available, streams output in real-time, and handles cancellation cleanly. Drop in a meeting recording, pick a language, get a text file. Nothing leaves the machine.

The code is on the aaxa_openvino branch of my whisper.cpp fork if you want to take a look.
