Peregrine: What Stopped the Jump from Llama 3.2 1B to 3B

A short writeup of a one-day run at putting a bigger on-device language model on Peregrine, the voice assistant that lives inside the rig. We tried to move from Llama 3.2 1B to Llama 3.2 3B on the same board. The chip is fine with it. The runtime that ships with the operating system image is not. Here is what we did, what broke, and what changes about the upgrade plan.

Where Peregrine stands today

Peregrine runs on a Radxa Dragon Q6A, a small single-board computer built around the Qualcomm QCS6490. The chip pairs an 8-core Kryo CPU with an Adreno 643 GPU and a Hexagon v68 NPU rated at up to 12 TOPS. Our current board has 8 GB of LPDDR5 and runs Radxa's Ubuntu 24.04 (Noble) image. On that stack, Llama 3.2 1B runs on the NPU through Qualcomm's Genie and QNN runtime at roughly 12 tokens per second. It has been steady in production. No audio leaves the rig, and no cell connection is required.

Why we wanted to try the 3B

A 12 GB version of the same board is on the way, so memory is no longer the easy reason to stay small. The interesting question became whether the same chip could carry Llama 3.2 3B at the same context length and over the same fully on-device path. Within the same model family, a 3B generally writes noticeably better answers than a 1B. On paper, the silicon has room.

What we tried

We compiled Llama 3.2 3B for the QCS6490 through Qualcomm's AI Hub, using the qcs6490-proxy compile target and the catalog entry for the 3.2 3B Instruct model. Quantization was W4A16. Context length was 1024 to match the current 1B bundle. The compile produced a 2.8 GB Genie bundle: three weight-share .bin parts, a genie_config.json, an htp_backend_ext_config.json, and the tokenizer files. Same shape as the working 1B bundle.
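The bundle layout described above can be checked before anything is copied to the board. What follows is a minimal sketch, not the tooling we actually used. The function name check_bundle is hypothetical, and the tokenizer check assumes the tokenizer files have "tokenizer" in their names, which the writeup does not state.

```python
from pathlib import Path

# Config files named in the writeup.
REQUIRED_CONFIGS = ("genie_config.json", "htp_backend_ext_config.json")
# The 3B compile produced three weight-share .bin parts.
EXPECTED_BIN_PARTS = 3


def check_bundle(bundle_dir, expected_parts=EXPECTED_BIN_PARTS):
    """Return a list of layout problems; an empty list means the layout looks right."""
    root = Path(bundle_dir)
    if not root.is_dir():
        return [f"{root} is not a directory"]

    problems = []
    for name in REQUIRED_CONFIGS:
        if not (root / name).is_file():
            problems.append(f"missing {name}")

    parts = sorted(p.name for p in root.glob("*.bin"))
    if len(parts) != expected_parts:
        problems.append(f"expected {expected_parts} .bin parts, found {len(parts)}")

    # Assumption: tokenizer files carry "tokenizer" somewhere in the file name.
    has_tokenizer = any(
        "tokenizer" in p.name for p in root.iterdir() if p.is_file()
    )
    if not has_tokenizer:
        problems.append("no tokenizer file found")

    return problems
```

A check like this only confirms that the files are present. It says nothing about whether the runtime on the board will accept them.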

The bundle went onto the board in its own directory, separate from the working 1B. The runtime libraries on the board (libGenie.so, genie-t2t-run, and the five QNN libraries it pulls in) were symlinked from the 1B bundle's directory. The configs were patched to match the same idle and thermal behavior the 1B already runs. The systemd service got a drop-in override rather than an edit to the canonical service file, so reverting was a single delete and a daemon reload. The 1B service stayed the canonical default the whole way through.
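The side-by-side layout above can be sketched as follows. This is an illustration under stated assumptions, not the exact script we ran. The helper names, the unit name peregrine-llm.service, and the contents of the override are hypothetical. The list of runtime file names is passed in by the caller because the writeup names only libGenie.so and genie-t2t-run. The one detail taken from systemd itself is that a drop-in lives in a directory named after the unit with a .d suffix.

```python
from pathlib import Path


def link_runtime(src_dir, dst_dir, names):
    """Symlink the named runtime files from src_dir into dst_dir."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    linked = []
    for name in names:
        target = src / name
        if not target.exists():
            raise FileNotFoundError(target)
        link = dst / name
        # Replace a stale link or file so reruns are idempotent.
        if link.is_symlink() or link.exists():
            link.unlink()
        link.symlink_to(target.resolve())
        linked.append(link)
    return linked


def write_dropin(systemd_dir, unit, body):
    """Write <systemd_dir>/<unit>.d/override.conf and return its path."""
    dropin_dir = Path(systemd_dir) / f"{unit}.d"
    dropin_dir.mkdir(parents=True, exist_ok=True)
    path = dropin_dir / "override.conf"
    path.write_text(body)
    return path


def remove_dropin(systemd_dir, unit):
    """Delete the override; the canonical unit file is never touched."""
    path = Path(systemd_dir) / f"{unit}.d" / "override.conf"
    if path.exists():
        path.unlink()
```

On a real board, systemd_dir would be /etc/systemd/system, and both writing and removing the override would be followed by systemctl daemon-reload. Because the override is a separate file, deleting it restores the 1B service exactly as it was.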

What blocked us

The board would not load the new model. The loader on the board, the piece of software whose job is to take a compiled model file and bring it up on the NPU, refused the new bundle and returned an error before it ever ran a token. We confirmed the refusal was not our launcher's fault by asking that loader to open the model directly. Same refusal.

The root cause is a version gap. The new bundle was compiled against QAIRT 2.42. The library on the board is QAIRT 2.40.1. A context binary built against the newer release does not load on the older runtime. AI Hub no longer exposes a 2.40 compile target: the lowest you can ask for is 2.42, the default is 2.45, and the latest is 2.46. So no recompile we can do against today's AI Hub will produce a bundle that loads on the Radxa OS image as it currently ships.
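The version reasoning above can be written down as a small check. This is a sketch under one loud assumption: a bundle loads only when the runtime's major.minor version is at least as new as the version the bundle was compiled against. That rule matches what we observed, but it is our simplification rather than a documented Qualcomm guarantee. The function names are hypothetical.

```python
def parse_version(text):
    """Turn '2.40.1' into (2, 40, 1)."""
    return tuple(int(part) for part in text.split("."))


def loadable(bundle_version, runtime_version):
    """Assumed rule: the runtime's (major, minor) must be >= the bundle's."""
    return parse_version(bundle_version)[:2] <= parse_version(runtime_version)[:2]


def usable_targets(available, runtime_version):
    """Filter compile targets down to those the runtime could load."""
    return [v for v in available if loadable(v, runtime_version)]


# Versions taken from the writeup.
AI_HUB_TARGETS = ["2.42", "2.45", "2.46"]
BOARD_RUNTIME = "2.40.1"

print(usable_targets(AI_HUB_TARGETS, BOARD_RUNTIME))  # prints []
```

The empty list is the whole problem in one line: every target AI Hub offers is newer than the runtime on the board, so the gap has to close on the board side.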

What this changes about the ceiling

We had been describing the Hexagon v68 NPU on this chip as having a hard ceiling on how big a local model it could run. That framing was wrong. The chip can compute the 3B. The block is a software version mismatch between the runtime libraries shipping in Radxa's OS image and the runtime that AI Hub now targets. It is a software gap, not a silicon gap. That moves the next step from "buy bigger silicon" to "close the runtime gap," which is a much cheaper plan.

Path forward

For now, we are staying with the 1B model that is already in production. It is steady, it is fast enough for the way Peregrine gets used in the rig, and keeping the smaller-memory variant of the board keeps the bill of materials lower for anyone building one. If the runtime version gap closes and the larger-memory variant comes down in price, we will pick this back up. Until then, 1B is the answer.