Quantize

I got small LLMs running nicely on a Raspberry Pi

A wrapper and web UI around llama.cpp, tuned for single-board computers like the Raspberry Pi. It streams tokens over WebSockets and stays within the Pi's tight memory budget by using Q4 quantization and mmap, so a 3B model fits in 2GB. Token throughput is low, but for a home assistant that mostly answers short questions it is fast enough to feel responsive. I wanted something that runs entirely on hardware I own: no API keys, no usage caps, and no sending my household's voice commands to someone else's cloud. Setup is a single script that auto-detects the board. Local Whisper for voice input is the next milestone.
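To illustrate why mmap helps with the memory budget, here is a minimal Python sketch using only the standard library. It is not the project's code: the `read_slice` helper and the dummy file are hypothetical stand-ins for a real model file. The point it shows is that a memory-mapped file is paged in only where it is touched, and those pages stay backed by the file, so the OS can drop them under memory pressure instead of swapping.

```python
import mmap
import os
import tempfile


def read_slice(path: str, offset: int, length: int) -> bytes:
    """Read `length` bytes at `offset` through a read-only memory map."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; pages are loaded only when accessed.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[offset:offset + length]


# Hypothetical stand-in for a model file (a real one would be gigabytes).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(256)) * 4096)  # 1 MiB of dummy "weights"
    demo_path = tmp.name

# Only the pages covering this slice need to be resident in RAM.
chunk = read_slice(demo_path, offset=256 * 10, length=4)
os.remove(demo_path)
```

llama.cpp applies the same operating-system mechanism to the model file, which is why the weights do not have to be copied into process memory up front.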

Devlog

Fitting a 3B model on a Pi 4

1mo ago

Q4 quantization plus mmap got a 3B model running in 2GB. Throughput is low, but fast enough for voice.
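As a rough sanity check on that figure, the weight size can be estimated by hand. The sketch below assumes the Q4_0 block layout (32 weights per block, stored as one 16-bit scale plus 16 bytes of packed 4-bit values, about 4.5 bits per weight) and exactly 3 billion parameters. Both are assumptions for illustration: the post only says "Q4", and real model files mix tensor types and add metadata.

```python
def q4_0_weight_bytes(n_params: int) -> int:
    """Estimate weight storage under an assumed Q4_0 block layout."""
    block_weights = 32      # weights per quantization block
    block_bytes = 2 + 16    # fp16 scale + 32 packed 4-bit values
    n_blocks = -(-n_params // block_weights)  # ceiling division
    return n_blocks * block_bytes


weights = q4_0_weight_bytes(3_000_000_000)
print(f"{weights / 2**30:.2f} GiB")  # prints "1.57 GiB"
```

That leaves some headroom under 2GB for the KV cache and runtime buffers, which this estimate does not count.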

Comments (1)

Maya Chen · 48d ago

Used something similar years ago that got acquired and killed. Glad this is open source.

In these collections

Self-hostable infra by @devon_ships · 0

Related Projects

Cohort

Beacon

Whisperdesk

Threadbare
