Run MiniCPM4 with CPU only

2025-06-08

This is a guide to serving the MiniCPM4-0.5B model on CPU only (on my laptop, under WSL Ubuntu 24.04).

The model is served with llama.cpp, and I break the process down into several steps.

  • Download the model from Hugging Face
huggingface-cli download openbmb/MiniCPM4-0.5B
  • Install (and compile) llama.cpp
# build llama.cpp
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
sudo apt install python3-dev build-essential cmake libcurl4-openssl-dev
cmake -B build
cmake --build build --config Release

# prepare llama tools
uv venv --python=3.12 .venv
source .venv/bin/activate
uv pip install -r requirements/requirements-convert_hf_to_gguf.txt --index-strategy unsafe-best-match
  • Convert the downloaded safetensors model to GGUF format
# the model sits in huggingface-cli's default cache location
python convert_hf_to_gguf.py \
~/.cache/huggingface/hub/models--openbmb--MiniCPM4-0.5B/snapshots/ebf6ddf19764646a49d94e857fb4eb439f35ecfb/ \
--outfile /path/to/minicpm4-0.5b.gguf
  • Serve the gguf file with llama.cpp
build/bin/llama-server --model /path/to/minicpm4-0.5b.gguf
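
By default the server listens on 127.0.0.1:8080. On a CPU-only machine it can help to set the thread count and context size explicitly; a sketch of the same command with those options (adjust --threads to your core count, flag behavior may differ between llama.cpp versions):
# optional: pin threads and context size for CPU serving
build/bin/llama-server --model /path/to/minicpm4-0.5b.gguf \
    --host 127.0.0.1 --port 8080 \
    --threads 8 --ctx-size 4096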

Now you can call the model with the OpenAI API protocol at http://127.0.0.1:8080/v1 .
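
As a quick smoke test, you can hit the chat completions endpoint with curl (the "model" field below is just a label; llama-server answers with whatever model it was started with):
# send one chat message to the local server
curl http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "minicpm4-0.5b", "messages": [{"role": "user", "content": "Hello!"}]}'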

The .gguf file can also be served with Ollama:

echo 'FROM /path/to/minicpm4-0.5b.gguf' > Modelfile
ollama create MiniCPM4:0.5b -f Modelfile
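
After the create step, the model can be tried interactively (assuming the Ollama daemon is running):
# start a chat session with the newly created model
ollama run MiniCPM4:0.5b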