Deploy models on SGLang with CPU only

August 3, 2025

If you just want to serve an LLM with WSL2 on your old laptop, this is for you.

The official documentation explains how to build and run SGLang on CPU with Docker, but there are several issues with the image built from that tutorial:

  • SGLang depends on vLLM, which is not installed
  • The SGLang code requires NUMA support to run

So I built an image that addresses both problems; it is available at https://hub.docker.com/r/metaphorprojects/sglang-cpu .

The image has vLLM installed. As for the NUMA problem, the SGLang code base uses `lscpu -p=CPU,Core,Socket,Node` to gather NUMA node information. The parsing fails on systems without NUMA support, so I replaced one line of code to bypass the check.
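For illustration, here is a minimal Python sketch of that kind of parsing. It is not the actual SGLang source, and the fall-back-to-node-0 behavior is my assumption about a reasonable bypass:

import subprocess

# Illustrative sketch only -- not the actual SGLang source.
def cpu_to_numa_node() -> dict[int, int]:
    out = subprocess.check_output(
        ["lscpu", "-p=CPU,Core,Socket,Node"], text=True
    )
    mapping: dict[int, int] = {}
    for line in out.splitlines():
        if line.startswith("#"):  # lscpu prefixes header lines with '#'
            continue
        cpu, _core, _socket, node = line.split(",")
        # On systems without NUMA the Node field may be empty (an
        # assumption about such systems); map every CPU to node 0
        # instead of failing to parse.
        mapping[int(cpu)] = int(node) if node else 0
    return mapping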

The image can be started with:

docker run \
  -it \
  --rm \
  --privileged \
  -v /dev/shm:/dev/shm \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 30000:30000 \
  -e "HF_TOKEN=<secret>" \
  metaphorprojects/sglang-cpu /bin/bash

We can now serve a model:

python -m sglang.launch_server \
  --trust-remote-code \
  --disable-overlap-schedule \
  --tool-call-parser qwen25 \
  --device cpu \
  --host 0.0.0.0 \
  --tp 1 \
  --model Qwen/Qwen3-0.6B

Now use the OpenAI-compatible endpoint at `http://127.0.0.1:30000/v1` to start calling the model (with tool calling supported).
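As a quick check, here is a minimal sketch using the openai Python client. The `get_weather` tool and its schema are hypothetical, included only to exercise tool calling; since no API key was configured at launch, any placeholder string works:

from openai import OpenAI

# Point the client at the local SGLang server; the api_key is a
# placeholder since no --api-key was set at launch.
client = OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="none")

# A hypothetical tool definition, only to demonstrate tool calling.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)

# The model either answers directly or returns a tool_calls entry
# naming get_weather with its arguments.
print(response.choices[0].message)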