Quick release note. Cli-Modelarium 0.1.4 just shipped, and the headline is two new providers.
Two new providers, ten in total
You can now compare Alibaba's Qwen models (via DashScope) and Z.AI's GLM models side by side with the rest of the lineup: OpenAI, Anthropic, Google, xAI, DeepSeek, Mistral, Groq, OpenRouter, plus your local models. That brings the total to 10 cloud providers.
If you have wanted to benchmark the open-weight models against the frontier ones on your own prompts, it is now a single command:
pip install --upgrade cli-modelarium
cli-modelarium "Write a haiku about garbage collection in programming" \
    --models qwen3.7-max,glm-5.2,gpt-5.4,claude-opus-4-8 \
    --runs 10 --max-cost 0.50
You get a side-by-side table with cost and latency per model. With --runs greater than 1, it repeats the trials and runs the statistical tests automatically, so you can tell a real difference from noise instead of eyeballing a single output. The --max-cost flag is a hard cap, so a multi-model run does not surprise you on your API bill.
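To make "a real difference from noise" concrete, here is a minimal sketch of one such test: a percentile bootstrap confidence interval on paired latency differences. This is not Cli-Modelarium's internal code; the function name `paired_bootstrap_ci`, its parameters, and the latency numbers are all hypothetical and only illustrate the general technique.

```python
import random
import statistics


def paired_bootstrap_ci(a, b, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of the paired differences a[i] - b[i].

    Hypothetical helper for illustration; not part of Cli-Modelarium.
    """
    if len(a) != len(b) or not a:
        raise ValueError("a and b must be non-empty and the same length")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    # Resample the differences with replacement and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(diffs), lo, hi


# Made-up latencies in seconds for two models over the same 10 runs.
latency_a = [1.92, 2.10, 1.85, 2.30, 2.05, 1.98, 2.20, 2.15, 1.90, 2.08]
latency_b = [1.60, 1.75, 1.58, 1.95, 1.70, 1.66, 1.88, 1.80, 1.62, 1.74]

mean_diff, lo, hi = paired_bootstrap_ci(latency_a, latency_b)
print(f"mean difference: {mean_diff:.3f}s, 95% CI: [{lo:.3f}, {hi:.3f}]")
```

If the interval excludes zero, as it does for this made-up data where model A is slower on every run, the gap is unlikely to be run-to-run noise. With only a handful of runs the interval will be wide, which is the practical argument for a higher --runs value.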
Also in this release
- Refreshed all pricing to current provider rates
- Added Qwen and GLM to the model groups (all-flagship, all-budget, all-fast, all-cheap), plus GLM to all-reasoning, so you can pull them in by group
- Added Python 3.14 support
- A few model ID updates to track provider renames
New here?
Cli-Modelarium is a command-line tool for comparing LLM outputs side by side, with real statistics (bootstrap confidence intervals, paired significance tests, McNemar's test), CI-ready assertions, hallucination detection, LLM-as-judge scoring, and cost tracking. One pip install, no infrastructure, Apache 2.0.
- GitHub: https://github.com/lavellehatcherjr/cli-modelarium
- PyPI: https://pypi.org/project/cli-modelarium/
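If McNemar's test in the feature list above is unfamiliar: it compares two models on the same set of pass/fail cases by looking only at the cases where they disagree. The sketch below is a generic exact (binomial) version of the test, not the tool's implementation; the helper names `discordant_counts` and `mcnemar_exact` and the pass/fail data are hypothetical.

```python
from math import comb


def discordant_counts(passed_a, passed_b):
    """Count the cases where exactly one of the two models passed."""
    pairs = list(zip(passed_a, passed_b))
    b = sum(1 for x, y in pairs if x and not y)  # A passed, B failed
    c = sum(1 for x, y in pairs if not x and y)  # A failed, B passed
    return b, c


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant pair counts.

    Under the null hypothesis each discordant pair is a fair coin flip,
    so the smaller count follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree, so there is no evidence
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


# Made-up results on 14 shared test cases (True means the assertion passed).
model_a = [True] * 13 + [False]
model_b = [True] * 4 + [False] * 9 + [True]

b, c = discordant_counts(model_a, model_b)
print(f"discordant pairs: b={b}, c={c}, p={mcnemar_exact(b, c):.4f}")
```

Cases where both models pass or both fail carry no information about which one is better, which is why only the discordant counts enter the p-value.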
Would love to hear how the new providers work for your use case.