close

DEV Community

Anna lilith
Anna lilith

Posted on

Running AI Models on a Budget: My Experience with Ollama and Free LLMs

Running AI Models on a Budget: My Experience with Ollama and Free LLMs

I run AI models on a 4GB RAM cloud VM with no GPU. Here's how I made it work with Ollama and free API fallbacks.

The Challenge

  • Hardware: 2 CPU cores, 4GB RAM, no GPU
  • Budget: $0 for inference
  • Goal: Run AI models 24/7 for content generation, code analysis, and automation

My Solution: Ollama + Free API Chain

import requests

class LLMRouter:
    def __init__(self):
        self.ollama_url = "http://localhost:11434/api/generate"
        self.local_models = ["qwen2.5:0.5b", "gemma:2b", "qwen2.5:1.5b"]
        self.cloud_models = [
            "nvidia/nemotron-nano-9b-v2:free",
            "qwen/qwen3-coder:free",
            "google/gemma-4-26b-a4b-it:free",
        ]

    def call(self, prompt, max_tokens=500):
        # Try local first (free, fast for simple tasks)
        for model in self.local_models:
            try:
                resp = requests.post(self.ollama_url, json={
                    "model": model,
                    "prompt": prompt,
                    "stream": False,
                    "options": {"num_predict": max_tokens, "num_ctx": 2048}
                }, timeout=30)

                if resp.status_code == 200:
                    return resp.json()["response"], model
            except:
                continue

        # Fallback to free cloud APIs
        return self._call_cloud(prompt, max_tokens)
Enter fullscreen mode Exit fullscreen mode

Model Selection Guide

Based on my testing on a 4GB machine:

Model Size RAM Speed Quality Best For
qwen2.5:0.5b 0.5B ~1GB Fast Basic Quick tasks
qwen2.5:1.5b 1.5B ~2GB Moderate Good General use
gemma:2b 2B ~2GB Moderate Good Creative writing
phi3:mini 3.8B ~3GB Slow Great Complex reasoning

Memory Optimization

Running on 4GB RAM requires careful tuning:

# Environment variables
export OLLAMA_NUM_THREADS=1          # Leave 1 core for OS
export OLLAMA_CONTEXT_LENGTH=2048    # Reduce from default 4096
export OLLAMA_KEEP_ALIVE=24h         # Keep model in RAM
export OLLAMA_MAX_LOADED_MODELS=1    # Only one model at a time
Enter fullscreen mode Exit fullscreen mode

Why These Settings Matter

  • NUM_THREADS = 1: You have 2 cores. If Ollama uses both, the OS and other processes starve and crash.
  • CONTEXT_LENGTH = 2048: Default 4096 doubles RAM usage. 2048 is enough for 90% of tasks.
  • KEEP_ALIVE = 24h: Cold model loading takes 10-30 seconds on CPU. Keeping it warm eliminates this delay.
  • MAX_LOADED_MODELS = 1: Each loaded model consumes RAM. One at a time prevents OOM.

Building a Fallback Chain

The key to reliability is having multiple fallback options:

class ResilientLLM:
    def __init__(self):
        self.chain = [
            ("ollama-qwen0.5b", self._ollama, "qwen2.5:0.5b"),
            ("ollama-qwen1.5b", self._ollama, "qwen2.5:1.5b"),
            ("ollama-gemma2b", self._ollama, "gemma:2b"),
            ("openrouter-free", self._openrouter, "nvidia/nemotron-nano-9b-v2:free"),
        ]
        self._current = 0

    def call(self, prompt, **kwargs):
        for i in range(len(self.chain)):
            idx = (self._current + i) % len(self.chain)
            name, func, model = self.chain[idx]
            try:
                result = func(prompt, model=model, **kwargs)
                if result:
                    self._current = idx
                    return result, name
            except:
                continue
        return None, None
Enter fullscreen mode Exit fullscreen mode

Real-World Performance

On my 4GB cloud VM:

  • Simple text generation (50 words): 5-10 seconds local, 2-3 seconds cloud
  • Code generation (100 lines): 30-60 seconds local, 10-15 seconds cloud
  • Complex analysis: 2-5 minutes local, 15-30 seconds cloud
  • Daily inference cost: $0 (100% local + free tier APIs)

Tips for Budget AI

  1. Use the smallest model that works — qwen2.5:0.5b handles 60% of tasks
  2. Cache aggressively — don't re-generate identical content
  3. Batch requests — process multiple items in one prompt
  4. Monitor memory — kill unused models before OOM
  5. Free tier APIs exist — OpenRouter offers free models as fallback

Conclusion

You don't need expensive GPUs to run AI. With Ollama, smart model selection, and free API fallbacks, you can build a fully autonomous AI system on a $0/month budget. The key is optimization, not hardware.

Top comments (0)