
Arindam Majumder

Building a Debate Council of LLMs to Stress-Test NVIDIA Cosmos 3

A benchmark score tells you how a model did on a test. It does not tell you whether the model can hold a position, take a punch, and adjust without falling apart.

That second thing is what I wanted to know about NVIDIA Cosmos 3, which NVIDIA had just shipped. So instead of running yet another eval, I did something more fun. I built the model an arena and made it argue with itself.

The result is Cosmos Arena, a multi-agent debate council. You hand it a motion, something like "This house believes AGI will arrive before 2035," and five roles fight it out:

  • An Advocate argues for, a Skeptic argues against
  • They trade rebuttals across several rounds
  • An optional Pragmatist pokes holes in both sides
  • An Arbiter scores everything and hands down a verdict

Here is the catch that makes it a real test: every seat runs on the same model. The only thing that changes is the role.

That is surprisingly hard to fake. A model that just knows how to sound smart will produce two confident speeches that never touch. A model that can actually reason opens round two by answering the exact weakness the other side exposed in round one. You watch the difference happen, turn by turn.

This tutorial is the full build. By the end you will have a working Streamlit app, you will get why the orchestration uses LangGraph instead of one model pretending to be everyone, and you will know how to serve Cosmos 3 through Nebius Token Factory.

First, what Cosmos 3 actually is


Let me be honest about the model up front, because it is not a normal chat model, and the name can mislead you.

NVIDIA built Cosmos 3 for Physical AI: robots, autonomous vehicles, factory floors, anything that has to understand motion, causality, and physics in the real world. NVIDIA launched it on June 1, 2026 at GTC Taipei and calls it the first fully open omnimodel with native vision reasoning.

Under the hood it is a Mixture-of-Transformers that pairs a reasoning transformer with an expert generation transformer. One half thinks about object interactions, motion, and space. The other half generates video and action trajectories.

The whole point is to let a robot reason before it acts, which cuts physical AI training cycles from months down to days. The NVIDIA technical blog goes deep if you want the full picture.

So why use a robotics world model to run a debate? Because that reasoning transformer is the interesting part. NVIDIA trained it to reason about the physical world, and I wanted to see how well that transfers to something it was never sold for: a pure-language argument, no images, no video, just ideas. That transfer tells you far more than a single-prompt score ever will.

Cosmos 3 specs at a glance

Here are the details that matter for this project.

| Property | Cosmos 3 Super | Cosmos 3 Nano |
| --- | --- | --- |
| Total parameters | 64B | 16B |
| Split | 32B reasoner + 32B generator | 8B reasoner + 8B generator |
| Architecture | Mixture-of-Transformers (reasoning + generation) | Mixture-of-Transformers |
| Built for | Post-training robotics and AV models at the highest physics accuracy | Fast video and action reasoning in a fraction of a second |
| License | OpenMDW (open for commercial and non-commercial use) | OpenMDW |
| Released | June 1, 2026 (GTC Taipei) | June 1, 2026 |

A few more things worth knowing:

  • It is genuinely omnimodal. It takes in and generates text, images, video, ambient sound, and action sequences, all in one model.
  • It was trained on one of the largest multimodal physical AI datasets out there, billions of samples across text, image, video, sound, and action.
  • It shipped with a Cosmos Coalition of robotics and AI labs (Agile Robots, Black Forest Labs, Runway, Skild AI, and others) building on top of it.
  • NVIDIA is upfront about the limits: generation can drift over time, and the reasoning can still hallucinate, since there is no physics simulator actually running in the loop.

For our build, the parts that matter are the reasoner tower and how well the model holds a role. The debate leans hard on both.

Where it runs: Nebius Token Factory


A 64B omni-model is not something I want to babysit on my own GPUs. So every model call in this project goes through Nebius Token Factory.

Token Factory is Nebius's production inference platform. It takes open and partner models, including NVIDIA's, and serves them behind one fast, OpenAI-compatible API, with the post-training and governance pieces handled for you.

NVIDIA models like Nemotron already run there, and Nebius has been building cloud infrastructure with NVIDIA specifically for robotics and physical AI. That makes it a natural home for Cosmos.

Why it fits this project so well:

  • OpenAI-compatible API. Anything built for OpenAI works with a base-URL swap. The base URL is https://api.tokenfactory.nebius.com/v1/.
  • Drop-in LangChain support. The langchain-nebius package gives you a ChatNebius model that slots straight into LangGraph.
  • No GPUs to provision. You point at a model name and pay per token. That is the whole setup.
  • One key for every seat. Every model-backed council member shares a single NEBIUS_API_KEY, so there are no per-agent credentials to juggle.

Grab a key from the Nebius Token Factory console and you are ready.
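Because the API is OpenAI-compatible, a chat call is just a JSON POST. Here is a minimal sketch that builds such a request with the standard library without sending it. It assumes the conventional OpenAI `chat/completions` path under the base URL and Bearer-token auth; the helper name `build_chat_request` and the `test-key` fallback are hypothetical, and the model name is the one used in this post's .env.

```python
import json
import os
import urllib.request

BASE_URL = "https://api.tokenfactory.nebius.com/v1/"
MODEL = "nvidia/Cosmos3-Super-Reasoner"  # model name taken from this post's .env


def build_chat_request(prompt: str, api_key: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request without sending it."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        # Assumption: the standard OpenAI path, chat/completions, under the base URL.
        url=BASE_URL + "chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# "test-key" is a placeholder so the sketch runs without real credentials.
request = build_chat_request("Say hello.", os.getenv("NEBIUS_API_KEY", "test-key"))
# To actually send it: urllib.request.urlopen(request), then json-decode the body.
```

In the real app none of this is written by hand, since ChatNebius wraps the same kind of call.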

What we are building

The council has five roles. Four of them are model calls. The fifth, the Moderator, is the graph itself, and that turns out to be the decision that makes everything work.

| Role | Node | Job |
| --- | --- | --- |
| The Advocate | proponent | Argues for the motion, rebuts the Skeptic each round |
| The Skeptic | opponent | Argues against the motion, rebuts the Advocate each round |
| The Pragmatist | pragmatist | Independent member who stress-tests both sides (optional) |
| The Arbiter | judge | Scores logic, evidence, and rebuttal, then gives a verdict |
| The Moderator | the graph | Routes turns, threads the transcript, decides when to stop |

The flow goes from opening statements, through alternating rebuttal rounds, to an optional reality check, and finally a scored verdict:

        +--------------+      +-------------+
START ->|  proponent   | ---> |  opponent   | --> (more rounds?)
        +--------------+      +-------------+         |
              ^  more rounds: increment_round         | no
              +---------------------------------------+
                                                      v
                                  (pragmatist?) -> judge -> END

Why a graph and not one clever prompt

You could try the lazy version: one prompt that says "argue both sides of X, then judge it." It reads fine and proves nothing.

The problem is that the model writes the "for" case and the "against" case in a single breath. They do not respond to each other, because they were written together. There is no exchange, just a model acting out the idea of a debate.

A real debate needs structure that a prompt cannot promise:

  • Each side gets its own turn, so each argument is a focused generation
  • A rebuttal sees what the other side actually just said
  • Rounds stack up, so later the Advocate answers a real objection instead of repeating its opener
  • The judge reads the full transcript and scores it on fixed criteria

That is a state machine, not a prompt, which is exactly what LangGraph is for. Putting the structure in code means each role gets its own isolated call, rebuttals genuinely see the prior turn, and the round count is enforced rather than left to the model's mood.
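To make the contrast concrete, here is a toy version of the turn-taking loop with no LLM and no LangGraph at all. It is only a sketch: `run_exchange` and `echo_model` are hypothetical names, and the stub "model" simply reports which rival text it was shown, so you can check that each turn really sees the previous one.

```python
from typing import Callable

# A "model" here is any function from (role, prompt) to text.
Model = Callable[[str, str], str]


def run_exchange(model: Model, motion: str, rounds: int) -> list[tuple[str, str]]:
    """Run isolated, alternating turns; each turn sees the rival's last turn."""
    transcript: list[tuple[str, str]] = []
    last = {"advocate": "", "skeptic": ""}
    for _ in range(rounds):  # the round count is enforced here, not by the model
        for role, rival in (("advocate", "skeptic"), ("skeptic", "advocate")):
            prompt = f"Motion: {motion}\nRival said: {last[rival] or '(nothing yet)'}"
            last[role] = model(role, prompt)  # one focused call per turn
            transcript.append((role, last[role]))
    return transcript


def echo_model(role: str, prompt: str) -> str:
    """Stub standing in for the LLM: reports which rival text it was shown."""
    rival_line = prompt.splitlines()[-1]
    return f"{role} answering [{rival_line}]"


turns = run_exchange(echo_model, "AGI before 2035", rounds=2)
```

With two rounds the loop produces four turns, and every turn after the first embeds the text of the turn before it. A single "argue both sides" prompt cannot guarantee that.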

Prerequisites

The dependency list is short on purpose:

dependencies = [
    "langgraph>=1.0",
    "langchain-nebius>=0.1.3",
    "langchain-core>=0.3",
    "python-dotenv>=1.1.1",
    "streamlit>=1.47.0",
]

Project layout:

cosmos_arena_debate_council/
  app.py              # Streamlit UI and live debate streaming
  cosmos_council.py   # LangGraph debate graph: nodes, routing, model
  pyproject.toml      # Dependencies
  .env.example        # Environment variable template
  assets/             # NVIDIA and Nebius logos

Get set up:

git clone https://github.com/Arindam200/awesome-ai-apps.git
cd awesome-ai-apps/advance_ai_agents/cosmos_arena_debate_council
uv sync
cp .env.example .env   # then add your NEBIUS_API_KEY

Your .env:

NEBIUS_API_KEY=your_api_key_here
# Optional overrides
COSMOS_MODEL=nvidia/Cosmos3-Super-Reasoner
NEBIUS_BASE_URL=https://api.tokenfactory.nebius.com/v1/
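A missing or placeholder key is the most likely first-run failure, so it can help to check the settings before any model call. The sketch below is not from the repo: `load_settings` is a hypothetical helper that resolves the three variables above with the same defaults and fails early on an unset key.

```python
import os
from typing import Mapping, Optional

DEFAULT_BASE_URL = "https://api.tokenfactory.nebius.com/v1/"
DEFAULT_MODEL = "nvidia/Cosmos3-Super-Reasoner"


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Resolve settings from the environment, failing early if the key is missing."""
    # In the app, python-dotenv's load_dotenv() would populate os.environ first.
    env = os.environ if env is None else env
    api_key = (env.get("NEBIUS_API_KEY") or "").strip()
    if not api_key or api_key == "your_api_key_here":
        raise RuntimeError("Set NEBIUS_API_KEY in .env before starting the app.")
    return {
        "api_key": api_key,
        # Optional overrides fall back to the defaults when unset or empty.
        "model": env.get("COSMOS_MODEL") or DEFAULT_MODEL,
        "base_url": env.get("NEBIUS_BASE_URL") or DEFAULT_BASE_URL,
    }
```

Passing a plain dict instead of `os.environ` makes the helper easy to test without touching your real environment.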

Step 1: Wire up Cosmos 3 and handle the reasoning channel

This is the one gotcha that will trip you up, so it goes first.

The stock ChatNebius integration reads the answer from message.content, the usual OpenAI shape. But Cosmos, served as a reasoner, often puts its answer in a non-standard reasoning field and leaves content empty. Use the integration as-is and every council member comes back blank.

The fix is a small subclass that folds the reasoning field back in. If content is empty, the reasoning is the answer. If both are there, the reasoning becomes separate chain-of-thought for the UI to show.

from langchain_core.outputs import ChatResult
from langchain_nebius import ChatNebius

DEFAULT_BASE_URL = "https://api.tokenfactory.nebius.com/v1/"
DEFAULT_MODEL = "nvidia/Cosmos3-Super-Reasoner"


class CosmosChatNebius(ChatNebius):
    """ChatNebius that surfaces the non-standard `reasoning` field."""

    def _create_chat_result(self, response, generation_info=None) -> ChatResult:
        result = super()._create_chat_result(response, generation_info)
        response_dict = response if isinstance(response, dict) else response.model_dump()
        for gen, choice in zip(result.generations, response_dict.get("choices") or []):
            reasoning = (choice.get("message") or {}).get("reasoning")
            if not reasoning:
                continue
            message = gen.message
            if (message.content or "").strip():
                # Real answer present, keep reasoning as separate chain-of-thought.
                message.additional_kwargs.setdefault("reasoning_content", reasoning)
            else:
                # Empty content, so the reasoning is the answer.
                message.content = reasoning
        return result
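The folding rule is easier to see without the LangChain plumbing. Here is the same decision as a plain function over an OpenAI-style message dict. It is a simplified sketch of the logic only, and `fold_reasoning` is a hypothetical name, not part of langchain-nebius.

```python
def fold_reasoning(message: dict) -> dict:
    """Return content and reasoning_content for one OpenAI-style message dict."""
    content = (message.get("content") or "").strip()
    reasoning = (message.get("reasoning") or "").strip()
    if content:
        # Real answer present: keep reasoning as separate chain-of-thought.
        return {"content": content, "reasoning_content": reasoning}
    # Empty content: promote the reasoning to be the answer.
    return {"content": reasoning, "reasoning_content": ""}


both = fold_reasoning({"content": "The motion fails.", "reasoning": "Weigh both sides."})
only_reasoning = fold_reasoning({"content": "", "reasoning": "The motion holds."})
neither = fold_reasoning({"content": None})
```

The second case is the one that otherwise shows up as a blank council member.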

Cosmos can also do it the other way, wrapping its reasoning in <think> tags inside the content. A small splitter pulls the two apart so the UI never mixes thinking into the actual argument:

import re

_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL | re.IGNORECASE)


def split_reasoning(text: str) -> tuple[str, str]:
    """Separate the visible answer from the model's <think> reasoning."""
    text = text or ""
    reasoning_parts = [m.strip() for m in _THINK_RE.findall(text)]
    clean = _THINK_RE.sub("", text).strip()
    if "<think>" in clean.lower():  # unclosed reasoning block
        idx = clean.lower().index("<think>")
        reasoning_parts.append(clean[idx + len("<think>"):].strip())
        clean = clean[:idx].strip()
    return clean.strip(), "\n\n".join(p for p in reasoning_parts if p).strip()
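To see the splitter's three paths, here is a small self-check. It repeats the function so the block runs on its own; the sample strings are made up for illustration.

```python
import re

_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL | re.IGNORECASE)


def split_reasoning(text: str) -> tuple[str, str]:
    """Separate the visible answer from the model's <think> reasoning."""
    text = text or ""
    reasoning_parts = [m.strip() for m in _THINK_RE.findall(text)]
    clean = _THINK_RE.sub("", text).strip()
    if "<think>" in clean.lower():  # unclosed reasoning block
        idx = clean.lower().index("<think>")
        reasoning_parts.append(clean[idx + len("<think>"):].strip())
        clean = clean[:idx].strip()
    return clean.strip(), "\n\n".join(p for p in reasoning_parts if p).strip()


# 1. A closed block: reasoning is lifted out, the answer is left clean.
closed = split_reasoning("<think>plan the rebuttal</think>The motion stands.")
# 2. An unclosed block: everything after <think> is treated as reasoning.
unclosed = split_reasoning("Opening line. <think>still thinking")
# 3. No tags (or no text at all): the answer passes through, reasoning is empty.
plain = split_reasoning("Just an answer.")
empty = split_reasoning(None)
```

The unclosed case matters in practice because a reply cut off mid-thought never emits the closing tag.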

A shared factory ties it together. Every seat calls this same function, so the model never changes, only the prompt does:

import os


def build_model(api_key=None, model=None, base_url=None, temperature=0.6) -> ChatNebius:
    """Create the shared Cosmos reasoner backed by Nebius Token Factory."""
    return CosmosChatNebius(
        model=model or os.getenv("COSMOS_MODEL", DEFAULT_MODEL),
        api_key=api_key or os.getenv("NEBIUS_API_KEY"),
        base_url=base_url or os.getenv("NEBIUS_BASE_URL") or DEFAULT_BASE_URL,
        temperature=temperature,
    )

Step 2: Give each seat a persona

The roles are just system prompts, but they are written to force genuinely different behavior. The Advocate and Skeptic are told to rebut the other side point by point before adding anything new. The Pragmatist takes no side. The Arbiter has to produce a fixed scorecard. Here are two of them:

PROPONENT_PROMPT = (
    "You are The Advocate, a council member in the Cosmos Arena debate.\n"
    "Your role: argue persuasively and rigorously IN FAVOR of the motion.\n\n"
    "Guidelines:\n"
    "- Make the strongest honest case for the motion.\n"
    "- Ground claims in reasoning, evidence, and concrete examples.\n"
    "- If you are given the opposition's prior argument, directly REBUT it "
    "point by point before adding new arguments.\n"
    "- Be sharp and confident, but never fabricate facts.\n"
    "- Keep it focused: 3-5 tight paragraphs in markdown. End with your "
    "single strongest line.\n"
    "- Output ONLY your argument. Do not narrate your process."
)

JUDGE_PROMPT = (
    "You are The Arbiter, the impartial judge of the Cosmos Arena debate.\n"
    "You will be given the full debate transcript.\n\n"
    "Deliver your verdict as markdown with EXACTLY these sections:\n\n"
    "### Scorecard\n"
    "A markdown table scoring each side (Proponent, Opponent) from 0-10 on "
    "**Logic**, **Evidence**, and **Rebuttal**, with a **Total** column.\n\n"
    "### Verdict\n"
    "State the winner (or an honest draw) in one bold sentence, then 2-3 "
    "sentences justifying it based strictly on the arguments made.\n\n"
    "### Strongest Argument\n"
    "Quote or paraphrase the single most decisive point.\n\n"
    "### What Would Change the Outcome\n"
    "One short paragraph on the evidence or reasoning that would flip the result."
)

This is where you find out if Cosmos 3 can really be four different people. Weaker models leak. The Skeptic starts agreeing, or the Arbiter picks a winner before it reads anything. A strong reasoner keeps the seats clean all the way through.

Step 3: Model the debate as state

The whole debate lives in one typed state object. The trick is transcript, an append-only list. Each node returns only its single new turn, and LangGraph's reducer, operator.add, tacks it onto the running transcript. That means each streamed update is exactly one new turn, which is perfect for a live UI.

import operator
from typing import Annotated, TypedDict


class Turn(TypedDict):
    speaker: str   # proponent | opponent | pragmatist | judge
    round: int     # debate round (0 for judge and pragmatist)
    text: str      # the visible argument
    reasoning: str # the model's chain-of-thought, if any


class DebateState(TypedDict):
    motion: str
    current_round: int
    # operator.add makes each node APPEND its turn to the transcript.
    transcript: Annotated[list[Turn], operator.add]
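If the reducer idea is new, this plain-Python imitation may help. It is not LangGraph's implementation, just a sketch of the merge rule: keys annotated with a reducer are combined with it, and every other key is overwritten. `apply_update` is a hypothetical helper, and turns are plain dicts here to keep the block short.

```python
import operator
from typing import Annotated, TypedDict, get_type_hints


class DebateState(TypedDict):
    motion: str
    current_round: int
    transcript: Annotated[list[dict], operator.add]


def apply_update(state: dict, update: dict) -> dict:
    """Merge a node's partial update the way a reducer-aware state would."""
    hints = get_type_hints(DebateState, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        reducers = getattr(hints[key], "__metadata__", ())
        if reducers:
            merged[key] = reducers[0](state[key], value)  # old_list + new_list
        else:
            merged[key] = value  # plain keys are simply overwritten
    return merged


state = {"motion": "AGI before 2035", "current_round": 1, "transcript": []}
state = apply_update(state, {"transcript": [{"speaker": "proponent", "text": "Opening."}]})
state = apply_update(state, {"transcript": [{"speaker": "opponent", "text": "Rebuttal."}]})
state = apply_update(state, {"current_round": 2})
```

Each node hands back a one-element list, and the transcript grows by exactly one turn per update.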

One small helper grabs the other side's most recent turn. This is the piece that makes rebuttals real instead of generic:

def _latest(transcript: list[Turn], speaker: str) -> str:
    for turn in reversed(transcript):
        if turn["speaker"] == speaker:
            return turn["text"]
    return ""

Step 4: Write the nodes

Each council member is a node: one model call with a role prompt and a user prompt built from the live transcript. Watch how the proponent's prompt changes between the opening round and later rounds. From round two on, it gets handed the Skeptic's latest argument and told to rebut it point by point. That is the whole difference between a debate and two people talking past each other.

def _say(model, system_prompt: str, user_prompt: str) -> tuple[str, str]:
    response = model.invoke([
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ])
    answer, inline_reasoning = split_reasoning(response.content or "")
    reasoning = response.additional_kwargs.get("reasoning_content") or inline_reasoning
    return answer, reasoning


def proponent_node(state: DebateState) -> dict:
    rnd = state["current_round"]
    if rnd == 1:
        user = (
            f"The motion before the council:\n\n> {state['motion']}\n\n"
            "Deliver your OPENING case in favor of the motion."
        )
    else:
        user = (
            f"The motion before the council:\n\n> {state['motion']}\n\n"
            "The Skeptic's most recent argument was:\n\n"
            f"{_latest(state['transcript'], 'opponent')}\n\n"
            f"This is round {rnd}. Rebut the Skeptic point by point, then "
            "press your strongest new arguments for the motion."
        )
    text, reasoning = _say(model, PROPONENT_PROMPT, user)
    return {"transcript": [Turn(speaker="proponent", round=rnd, text=text, reasoning=reasoning)]}

The judge node works the same way but reads the entire transcript and produces the scorecard. Since the transcript is already clean and structured, there is no fragile report parsing. Clean text in, verdict out.
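The judge's input is not shown above, so here is one way the transcript could be rendered for it. Treat this as a sketch under assumptions: `format_transcript`, `judge_user_prompt`, and the exact wording of the prompt are hypothetical, and the repo may format things differently. The speaker labels come from the role table.

```python
SPEAKER_LABELS = {  # display names from the role table
    "proponent": "The Advocate",
    "opponent": "The Skeptic",
    "pragmatist": "The Pragmatist",
}


def format_transcript(transcript: list[dict]) -> str:
    """Render turns as labelled markdown sections, leaving out hidden reasoning."""
    blocks = []
    for turn in transcript:
        label = SPEAKER_LABELS.get(turn["speaker"], turn["speaker"])
        # Round 0 marks turns outside the rounds, so no round suffix is shown.
        suffix = f" (round {turn['round']})" if turn["round"] else ""
        blocks.append(f"## {label}{suffix}\n\n{turn['text']}")
    return "\n\n---\n\n".join(blocks)


def judge_user_prompt(motion: str, transcript: list[dict]) -> str:
    """Build the user message handed to the Arbiter (hypothetical wording)."""
    return (
        f"The motion before the council:\n\n> {motion}\n\n"
        f"The full debate transcript:\n\n{format_transcript(transcript)}\n\n"
        "Deliver your verdict."
    )


sample = [
    {"speaker": "proponent", "round": 1, "text": "Scaling holds.", "reasoning": "..."},
    {"speaker": "opponent", "round": 1, "text": "A trend is not a mechanism.", "reasoning": ""},
    {"speaker": "pragmatist", "round": 0, "text": "Define the threshold.", "reasoning": ""},
]
prompt = judge_user_prompt("AGI will arrive before 2035", sample)
```

Only the visible `text` of each turn is passed along, so the Arbiter judges what was argued rather than what was privately reasoned.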

Step 5: Let the graph moderate

Here is the payoff. The Moderator is not a model deciding what comes next; it is plain deterministic routing. The graph runs START to proponent to opponent, then a conditional edge decides whether to run another round or wrap up. If rounds remain, bump the counter and loop back. If not, run the optional Pragmatist, then the Arbiter, then end.

from langgraph.graph import END, START, StateGraph


def build_debate_graph(model, rounds: int = 2, use_pragmatist: bool = True):
    # ... node definitions (closures over `model`):
    #     proponent, opponent, increment_round, pragmatist, judge ...

    def route_after_opponent(state: DebateState) -> str:
        if state["current_round"] < rounds:
            return "increment_round"
        return "pragmatist" if use_pragmatist else "judge"

    builder = StateGraph(DebateState)
    builder.add_node("proponent", proponent_node)
    builder.add_node("opponent", opponent_node)
    builder.add_node("increment_round", increment_round_node)
    builder.add_node("judge", judge_node)
    if use_pragmatist:
        builder.add_node("pragmatist", pragmatist_node)

    builder.add_edge(START, "proponent")
    builder.add_edge("proponent", "opponent")
    builder.add_conditional_edges(
        "opponent",
        route_after_opponent,
        ["increment_round", "pragmatist", "judge"]
        if use_pragmatist
        else ["increment_round", "judge"],
    )
    builder.add_edge("increment_round", "proponent")
    if use_pragmatist:
        builder.add_edge("pragmatist", "judge")
    builder.add_edge("judge", END)

    return builder.compile()
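Since the routing is deterministic, you can replay it without a model or even LangGraph. The sketch below traces which nodes run for a given configuration. The bodies of `initial_state` and `increment_round_node` are assumptions (the post does not show them), and `trace_route` is a hypothetical helper that mirrors the test in `route_after_opponent`.

```python
def initial_state(motion: str) -> dict:
    """Assumed starting state: round 1 with an empty transcript."""
    return {"motion": motion, "current_round": 1, "transcript": []}


def increment_round_node(state: dict) -> dict:
    """Assumed bookkeeping node: bump the counter, add no transcript turn."""
    return {"current_round": state["current_round"] + 1}


def trace_route(rounds: int = 2, use_pragmatist: bool = True) -> list[str]:
    """Replay the graph's routing and return the node names in visit order."""
    state = initial_state("any motion")
    visited: list[str] = []
    while True:
        visited += ["proponent", "opponent"]
        if state["current_round"] < rounds:  # same test as route_after_opponent
            visited.append("increment_round")
            state.update(increment_round_node(state))
            continue
        break
    if use_pragmatist:
        visited.append("pragmatist")
    visited.append("judge")
    return visited


default_route = trace_route(rounds=2, use_pragmatist=True)
short_route = trace_route(rounds=1, use_pragmatist=False)
```

A trace like this is a cheap way to confirm the loop terminates before spending tokens on a real run.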

Step 6: Stream it live in Streamlit

The UI streams with stream_mode="updates", so each member's argument shows up the moment its node finishes. Each turn lands in a color-coded card labeled with the correct round, with a collapsible panel that exposes Cosmos 3's chain-of-thought.

for update in graph.stream(
    initial_state(motion), config={"recursion_limit": 60}, stream_mode="updates"
):
    delta = next(iter(update.values()))
    new_turns = delta.get("transcript") if isinstance(delta, dict) else None
    if not new_turns:          # skip the increment_round bookkeeping step
        continue
    turn = new_turns[-1]
    card.complete(turn["text"], turn["reasoning"], live=True)
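It helps to know the shape of what the loop receives. In "updates" mode each streamed item is a dict keyed by the node that just ran, holding only what that node returned. The sketch below filters hand-written stand-ins for those items; `new_turns` and the sample data are hypothetical, and no graph is actually run.

```python
from typing import Iterable, Iterator


def new_turns(updates: Iterable[dict]) -> Iterator[tuple[str, dict]]:
    """Yield (node_name, turn) for each update that carries a transcript delta."""
    for update in updates:
        for node_name, delta in update.items():
            turns = delta.get("transcript") if isinstance(delta, dict) else None
            if not turns:  # e.g. increment_round only changes current_round
                continue
            yield node_name, turns[-1]


# Hand-written stand-ins for what graph.stream(..., stream_mode="updates") yields.
fake_updates = [
    {"proponent": {"transcript": [{"speaker": "proponent", "round": 1, "text": "Opening."}]}},
    {"opponent": {"transcript": [{"speaker": "opponent", "round": 1, "text": "Reply."}]}},
    {"increment_round": {"current_round": 2}},
    {"proponent": {"transcript": [{"speaker": "proponent", "round": 2, "text": "Rebuttal."}]}},
]
seen = [(name, turn["round"], turn["text"]) for name, turn in new_turns(fake_updates)]
```

The bookkeeping update is dropped, so the UI renders exactly one card per spoken turn.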

Run it:

uv run streamlit run app.py

Open http://localhost:8501, pick your number of rounds (1 to 4) and whether to seat the Pragmatist, type in a motion, and hit "Convene the Council."

What it looks like in action

Let me run the motion "This house believes AGI will arrive before 2035."

The Advocate opens strong on compute scaling curves, efficiency gains, and the money pouring into the field. Confident, concrete, ends on a sharp line.

The Skeptic does not blink. A trend line is not a mechanism, benchmark progress is not general capability, and "before 2035" is a specific claim that needs a specific argument the other side has not made.

Round two is where it gets good. The Advocate opens by quoting the Skeptic's "no mechanism" point and answering it head-on before pressing forward. The rebuttal threading is working. This is now an actual exchange.

The Pragmatist steps outside the fight and calls it: both sides are arguing definitions. What would really settle it is a measurable capability threshold tied to a date. Name it, or you are just debating vibes.

Then the Arbiter closes with a scorecard:

| Side | Logic | Evidence | Rebuttal | Total |
| --- | --- | --- | --- | --- |
| Proponent | 7 | 6 | 8 | 21 |
| Opponent | 8 | 7 | 7 | 22 |

It gives the win to the Skeptic, narrowly. The Advocate argued well and rebutted directly, but leaned on extrapolation where the Skeptic demanded a mechanism, and that hard 2035 deadline raised a bar the Advocate never quite cleared.

The fun part is not who won. It is that the round-two rebuttal genuinely engaged the round-one objection. That only happens if the model can hold a position, absorb a counter, and adjust, which is exactly the reasoning I was trying to see.

What this tells you about Cosmos 3

Running a few motions through the arena surfaces things a benchmark never will.

  • Role discipline. Does the Skeptic stay skeptical for four straight rounds, or quietly start agreeing? Cosmos held its seats.
  • Rebuttal quality. Do later rounds answer the specific prior point, or just restate the opener with new adjectives? This is the clearest signal of real reasoning, and you can see it live.
  • Judgment calibration. Does the Arbiter's verdict actually follow from the transcript, or does it pick a side and backfill? Read the scorecard against what was said and you will know fast.
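One cheap calibration check can be automated: do the Arbiter's totals even add up? The sketch below parses a markdown pipe table with the columns the judge prompt asks for and flags rows where Total is not the sum of the three criteria. Both helper names are hypothetical, and it assumes the model returns bare integers in a pipe table, which a real run may not always do.

```python
def parse_scorecard(markdown: str) -> dict[str, dict[str, int]]:
    """Parse a pipe table with Side, Logic, Evidence, Rebuttal, Total columns."""
    rows = [
        [cell.strip(" *") for cell in line.strip().strip("|").split("|")]
        for line in markdown.splitlines()
        if line.strip().startswith("|")
    ]
    header, body = rows[0], rows[1:]
    scores: dict[str, dict[str, int]] = {}
    for row in body:
        if set("".join(row)) <= set("-: "):  # the |---|---| separator row
            continue
        scores[row[0]] = {col: int(val) for col, val in zip(header[1:], row[1:])}
    return scores


def total_mismatches(scores: dict[str, dict[str, int]]) -> list[str]:
    """Return the sides whose Total is not Logic + Evidence + Rebuttal."""
    return [
        side
        for side, s in scores.items()
        if s["Total"] != s["Logic"] + s["Evidence"] + s["Rebuttal"]
    ]


# The scorecard from the example run earlier in this post.
SCORECARD = """
| Side | **Logic** | **Evidence** | **Rebuttal** | **Total** |
| --- | --- | --- | --- | --- |
| Proponent | 7 | 6 | 8 | 21 |
| Opponent | 8 | 7 | 7 | 22 |
"""
scores = parse_scorecard(SCORECARD)
bad_rows = total_mismatches(scores)
```

Arithmetic that checks out does not prove the verdict is fair, but a total that does not add up is a quick sign the Arbiter was not reading carefully.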

A debate is really a reasoning stress test in disguise: adversarial, multi-turn, and either self-consistent across rounds or not. For a model whose reasoner was trained mostly on physical and spatial problems, watching it carry that reasoning into abstract language debate is a genuinely interesting result.

The trade-offs, honestly

This is not free. A two-round debate with the Pragmatist is six full reasoning-model calls: the Advocate twice, the Skeptic twice, the Pragmatist, and the Arbiter. Reasoning models also emit a lot of thinking tokens. More rounds means more cost and more waiting.
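The arithmetic generalizes: each round costs two calls, plus one for the Arbiter and one more if the Pragmatist is seated. Here is a small sketch of that count, along with the number of graph steps a run takes, assuming one step per node visit and one bookkeeping step between rounds. Both function names are hypothetical.

```python
def model_calls(rounds: int, use_pragmatist: bool = True) -> int:
    """Reasoning-model calls: two per round, the Arbiter, and the optional Pragmatist."""
    return 2 * rounds + int(use_pragmatist) + 1


def graph_steps(rounds: int, use_pragmatist: bool = True) -> int:
    """Node visits: every model call plus one increment_round between rounds."""
    return model_calls(rounds, use_pragmatist) + (rounds - 1)


two_round_calls = model_calls(2, use_pragmatist=True)   # the case described above
max_steps = graph_steps(4, use_pragmatist=True)         # the largest UI setting
```

Under these assumptions even the four-round maximum stays well below the recursion_limit of 60 set in the streaming loop, so that limit acts as a safety net rather than a constraint.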

For this use case it is worth it, because the structure is the product and watching the reasoning unfold is the whole point. For a plain question-and-answer task, it would be massive overkill. Match the architecture to the job.

Wrapping up

You now have a working multi-agent debate council. It models a structured debate as an explicit LangGraph state machine instead of a prompt, threads real rebuttals through a shared append-only transcript, runs every seat on NVIDIA Cosmos 3 while surfacing its chain-of-thought, and serves the whole thing through one Nebius Token Factory key over an OpenAI-compatible API.

From here the graph makes it easy to keep going. Add seats like a Historian or a Domain Expert. Let the Arbiter call a tie-breaker round. Wire in retrieval so arguments cite real sources. Or run a tournament of motions and chart which side the model tends to favor. Each one is just a few more nodes and edges.

Want to try it? Clone the repo, grab a Nebius Token Factory key, and convene your own council. Pick a motion you genuinely cannot call, and see how Cosmos 3 reasons its way through it.


Built by Arindam Majumder. Part of the awesome-ai-apps collection, powered by LangGraph, NVIDIA Cosmos 3, and Nebius Token Factory.
