Arian Mokhtariha

Posted on Jun 24

Stop Pasting Raw CSVs Into ChatGPT: A Data Scientist's Guide to LLM Context Engineering

#ai #machinelearning #database #data

Your LLM doesn't need 50,000 rows. It needs the right 15.

There's a mistake I see data scientists make constantly when they first start using LLMs for analysis.

They paste their entire CSV into the prompt.

I get the instinct. It feels rigorous. The model should have everything, right? But that instinct is exactly backwards — and it's silently degrading your results in ways that are easy to miss.

Let me show you why, and walk through a different approach.

Why Raw Data Destroys LLM Performance

Large language models have two constraints that matter deeply for data work.

Context windows are finite. A 100-row CSV with 20 columns? Probably fine. A 10,000-row CSV with the same 20 columns? That can easily exceed a million characters, which works out to hundreds of thousands of tokens. You've burned most or all of your context window on one file before you've even written your question.

More tokens ≠ better answers. This is the counterintuitive part. LLM attention degrades with noisy, repetitive input. Row 8,437 of your sales data looks structurally identical to row 4,291. The model doesn't need both — it needs to understand the pattern, not memorize every instance.
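To put rough numbers on the context-window constraint, here is a back-of-the-envelope sketch. The 4-characters-per-token ratio and the 8-character average cell width are assumptions for illustration, not measurements; real tokenizers vary, and numeric data often tokenizes less efficiently than prose.

```python
# Rough size estimate for pasting a CSV into a prompt.
# Assumption: about 4 characters per token (a common rule of thumb).
CHARS_PER_TOKEN = 4  # heuristic, not a real tokenizer


def estimate_csv_tokens(rows: int, cols: int, avg_cell_chars: int = 8) -> int:
    """Estimate prompt tokens for a CSV of the given shape."""
    # Each cell is followed by one separator (a comma or a newline).
    total_chars = rows * cols * (avg_cell_chars + 1)
    return total_chars // CHARS_PER_TOKEN


small = estimate_csv_tokens(rows=100, cols=20)
large = estimate_csv_tokens(rows=10_000, cols=20)
print(f"100 rows:    ~{small:,} tokens")
print(f"10,000 rows: ~{large:,} tokens")
```

Under these assumptions the 100-row file costs a few thousand tokens, while the 10,000-row file costs a hundred times that.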

Dumping raw data into a prompt is the equivalent of handing someone a 500-page report and asking them to summarize it verbally on the spot. They'll struggle, and the important details will get lost in the noise.

What LLMs Actually Need From Your Data

For data analysis tasks, a well-structured LLM context needs four things — and only four things:

Schema — column names, data types
A representative sample — enough rows to understand patterns and edge cases (15–50 is usually enough)
Statistics computed on the full dataset — missing value counts, value distributions, describe() output
Structure — how files relate to each other in your project

Notice what's not on that list: all 50,000 rows.

Here's the key insight: if your statistics are computed on the full dataset, you don't need the full dataset in the prompt. The model knows the mean, the standard deviation, the missing value rate, the quartiles — all from the full 50K rows — without seeing any of them directly.

The sample is there to show the model what the data looks like. The statistics tell it the truth about what the data actually is.
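A minimal sketch of that split, with the sample drawn separately from statistics that cover every row. It uses only the standard library so it runs anywhere; with pandas, `df.sample()`, `df.isna().sum()`, and `df.describe()` would be the usual route. The inline data and the helper names `to_floats` and `profile` are made up for illustration.

```python
import csv
import io
import random
import statistics

# Tiny inline stand-in for a real CSV file (values are made up).
RAW = """order_id,region,sales,discount
A1,West,100.0,0.2
A2,East,250.0,
A3,West,50.0,0.0
A4,South,400.0,0.1
"""


def to_floats(values):
    """Return the values as floats, or None if any value is non-numeric."""
    try:
        return [float(v) for v in values]
    except ValueError:
        return None


def profile(rows, sample_size=2, seed=0):
    """Return a small random sample plus statistics over ALL rows."""
    columns = list(rows[0].keys())
    stats = {}
    for col in columns:
        present = [r[col] for r in rows if r[col] != ""]
        nums = to_floats(present) if present else None
        # Columns that are empty or fail to parse as numbers count as text.
        col_stats = {
            "type": "numeric" if nums else "text",
            "missing": len(rows) - len(present),
        }
        if nums:
            col_stats.update(min=min(nums), max=max(nums), mean=statistics.mean(nums))
        stats[col] = col_stats
    # The sample only illustrates what rows look like; stats used every row.
    sample = random.Random(seed).sample(rows, min(sample_size, len(rows)))
    return {"shape": (len(rows), len(columns)), "sample": sample, "stats": stats}


rows = list(csv.DictReader(io.StringIO(RAW)))
result = profile(rows)
print(result["shape"])
print(result["stats"]["discount"])
```

The sample size can shrink or grow without changing any of the reported statistics, which is the point: the two are computed independently.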

What This Looks Like in Practice

Instead of pasting your entire CSV, you want something like this in your prompt:

## File: sales_data.csv
[Shape: 52,341 rows × 18 columns | Sampled: 15 random rows]

| order_id | order_date | region | category | sales | discount | profit |
|----------|------------|--------|----------|-------|----------|--------|
| CA-2021-... | 2021-03-12 | West | Technology | 1249.00 | 0.20 | 312.25 |
| ... [14 more rows] ...

### Dataset Statistics (full dataset: 52,341 rows)
Columns: order_id (object), order_date (datetime64), region (object), ...
Missing values: discount: 12.3% (6,438), postal_code: 0.1% (52)
Numerical summary:
  sales: min=0.44  mean=229.86  max=22638.48  std=623.25
  profit: min=-6599.98  mean=28.66  max=8399.98  std=234.26
  discount: min=0.0  mean=0.15  max=0.80

That's one file in your context. Now imagine a full project: 3 CSVs, 2 SQL dumps, a Jupyter analysis notebook, an Excel summary. Each one needs this treatment.

And they all need to be assembled into a single coherent context file that fits in one prompt.
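As a rough sketch of the assembly step, the function below joins per-file summaries into one document with a file index up front. The function name, the section layout, and the `analysis.ipynb` entry are assumptions for illustration, not data2prompt's actual output format.

```python
def assemble_context(sections: dict[str, str], title: str = "Project Context") -> str:
    """Join per-file summaries into one document, with a file index up front."""
    names = sorted(sections)
    parts = [f"# {title}", "", "## Files"]
    parts += [f"- {name}" for name in names]
    for name in names:
        parts += ["", f"## File: {name}", sections[name].strip()]
    return "\n".join(parts) + "\n"


# Hypothetical per-file summaries; real ones would hold sample rows and stats.
sections = {
    "sales_data.csv": "[Shape: 52,341 rows × 18 columns | Sampled: 15 random rows]",
    "analysis.ipynb": "[code cells and text outputs only]",
}
context = assemble_context(sections)
print(context)
```

Listing the files first gives the model a map of the project before it reads any individual summary.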

Automating This With data2prompt

This is exactly the problem I built data2prompt to solve. It runs in your project directory and produces a single structured PROMPT.md (or .xml) file ready to paste into any LLM session.

pipx install data2prompt
cd your-data-project
data2prompt

What it does under the hood:

For CSV files: Draws a random sample of 15 rows (configurable), then computes the full stats block — dtype per column, missing value counts and percentages, and a describe() summary — on the entire file. Not the sample.

For Jupyter notebooks: Extracts code cells and text outputs in execution order. Strips Base64-encoded images and raw HTML that would waste thousands of tokens while contributing nothing to analysis context.
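One way the notebook step could look, assuming the standard nbformat v4 JSON layout. This is a sketch rather than data2prompt's implementation: sorting by `execution_count` is my reading of "execution order", and the inline notebook and function names are invented for the example.

```python
import json

# Assumption: a notebook in the standard nbformat v4 JSON layout.
NOTEBOOK_JSON = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Analysis"]},
        {
            "cell_type": "code",
            "execution_count": 2,
            "source": ["print('done')"],
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["done\n"]}],
        },
        {
            "cell_type": "code",
            "execution_count": 1,
            "source": ["df.describe()"],
            "outputs": [{
                "output_type": "execute_result",
                "data": {
                    "text/plain": ["summary table as text"],
                    "text/html": ["<table>...</table>"],
                    "image/png": "iVBORw0KGgo...",  # truncated placeholder
                },
            }],
        },
    ]
})


def as_text(value):
    """nbformat stores text as either a string or a list of strings."""
    return "".join(value) if isinstance(value, list) else value


def extract_notebook(raw: str) -> list[dict]:
    """Keep code and plain-text outputs; drop images and HTML."""
    cells = [c for c in json.loads(raw)["cells"] if c["cell_type"] == "code"]
    # Cells that were never executed have execution_count None; sort them last.
    cells.sort(key=lambda c: (c.get("execution_count") is None,
                              c.get("execution_count") or 0))
    extracted = []
    for cell in cells:
        outputs = []
        for out in cell.get("outputs", []):
            if out["output_type"] == "stream":
                outputs.append(as_text(out["text"]))
            elif out["output_type"] in ("execute_result", "display_data"):
                # Only the text/plain representation is kept.
                plain = out.get("data", {}).get("text/plain")
                if plain is not None:
                    outputs.append(as_text(plain))
            # Error outputs are skipped in this sketch.
        extracted.append({"code": as_text(cell["source"]), "outputs": outputs})
    return extracted


cells = extract_notebook(NOTEBOOK_JSON)
for cell in cells:
    print(cell)
```

A single embedded PNG can run to tens of thousands of Base64 characters, so keeping only `text/plain` removes most of a notebook's bulk.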

For SQL files: Applies intelligent sampling to SELECT-able content and surfaces schema structure for DDL statements.
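The post doesn't spell out how the SQL handling works, so the following is only one plausible reading: keep every DDL statement for schema, and keep a small random sample of the INSERT statements as example rows. The inline dump and the `summarize_sql` helper are hypothetical.

```python
import random
import re

# Made-up dump for illustration.
SQL_DUMP = """
CREATE TABLE orders (order_id TEXT, region TEXT, sales REAL);
INSERT INTO orders VALUES ('A1', 'West', 100.0);
INSERT INTO orders VALUES ('A2', 'East', 250.0);
INSERT INTO orders VALUES ('A3', 'West', 50.0);
INSERT INTO orders VALUES ('A4', 'South', 400.0);
"""


def summarize_sql(dump: str, max_inserts: int = 2, seed: int = 0) -> dict:
    """Keep all DDL, plus a random sample of INSERT statements."""
    # Naive split on ';': this breaks if a string literal contains a semicolon.
    statements = [s.strip() for s in dump.split(";") if s.strip()]
    ddl = [s for s in statements if re.match(r"(CREATE|ALTER)\b", s, re.IGNORECASE)]
    inserts = [s for s in statements if re.match(r"INSERT\b", s, re.IGNORECASE)]
    sampled = random.Random(seed).sample(inserts, min(max_inserts, len(inserts)))
    return {"ddl": ddl, "insert_total": len(inserts), "insert_sample": sampled}


summary = summarize_sql(SQL_DUMP)
print(summary["ddl"])
print(f"{len(summary['insert_sample'])} of {summary['insert_total']} INSERTs kept")
```

A real implementation would want a proper SQL parser instead of splitting on semicolons.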

For Excel files: Processes each sheet separately with the same stats-aware approach, up to a configurable sheet limit.

For .env files: Lists variable names with values redacted (SECRET_KEY=<redacted>). The LLM understands your configuration without you leaking credentials.
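The redaction idea is simple enough to sketch in a few lines. This is an illustration under assumed `KEY=value` syntax, not the tool's own code; the `redact_env` name and the sample values are invented.

```python
def redact_env(text: str) -> str:
    """Keep variable names, replace every value with <redacted>."""
    lines = []
    for line in text.splitlines():
        stripped = line.strip()
        # Blank lines and comments pass through unchanged.
        if not stripped or stripped.startswith("#"):
            lines.append(line)
            continue
        # Anything that isn't KEY=value is dropped rather than risk a leak.
        if "=" not in stripped:
            continue
        name = stripped.split("=", 1)[0].strip()
        lines.append(f"{name}=<redacted>")
    return "\n".join(lines)


# Made-up example content.
ENV_TEXT = """# local settings
SECRET_KEY=abc123
DEBUG=true
"""
redacted = redact_env(ENV_TEXT)
print(redacted)
```

Every value is redacted, including harmless ones like `DEBUG`, since guessing which values are sensitive is where leaks tend to happen. Comments are kept here, so a secret pasted into a comment would still get through.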

The Size Difference Is Dramatic

I ran this on a real Superstore analytics project — multiple CSVs, SQL dumps, a Jupyter analysis notebook, and Excel summaries:

| Tool | Output Size | Data Handling |
|------|-------------|---------------|
| data2prompt | 241 KB | Smart sampling + full-dataset stats |
| code2prompt | 9,304 KB | Raw file content |
| Repomix | 22,085 KB | Raw file content |

Same project. 91× smaller than Repomix while preserving everything the LLM actually needs.
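The ratios behind that comparison are plain division on the reported output sizes:

```python
# Output sizes in KB, as reported in the comparison above.
sizes_kb = {"data2prompt": 241, "code2prompt": 9304, "Repomix": 22085}

baseline = sizes_kb["data2prompt"]
ratios = {tool: size / baseline for tool, size in sizes_kb.items()}
for tool, ratio in ratios.items():
    print(f"{tool}: {ratio:.1f}x the size of data2prompt's output")
```

So the code2prompt output is also dozens of times larger than data2prompt's, not just the Repomix one.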

But the real win isn't the file size — it's that the LLM now gets better signal in fewer tokens. Less noise, cleaner attention, more focused responses.

How This Changes What You Can Ask

When your context is structured right, your prompts get dramatically more specific and the answers get dramatically better.

Before (raw data dump):

"Here's my data [paste 5,000 rows]. Can you find any interesting patterns?"

After (structured context):

"Here's my project context [paste PROMPT.md]. I'm seeing an unusual spike in returns in the West region in Q3. The discount column has 12.3% missing values — mostly concentrated in the Furniture category. Can you form a hypothesis about what's driving the return spike and suggest which columns to cross-tabulate to test it?"

The second prompt is possible because you already know the missing value distribution, the regional breakdown, and the data quality issues. data2prompt surfaces all of that automatically, so you can skip the exploratory small talk and go straight to the interesting question.

Schema-Only Mode for Exploration

When you're starting with a new project and don't yet know what to ask, there's a lighter option:

data2prompt --schema-only

This drops all data rows and gives you just column names, types, and statistics. Useful for a first conversation with an LLM where you want to explore structure before committing to an analysis direction.

Context Engineering Is Now a Core Skill

The data science community talks a lot about prompt engineering — how to phrase questions to get better answers. But for data-heavy work, the bigger leverage is in context engineering: how you structure and size the information you hand the model before you ask anything.

The gains I've seen from better context far outweigh the gains from better phrasing. Same model, same question, and a far more specific answer, just by giving it structured context instead of raw rows.

data2prompt is on GitHub and installable via PyPI: