Your model got smarter.
But suddenly it got slower.
Why does increasing context length explode compute?
Because full self-attention costs O(n²) in the sequence length n.
And that becomes the real bottleneck in modern LLMs.
Core Idea
Attention compares every token with every other token.
That is powerful.
But it is expensive.
Efficient Attention methods try to answer one question:
How do we keep useful context while reducing cost?
This matters because a long-context LLM is of little practical use if it is too slow or too expensive to run.
The Key Structure
Full Attention cost:
Attention Cost = O(n²)
Meaning:
n tokens → n × n comparisons
Example:
1,000 tokens → 1M comparisons
10,000 tokens → 100M comparisons
10× longer input → 100× more work
That is the bottleneck.
More compactly:
Attention = full connectivity + quadratic cost
Efficient Attention = reduce connections or optimize computation
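The numbers in the example above can be checked with a few lines of Python. This is only a counting sketch; the function name is made up for illustration.

```python
def full_attention_comparisons(n_tokens: int) -> int:
    """Number of query-key score computations in full self-attention."""
    return n_tokens * n_tokens


for n in (1_000, 10_000, 100_000):
    print(f"{n:>7,} tokens -> {full_attention_comparisons(n):>14,} comparisons")

# 10x longer input costs 100x more comparisons.
ratio = full_attention_comparisons(10_000) // full_attention_comparisons(1_000)
print("growth factor for 10x more tokens:", ratio)
```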
Pseudo-code View
Full attention:
for i in tokens:
    for j in tokens:
        score[i][j] = dot(Q[i], K[j])
Efficient attention idea:
# restrict or optimize comparisons
for i in tokens:
    for j in selected_tokens[i]:
        score[i][j] = dot(Q[i], K[j])
Or:
compute the same attention
but optimize memory access
Two strategies:
- reduce what you compute
- optimize how you compute
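The "reduce what you compute" strategy can be made runnable. The sketch below is pure Python on toy data, not a real implementation. It adds the usual 1/sqrt(d) scaling, softmax, and weighted sum over V, which the pseudo-code above leaves out. The `select` hook is a hypothetical name used only for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V, select=None):
    """Scaled dot-product attention over lists of vectors.

    select(i, n) returns the key indices query i may attend to
    (hypothetical hook for illustration); None means all n keys.
    Returns (outputs, number_of_dot_products).
    """
    n, d = len(Q), len(Q[0])
    outputs, comparisons = [], 0
    for i in range(n):
        idx = list(range(n)) if select is None else list(select(i, n))
        scores = [dot(Q[i], K[j]) / math.sqrt(d) for j in idx]
        comparisons += len(idx)
        weights = softmax(scores)
        out = [sum(w * V[j][c] for w, j in zip(weights, idx))
               for c in range(len(V[0]))]
        outputs.append(out)
    return outputs, comparisons


# Toy data: 6 tokens, dimension 2.
Q = K = V = [[float(i), 1.0] for i in range(6)]

full_out, full_cost = attention(Q, K, V)
# Each token looks only at itself and the previous token.
reduced_out, reduced_cost = attention(
    Q, K, V, select=lambda i, n: range(max(0, i - 1), i + 1)
)
print("full:", full_cost, "dot products; reduced:", reduced_cost)
```

The second strategy, "optimize how you compute", leaves `comparisons` unchanged and only changes where the intermediate values live.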
Concrete Example
Imagine reading a 10,000-token document.
Full Attention:
Every word looks at every other word.
That is like comparing every sentence to every sentence.
Local Attention:
Each word looks only at nearby words.
Like reading paragraph by paragraph.
Sparse Attention:
Each word looks at selected words.
Like focusing on keywords and headings.
FlashAttention:
Still reads everything.
But does it efficiently by avoiding unnecessary memory movement.
Different methods.
Same goal:
Reduce cost without losing important context.
Full Attention vs Efficient Attention
Full Attention:
- connects every token to every token
- captures long-range dependencies
- expensive in compute and memory
Efficient Attention:
- reduces connections or optimizes execution
- scales to longer sequences
- may trade away some connectivity (sparse methods) or portability (hardware-specific kernels) for efficiency
The key difference:
Full = maximum connectivity
Efficient = selective or optimized connectivity
Local Attention
Local Attention limits attention to a window.
Example:
Each token attends only to the last 128 tokens.
Cost becomes:
O(n × window)
Instead of O(n²)
This works because:
Nearby context often matters most.
But limitation:
Long-range dependencies can be missed.
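A quick count shows the O(n × window) cost. This sketch assumes a causal window that includes the token itself, so "last 128 tokens" means the token plus up to 127 earlier ones. The function names are illustrative.

```python
def local_window_indices(i: int, window: int) -> range:
    """Key positions token i attends to: itself and up to window-1 earlier tokens."""
    return range(max(0, i - window + 1), i + 1)


def local_attention_comparisons(n_tokens: int, window: int) -> int:
    """Total query-key comparisons under the sliding window."""
    return sum(len(local_window_indices(i, window)) for i in range(n_tokens))


n, window = 10_000, 128
local_cost = local_attention_comparisons(n, window)
full_cost = n * n

print(f"full : {full_cost:,} comparisons")
print(f"local: {local_cost:,} comparisons (upper bound n * window = {n * window:,})")
```

The first few tokens have fewer than 128 predecessors, so the total sits slightly under n × window.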
Sparse Attention
Sparse Attention generalizes Local Attention.
Instead of full connections:
Use structured patterns.
Examples:
- local windows
- strided attention
- global tokens
- block patterns
This reduces cost while keeping some long-range connections.
But trade-off:
Too sparse → lose important relationships
So many models mix:
full attention + sparse attention layers
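The patterns listed above can be combined into one mask. The sketch below is a toy mask builder, not any specific model's pattern. The parameter names and default values are assumptions chosen for illustration.

```python
def sparse_mask(n, window=2, stride=4, global_tokens=(0,)):
    """Boolean n x n mask; mask[i][j] is True if query i may attend to key j.

    Combines three illustrative patterns:
      - local:   |i - j| < window
      - strided: (i - j) is a multiple of stride
      - global:  j is a global token, or i is one (global tokens see everything)
    """
    g = set(global_tokens)
    return [
        [
            abs(i - j) < window or (i - j) % stride == 0 or j in g or i in g
            for j in range(n)
        ]
        for i in range(n)
    ]


n = 8
mask = sparse_mask(n)
kept = sum(sum(row) for row in mask)

for row in mask:
    print("".join("#" if allowed else "." for allowed in row))
print(f"{kept} of {n * n} connections kept")
```

Each `#` is a comparison that still happens; each `.` is one that is skipped. Making the pattern sparser saves more work but removes more paths between tokens.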
FlashAttention
FlashAttention does NOT change attention logic.
It changes how attention is computed.
Problem:
Attention is often memory-bound.
The GPU spends more time moving data between memory levels than computing.
FlashAttention solution:
- compute attention in blocks
- keep the working blocks in fast on-chip SRAM instead of slower GPU main memory
- avoid storing large intermediate matrices
Instead of:
store full attention matrix → read again
It does:
compute on-the-fly → minimize memory movement
Key idea:
Optimize IO, not just math
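The "compute in blocks without storing the full matrix" idea rests on a running (online) softmax. The sketch below shows that trick for a single query in pure Python. It is a simplified illustration, not the actual FlashAttention kernel, which also tiles the queries and runs as a fused GPU kernel. The 1/sqrt(d) scaling is omitted for brevity, and all names are illustrative.

```python
import math


def naive_attention_row(q, K, V):
    """Reference: materialize every score for one query, then softmax, then mix V."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in K]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e * v[c] for e, v in zip(exps, V)) / total
            for c in range(len(V[0]))]


def blockwise_attention_row(q, K, V, block_size=2):
    """Same result, but keys and values are visited one block at a time.

    Only a running max, a running softmax denominator, and a running
    weighted sum are kept, so the full score row is never stored.
    """
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * len(V[0])
    for start in range(0, len(K), block_size):
        k_blk = K[start:start + block_size]
        v_blk = V[start:start + block_size]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the old max.
        scale = math.exp(running_max - new_max)
        denom *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]


# Toy data: 1 query, 5 keys/values, dimension 2.
q = [1.0, 0.5]
K = [[0.1 * i, 1.0 - 0.2 * i] for i in range(5)]
V = [[float(i), float(i * i)] for i in range(5)]

naive = naive_attention_row(q, K, V)
blockwise = blockwise_attention_row(q, K, V, block_size=2)
print(naive)
print(blockwise)
```

Both functions do the same number of multiplications. The difference is that the blockwise version never needs the whole score row in memory at once.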
Naive vs Optimized View
Naive view:
Attention cost = math operations
Optimized view:
Attention cost = math + memory movement
Naive:
compute QK^T
store matrix
read it back and apply softmax
multiply by V
Optimized (FlashAttention):
compute in chunks
avoid large memory writes
reuse data efficiently
This is why FlashAttention speeds up real systems.
Not by changing theory.
But by fixing hardware inefficiency.
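To see why the stored matrix hurts, count its bytes. The sketch below assumes 2 bytes per value (as in fp16 or bf16) and counts a single attention head for a single sequence. Real systems multiply this by the number of heads and the batch size.

```python
def score_matrix_bytes(n_tokens: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to store one n x n attention score matrix."""
    return n_tokens * n_tokens * bytes_per_value


for n in (1_000, 10_000, 100_000):
    mb = score_matrix_bytes(n) / 1e6
    print(f"{n:>7,} tokens -> {mb:>10,.0f} MB per head")
```

Writing that matrix out and reading it back is the memory movement the optimized view tries to avoid.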
Why This Matters (Again)
Early:
Attention made Transformers powerful.
Now:
Attention limits how far they can scale.
If you cannot optimize attention:
- context stays short
- inference becomes slow
- cost explodes
Efficient attention enables:
- longer context windows
- faster inference
- lower GPU cost
- production-scale LLM systems
Important Conditions and Limits
Local Attention:
- fast
- but weak for long-range dependencies
Sparse Attention:
- flexible
- but pattern design matters
FlashAttention:
- exact attention
- but requires hardware-aware implementation
Also:
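Even optimized attention still grows with sequence length.
Local Attention grows linearly in n.
FlashAttention still does O(n²) arithmetic; it only avoids the O(n²) memory traffic.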
There is no free lunch.
Only better trade-offs.
Takeaway
Attention is the core of Transformers.
But it is also the bottleneck.
Full Attention = powerful but expensive
Efficient Attention = scalable but selective or optimized
The shortest version:
Efficient Attention = reduce connections OR optimize memory access
If you understand that, you understand why modern LLM engineering focuses so much on attention optimization.
Discussion
When working with long-context models, which matters more to you?
Accuracy from full attention or efficiency from optimized attention?
Originally published at zeromathai.com
Original article: https://zeromathai.com/en/efficient-attention-flashattention-sparse-en/
GitHub Resources
AI diagrams, study notes, and visual guides:
https://github.com/zeromathai/zeromathai-ai