DEV Community: will peixoto

AWS Lambda MicroVMs: run untrusted code with VM-level isolation (no infra to manage)

will peixoto — Wed, 24 Jun 2026 12:44:14 +0000

AWS just shipped Lambda MicroVMs, a new serverless primitive that gives each user or session a VM-level isolated sandbox, with near-instant launch and state preserved for up to 8 hours, all on Firecracker. Here is what it is, when to reach for it instead of a plain Lambda Function, and how to architect on top of it.

🇧🇷 Read in Portuguese.

Let me put you in a situation. You need to run a piece of code you did not write. Maybe it is the script your user pasted into your platform, maybe it is the snippet an AI agent just generated and wants to execute. And then comes the question that keeps anyone working on multi-tenant platforms up at night: how do I run this without handing a stranger the keys to the house?

Until last week you had three paths, each with a catch. A VM gives you strong isolation but takes minutes to boot. A container starts in seconds but shares a kernel, so running untrusted code there takes a pile of hardening. And the Lambda Function was built for short request-response work, not for a session that has to keep live state between one interaction and the next (externalizing it to DynamoDB stores the data, not the live runtime: the running process, the loaded packages, the memory). In the end you chose between performance and isolation. There was no way around it. Until now.

Container, VM, or Lambda: the trade-off none of them solved alone

This pattern has become common: AI coding assistants, interactive code environments, analytics, vulnerability scanners, game servers running player scripts. They all need the same thing: give each user their own environment to run code the team did not write, safely and without lag.

The knot is that real isolation and low latency pull in opposite directions. From a security angle you want a hard boundary between tenants (the Security pillar of the Well-Architected Framework: isolate what is not trusted). From an experience angle you want that environment up the instant the user shows up. Reconciling the two was the expensive work.

And there is a nice irony in this story. We spent years learning to build stateless apps, and now state is a requirement again.

The solution to the future was hiding in the past.

That is a line a friend dropped in a conversation, and it has not left my head since. Ever felt that way? Because I have. And it is roughly what Lambda MicroVMs does: it brings state back, without handing you the weight of a full VM.

What Lambda MicroVMs is

Lambda MicroVMs is a new primitive inside Lambda, built exactly for that gap. Each MicroVM gives a single user or session its own isolated environment that boots fast, keeps memory and disk for the whole session, and pauses to a low cost when the user steps away.

The magic comes from Firecracker, the same lightweight virtualization that already runs over 15 trillion Lambda invocations a month. This is not raw new tech, it is the mature foundation of Lambda itself, exposed in a new way.

The model is image-then-launch:

You build the image once (AWS runs your Dockerfile, initializes the app, and takes a snapshot of memory and disk). After that, every MicroVM you launch resumes from that snapshot instead of cold-booting. That is why launch and resume are near-instant, even for a multi-gigabyte session.

What it is actually for (with examples you will recognize)

The main cue: this only enters the picture if you are building a platform that runs third-party code. If your app does not execute outside code, you do not need it. It is a building block for people who build that kind of product:

Replit, CodeSandbox, "VS Code in the browser": the user types code in the browser and it runs isolated, per user, holding state while the tab is open. That "runs isolated" is the MicroVM.
Code interpreter (like ChatGPT's or Claude's): you ask "plot this CSV", the AI writes Python and runs it to answer you. The runtime that executes that generated code, isolated per conversation, is the use case.
CI/CD runner (and relatives): a job runs the code of a Pull Request that may come from any stranger's fork, untrusted by definition, so you want an isolated, disposable runner per job. Same family: a scanner that runs a suspicious binary, a coding-interview platform (the candidate's code runs isolated), an AI agent that runs shell commands.

The thread tying it all together: each user, session, or job needs its own isolated environment, and the code running there is not code you wrote. That is the cue to use a MicroVM instead of a Lambda Function.

Lambda Function or Lambda MicroVM?

They do not compete, they complete each other. The official comparison:

	Lambda Functions	Lambda MicroVMs
Best for	request-response or event-driven (APIs, data processing, automation)	persistent environments running user or AI-produced untrusted code
Programming model	function handler invoked in a supported runtime	any application: run your own binaries, listen on ports, use Linux OS capabilities
Duration	up to 15 min per invocation; multi-step workflows up to a year with Lambda Durable Functions	up to 8 hours per session; suspend and resume across sessions
Runtime	service-provided runtimes (or customer-provided)	customer-provided MicroVM images
Inbound networking	direct invocations or event-source integrations; response streaming	inbound access to any port using OSI Layer 7 protocols
Concurrency	one request per execution environment at a time	multiple concurrent connections per MicroVM
Environment state	warm starts may reuse the environment, but state may not persist across invocations	memory and disk state preserved on suspend, restored on resume
Scaling	automatic: Lambda creates and destroys environments in response to traffic	developer-controlled: you create, suspend, resume, and terminate via API
Lifecycle	fully managed by Lambda	developer-controlled, with optional idle policies
Pricing	per-request + GB-seconds	per-second of compute while running + snapshot storage while suspended

The most common confusion: people assume the duration is the same as Lambda's. The startup is similar (both resume from a snapshot), but a Function dies at 15 minutes while a MicroVM holds a session for up to 8 hours with state intact. The real design: your app keeps Lambda Functions for the event-driven backbone, and calls MicroVMs only for the steps that need to run untrusted code in isolation.

How it works in practice: from endpoint to orchestration

Three things trip people up at first, so let's take them together.

The endpoint has a status. When you call run-microvm, you get an ID and a dedicated HTTPS endpoint for that MicroVM. But it is not ready instantly: it goes through states, from launch to RUNNING (about 2 seconds), and when idle it moves to suspended, coming back on resume. The endpoint is per MicroVM, per session.

One image, many MicroVMs. You build the image once (create-microvm-image) and each MicroVM is a run-microvm. Want two? Call it twice, and you get two independent instances. Idle behavior is governed by the idle-policy: maxIdleDurationSeconds (suspend after that many seconds of idleness) and autoResumeEnabled (the next request wakes the MicroVM on its own, in about 1s, no manual restart). When you are done, terminate-microvm releases everything.
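To make the lifecycle and the idle-policy concrete, here is a minimal simulation in Python. It does not call AWS. The class names, the TERMINATED state name, and the tick method are assumptions invented for illustration; only maxIdleDurationSeconds and autoResumeEnabled are taken from the service.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    # Illustrative names: the post mentions RUNNING and "suspended";
    # TERMINATED is an assumed label for a MicroVM after terminate-microvm.
    RUNNING = "RUNNING"
    SUSPENDED = "SUSPENDED"
    TERMINATED = "TERMINATED"


@dataclass
class IdlePolicy:
    max_idle_duration_seconds: int  # mirrors maxIdleDurationSeconds
    auto_resume_enabled: bool       # mirrors autoResumeEnabled


@dataclass
class SimulatedMicroVM:
    policy: IdlePolicy
    state: State = State.RUNNING
    last_request_at: float = 0.0

    def tick(self, now: float) -> None:
        """Suspend the MicroVM once it has been idle for the configured time."""
        idle_for = now - self.last_request_at
        if self.state is State.RUNNING and idle_for >= self.policy.max_idle_duration_seconds:
            self.state = State.SUSPENDED

    def handle_request(self, now: float) -> bool:
        """Return True if the request is served, False if it is rejected."""
        if self.state is State.TERMINATED:
            return False
        if self.state is State.SUSPENDED:
            if not self.policy.auto_resume_enabled:
                return False  # someone has to resume it explicitly
            self.state = State.RUNNING  # the request itself wakes the MicroVM
        self.last_request_at = now
        return True

    def terminate(self) -> None:
        self.state = State.TERMINATED


if __name__ == "__main__":
    vm = SimulatedMicroVM(IdlePolicy(max_idle_duration_seconds=300, auto_resume_enabled=True))
    vm.handle_request(now=0.0)
    vm.tick(now=400.0)
    print(vm.state)                  # idle past the limit
    vm.handle_request(now=410.0)
    print(vm.state)                  # woken by the request
```

The useful part of the sketch is the branch in handle_request: with autoResumeEnabled off, your orchestrator has to notice the suspended state and resume the MicroVM itself before routing traffic.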

You become the orchestrator. Since the endpoint is per session, something has to decide when to launch and where to route. Typically a Lambda Function in the backbone does it: it keeps a session -> MicroVM map (a store like DynamoDB in production), calls RunMicrovm on a user's first access, stores the ID and endpoint, mints a short-lived token with CreateMicrovmAuthToken, and proxies the request to the MicroVM's endpoint with the X-aws-proxy-auth header. If the instance is suspended and autoResume is on, the request itself wakes it. Add a routine to terminate orphan MicroVMs and you have the skeleton. The backbone code is in the next post in the series. And do not confuse this with Step Functions: a MicroVM is an execution environment, Step Functions is an orchestrator, and they sit at different layers.
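Until that post lands, the routing logic described above can be sketched without touching AWS. Everything here is an assumption for illustration: the MicroVMClient interface only echoes the API names mentioned in the text (its method names and signatures are not the real SDK), FakeClient is an in-memory stand-in, the endpoint URLs are invented, and a plain dict plays the role of the DynamoDB table.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol, Set


@dataclass
class MicroVM:
    microvm_id: str
    endpoint: str


class MicroVMClient(Protocol):
    """Hypothetical interface; NOT the real SDK surface."""
    def run_microvm(self, image_id: str) -> MicroVM: ...
    def create_microvm_auth_token(self, microvm_id: str) -> str: ...
    def terminate_microvm(self, microvm_id: str) -> None: ...


class FakeClient:
    """In-memory stand-in so the sketch runs without credentials."""
    def __init__(self) -> None:
        self.launched = 0
        self.terminated: List[str] = []

    def run_microvm(self, image_id: str) -> MicroVM:
        self.launched += 1
        vm_id = f"mvm-{self.launched}"
        return MicroVM(vm_id, f"https://{vm_id}.example.invalid")  # made-up endpoint

    def create_microvm_auth_token(self, microvm_id: str) -> str:
        return f"token-for-{microvm_id}"

    def terminate_microvm(self, microvm_id: str) -> None:
        self.terminated.append(microvm_id)


class SessionRouter:
    def __init__(self, client: MicroVMClient, image_id: str) -> None:
        self.client = client
        self.image_id = image_id
        self.sessions: Dict[str, MicroVM] = {}  # a DynamoDB table in production

    def route(self, session_id: str, path: str) -> dict:
        """Describe the proxied request for this session, launching on first access."""
        vm = self.sessions.get(session_id)
        if vm is None:
            vm = self.client.run_microvm(self.image_id)
            self.sessions[session_id] = vm
        token = self.client.create_microvm_auth_token(vm.microvm_id)
        return {"url": vm.endpoint + path, "headers": {"X-aws-proxy-auth": token}}

    def reap(self, active_session_ids: Set[str]) -> None:
        """Terminate MicroVMs whose session is no longer active (the orphan sweep)."""
        for session_id in list(self.sessions):
            if session_id not in active_session_ids:
                vm = self.sessions.pop(session_id)
                self.client.terminate_microvm(vm.microvm_id)


if __name__ == "__main__":
    router = SessionRouter(FakeClient(), image_id="img-demo")
    print(router.route("alice", "/run"))
    print(router.route("bob", "/run"))
    router.reap(active_session_ids={"alice"})
```

Keeping the client behind an interface is a deliberate choice in the sketch: the routing and reaping rules can be tested locally, and the real service calls get plugged in later.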

Cost, limits, and what is still missing

Cost is a decision, not a detail. Werner Vogels keeps hammering in the Frugal Architect that cost is an architecture requirement, not a number you discover on the bill. The suspend is exactly that in practice: you pay a lot for VM-level isolation, but only while the user is active. When they leave, the MicroVM suspends and the cost drops, with no loss of state. Designing your idle-policy on purpose is a cost decision. The model, from the official table: you pay per second of compute while it runs, and only snapshot storage while it is suspended. Unit prices are on the Lambda pricing page.
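A back-of-the-envelope model shows why the idle-policy is a cost lever. This is a sketch under loud assumptions: both prices below are placeholders I made up, not AWS rates, and the formula only encodes the two billing dimensions from the table (compute per second while running, snapshot storage while suspended). Check the pricing page before trusting any number.

```python
def session_cost(
    running_seconds: float,
    suspended_seconds: float,
    snapshot_gb: float,
    compute_price_per_second: float,
    snapshot_price_per_gb_second: float,
) -> float:
    """Cost of one session: compute while running plus snapshot storage while suspended."""
    compute = running_seconds * compute_price_per_second
    storage = suspended_seconds * snapshot_gb * snapshot_price_per_gb_second
    return compute + storage


# PLACEHOLDER prices for illustration only, not real AWS rates.
COMPUTE_PRICE = 0.0001      # per second while running
SNAPSHOT_PRICE = 0.000001   # per GB-second while suspended
HOUR = 3600

# Same 8-hour window, two idle policies.
never_suspends = session_cost(8 * HOUR, 0, 4, COMPUTE_PRICE, SNAPSHOT_PRICE)
suspends_when_idle = session_cost(1 * HOUR, 7 * HOUR, 4, COMPUTE_PRICE, SNAPSHOT_PRICE)

if __name__ == "__main__":
    print(f"never suspends:     {never_suspends:.4f}")
    print(f"suspends when idle: {suspends_when_idle:.4f}")
```

Swap in the real unit prices and your own activity profile; the shape of the comparison matters more than the placeholder numbers.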

Limits: ARM64, up to 16 vCPUs, 32 GB of memory, and 32 GB of disk per MicroVM, and up to 8 hours of total runtime. Provisioning is flexible: you set a baseline and burst up to 4x at peak, paying the baseline while it runs.
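These limits are easy to encode as a pre-flight check in your orchestrator. A minimal sketch: the numbers come from the paragraph above, while MicroVMSpec and limit_violations are hypothetical names of my own. It leaves the baseline and burst setting out, since the text does not say how the 4x burst interacts with the ceilings.

```python
from dataclasses import dataclass
from typing import List

# Limits as stated in the post; they may change, so treat them as configuration.
SUPPORTED_ARCHITECTURE = "arm64"
MAX_VCPUS = 16
MAX_MEMORY_GB = 32
MAX_DISK_GB = 32
MAX_RUNTIME_HOURS = 8


@dataclass
class MicroVMSpec:
    """Hypothetical description of what a session is expected to need."""
    architecture: str
    vcpus: int
    memory_gb: int
    disk_gb: int
    expected_runtime_hours: float


def limit_violations(spec: MicroVMSpec) -> List[str]:
    """Return a human-readable list of the limits this spec would exceed."""
    problems: List[str] = []
    if spec.architecture.lower() != SUPPORTED_ARCHITECTURE:
        problems.append(f"architecture {spec.architecture!r} is not {SUPPORTED_ARCHITECTURE}")
    if spec.vcpus > MAX_VCPUS:
        problems.append(f"{spec.vcpus} vCPUs exceeds {MAX_VCPUS}")
    if spec.memory_gb > MAX_MEMORY_GB:
        problems.append(f"{spec.memory_gb} GB of memory exceeds {MAX_MEMORY_GB} GB")
    if spec.disk_gb > MAX_DISK_GB:
        problems.append(f"{spec.disk_gb} GB of disk exceeds {MAX_DISK_GB} GB")
    if spec.expected_runtime_hours > MAX_RUNTIME_HOURS:
        problems.append(f"{spec.expected_runtime_hours} h of runtime exceeds {MAX_RUNTIME_HOURS} h")
    return problems


if __name__ == "__main__":
    print(limit_violations(MicroVMSpec("arm64", 4, 8, 16, 2)))
    print(limit_violations(MicroVMSpec("x86_64", 32, 64, 16, 12)))
```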

IaC: you can use the console, CloudFormation, and CDK.

Why Dockerfile + zip, and not a prebuilt ECR image? Aidan Steele dug into it: Lambda builds two copies of the image, one for Graviton 3 and one for Graviton 4, so it needs the source to recompile. The base comes from ECR Public, but pushing your own prebuilt image from a private ECR as the artifact is not the path. One thing that confuses people coming from containers: ECR does not leave your life. You do not deliver the MicroVM image via ECR, but inside the running MicroVM you can run Docker and docker pull your private ECR images at runtime. ECR is for consumption inside, not for delivering the image itself.

Networking and region: inbound traffic on configurable ports (HTTP/2, gRPC, WebSockets), service-provided JWE auth, outbound to the internet or your VPC. And it is available so far only in US East (N. Virginia, Ohio), US West (Oregon), Europe (Ireland), and Asia Pacific (Tokyo).

When NOT to use it

If the workload is short request-response with no state, it stays a Lambda Function. A MicroVM there is a cannon for a mosquito. And if you just need more than 15 minutes with your own (trusted) code, a MicroVM is also overkill: for a long job, look at Fargate; for a multi-step workflow, Lambda Durable Functions (up to a year, as the table shows). MicroVMs are for when the differentiator is isolating untrusted code, not just going past 15 minutes.

There is also a gotcha AWS itself flags, and it rhymes with the determinism conversation: since the MicroVM boots from a pre-initialized snapshot (the equivalent of Lambda SnapStart, as Aidan Steele confirmed by testing), apps that generate unique content, open connections, or load ephemeral data at init may diverge. The snapshot froze a moment; whatever needs to be fresh per session cannot be frozen along with it. The fix has a name: lifecycle hooks to re-initialize randomness when each MicroVM is created. Map that out before assuming it just works.
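The effect is easy to reproduce locally. In this sketch copy.deepcopy stands in for "resume from the snapshot" and on_restore stands in for a lifecycle hook; both are analogies I chose for illustration, not the service's actual mechanism or hook API.

```python
import copy
import random
import secrets


class App:
    def __init__(self) -> None:
        # Runs once, at image build time, BEFORE the snapshot is taken.
        self.rng = random.Random(secrets.randbits(64))

    def new_session_token(self) -> str:
        return f"{self.rng.getrandbits(64):016x}"

    def on_restore(self) -> None:
        """Hypothetical lifecycle hook: re-seed whatever must be unique per MicroVM."""
        self.rng.seed(secrets.randbits(64))


snapshot = App()  # the frozen moment

# Two MicroVMs resumed from the same snapshot share the same generator state...
vm_a = copy.deepcopy(snapshot)
vm_b = copy.deepcopy(snapshot)
collision = vm_a.new_session_token() == vm_b.new_session_token()

# ...unless each one re-initializes its randomness when it is created.
vm_c = copy.deepcopy(snapshot)
vm_d = copy.deepcopy(snapshot)
vm_c.on_restore()
vm_d.on_restore()
diverged = vm_c.new_session_token() != vm_d.new_session_token()

if __name__ == "__main__":
    print("same token without hook:", collision)
    print("different tokens with hook:", diverged)
```

The same reasoning applies to anything else captured at init: open connections, cached credentials, and timestamps all need to be refreshed per MicroVM, not just random seeds.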

Does it kill the container?

No, and the reason is even better.

The hype of the week is "containers are obsolete." They are not. Quite the opposite: Aidan Steele tested it and you can run Docker inside a MicroVM, with OS capabilities enabled. So the MicroVM does not kill the container, it is more isolated and still runs containers inside. The honest cut is different: there is one specific spot, running untrusted code in isolation, where you will no longer want to harden a container by hand. There the MicroVM wins. Everywhere else, the container is still king.

The details the docs leave out

Aidan Steele spent launch day poking at the service and found some really interesting things that are not in the official docs. I read it and figured it was worth bringing here:

You can get a shell into the MicroVM, via the CreateMicrovmShellAuthToken API, with pty as a first-class citizen (Lambda Functions do not have it). Gold for IDE and coding-agent use cases.
Outbound UDP is blocked by default and DNS is a local stub, so DNS inside a container falls back to 8.8.8.8 and fails. The fix is to run with Lambda's DNS: docker run --dns 169.254.169.253, or go via VPC.
Lambda network connectors: a reified VPC config (subnets, security groups, an IAM role for the ENI) with its own lifecycle. The network team creates it, the developer just consumes it.
Performance (his tests): image build 2-3 min; RunMicrovm to RUNNING about 2s, plus 2s to serve; suspend and resume about 1s each.

What you take away

Lambda MicroVMs fills a real gap: VM-level isolation with near-instant launch and per-session state, which no single service delivered together.
It does not replace the Lambda Function, it complements it. Function in the backbone, MicroVM for the untrusted code.
The idle suspend is a deliberate cost lever: design your idle-policy on purpose.
Before locking in architecture: check the region (no São Paulo yet), the limits (ARM64, 16 vCPU, 32 GB, 8h), and the snapshot caveat.

This post was the map. In the next one in the series I actually spin up a MicroVM and we prove the isolation in practice, launching two MicroVMs and testing whether one can reach the other, with the repo on GitHub for you to run along.

Got a case where you run user or AI code that today is duct-taped onto a container or a hand-rolled VM? Does this primitive fit? Drop a like, share it with whoever is building a multi-tenant platform, and let's talk. Cheers! =D

Originally published on willpeixoto.dev.

High Availability Has a Price: Resilience Is a Decision, Not a Stack

will peixoto — Thu, 23 Oct 2025 16:26:18 +0000

The cost of high availability: resilience starts with strategic decisions, not technology.

🇧🇷 Read in Portuguese →

After a major outage like the October 2025 event in AWS's us-east-1 region, the smoke clears and the same questions that haunt CTOs and architects always resurface:

"Should we be multi-region?" Or worse (and a little nostalgic): "Should we go back to on-premises?"

The truth is that the right answer is rarely technical.

It's strategic, first and foremost.

As Werner Vogels (Amazon CTO) likes to put it in his talks:

"Everything fails, all the time."

And that's exactly it. The central question isn't whether it will fail, it's when it will fail and how prepared you'll be when that inevitable moment arrives. Because it will arrive. Whether you're in the cloud, on-premises, or running a complex multi-cloud setup.

What really separates resilient teams isn't the absence of failure. It's the speed, clarity, and effectiveness with which they respond and recover.

And that's where real architectural maturity lives: resilience isn't about choosing "multi-region" or "on-premises." It's about understanding the inherent risk, documenting the choice transparently, and reacting with a plan.

1. The Context Behind the Question: The Paradox of Visible Failure

Every time there's a big outage, I notice technical teams and executives splitting into two extreme reactions, both driven by fear and pressure:

"We need to go multi-region, now! Cost is secondary!"
"See? The cloud isn't reliable. We should have stayed on-premises, where we had control!"

Both extremes are dangerous shortcuts.

Multi-region is not a vaccine against downtime, and going back to on-premises is not a synonym for control; it just moves the maintenance complexity onto you.

A Crucial Point of Reflection: The cloud doesn't fail more than a traditional data center, it just fails in a way that is more visible, shared, and, ironically, democratic. On AWS, problems scale globally and become trending topics in minutes. On-premises, they hide behind scattered logs, long repair times, and, often, they only hit you. Honestly: do you believe your company has a greater capacity than AWS (or any major cloud provider) to manage physical security, cabling, power, cooling, and, above all, the resilience of infrastructure at global scale?

Migrating or evolving an architecture, at its core, is not about "throwing everything away" or "buying the hype." It's about keeping what's good in the legacy and removing what limits growth.

This isn't a black-and-white fight of "Cloud vs. Data Center." It's a strategic game of Conscious Resilience vs. Comfort Zone.

2. Cost vs. Continuity: The Economics Behind the 9s

In the world of infrastructure, each additional "9" in your SLA (Service Level Agreement) doesn't just cost more. It costs disproportionately more, in money and in operational complexity.

To illustrate the real impact of each availability tier, here's the maximum allowed downtime per year:

99% (two 9s): about 3.65 days of downtime per year. Cost and complexity: baseline (1x).
99.9% (three 9s): about 8 hours and 46 minutes of downtime per year. Cost and complexity: 1.5x to 2x the baseline.
99.99% (four 9s): about 52 minutes of downtime per year. Cost and complexity: 2x to 3x. Requires Multi-AZ and strong automation.
99.999% (five 9s): about 5 minutes of downtime per year. Cost and complexity: 3x and up. Requires flawless automation and, often, a Multi-Region architecture.
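The downtime figures above are plain arithmetic, and it is worth being able to redo it for a target that is not on the list (99.95%, for instance). A small sketch, assuming a 365-day year; the function name is mine.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Maximum downtime per year, in minutes, allowed by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.95, 99.99, 99.999):
        minutes = downtime_minutes_per_year(target)
        print(f"{target}%: {minutes:.1f} min/year ({minutes / 60:.2f} h, {minutes / 1440:.3f} days)")
```

Running it for the four tiers reproduces the list above: each extra 9 divides the allowed downtime by ten, which is why the last tiers leave no room for manual recovery.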

Each tier jump means more than doubling or tripling infrastructure; it also demands operational review and sophistication. And here's the catch: every additional 9 has to be justified by ROI (Return on Investment), never by technical pride.

📢 The Non-Negotiable Factor: Regulation. For sectors like finance, healthcare, or telecom, the SLA choice isn't always purely economic. Often, the availability requirement (and the data recovery capability, the RPO) is imposed by law or industry rules. In those cases, the debate isn't whether you can afford it, but how to hit the legally mandated SLA at the lowest possible cost and complexity, because the cost of a regulatory fine outweighs any technical saving.

Rule of Thumb for Complexity:

High availability (within a single region): can cost 1.5x to 2x the baseline.
Multi-Region (Active/Passive): can cost 2.5x to 3x.
Multi-Cloud (Active/Active): almost never reduces risk. On the contrary, it usually increases the failure surface and operational complexity.

3. Conscious Decisions: The Virtue of ADRs

Every architectural choice is a commitment based on a context, and that context is volatile. Without a record, the context is lost, which condemns us to redo decisions, revisit old discussions, and rack up unnecessary cost.

That's where the practice of ADRs (Architecture Decision Records) becomes crucial. These aren't 50-page documents. They're short records that capture the decision, the reason, and the accepted risk at a specific point in time.

Example ADR (focused on the accepted risk):

# ADR-014: Do not use multi-region replication in the MVP

Context:
- Current traffic < 10 req/s.
- Multi-region replication cost is estimated at > 3x current cost.

Decision:
Keep a single-region architecture (using Multi-AZ for intra-region HA),
with a daily cross-region backup.

Review Trigger:
After reaching an average of 100 req/s, or when the current SLA (99.95%)
starts causing business impact.

Accepted Risk / Consequence:
Risk of total service downtime if an outage affects the entire region
(estimated RTO of 4 hours for cross-region recovery).

An ADR doesn't prevent failure. But it prevents failure from catching the team by surprise, because the risk was mapped, accepted, and justified by the business. It's the map for future discussions.

4. Selective Resilience: Not Everything Needs HA (and That's Fine)

Selective resilience is a virtue of economy and clarity. Not every service needs global redundancy. Spending finite resources (money and engineering attention) on unnecessary redundancy is one of the biggest forms of waste in architecture.

Prioritize High Availability (HA) only for what truly matters:

Direct revenue functions: components critical to the financial transaction (e.g., checkout and payment APIs).
The critical customer journey: functions that block the core value of the product (e.g., login or the main catalog).
Regulatory and legal risk: services where failure triggers legal fines or breaches a penalizing contractual SLA.
Integrity of critical data: where data loss violates an acceptable RPO (e.g., mandatory data retention systems).

Everything else? It can be restored through a well-defined recovery playbook. Batch jobs, internal back-office systems, and dashboards can tolerate minutes (or even hours) of downtime, as long as the reprocessing plan is clear.

High availability with no purpose is like installing an airbag on a bicycle. It's a sophisticated solution to a problem that doesn't exist in that context.

5. Managed != Failure-Proof: The Serverless Mindset

A common mistake is believing that using serverless services (Lambda, DynamoDB, SQS, EventBridge) is a synonym for immunity to failure. It isn't.

Failure will come, and often from where you least expect it, because the serverless paradigm shifts the risk surface, it doesn't remove it.

The key point is this:

Managed services reduce your operational surface (you don't manage the OS, patching, or capacity), but they don't replace good design and preparation.

During the October 2025 us-east-1 outage, plenty of 100% serverless applications went down. Not because serverless failed them, but because they leaned on a single region. When DNS resolution for the regional DynamoDB endpoint broke, anything pinned to us-east-1 (directly, or indirectly through a global control plane like IAM or STS) broke with it. Multi-AZ wouldn't have saved you here: the endpoint was regional, not zonal. And the applications that recovered slowest were frequently the ones whose code answered the failure with aggressive, unbounded retries, turning one outage into a self-inflicted retry storm.

Real resilience doesn't come from AWS. It comes from the architecture you design on top of it.
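The retry storm is the part you fully control in your own code. Here is a minimal sketch of bounded retries with exponential backoff and full jitter. The function name, the defaults, and the choice of which exceptions count as transient are my assumptions; tune them to your client library.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.2,
    max_delay: float = 5.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying transient errors a bounded number of times."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # give up: surface the failure instead of hammering the dependency
            # Exponential cap, then a random wait below it ("full jitter"),
            # so thousands of clients do not retry in lockstep.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky() -> str:
        state["calls"] += 1
        if state["calls"] < 3:
            raise ConnectionError("endpoint unavailable")
        return "ok"

    print(call_with_backoff(flaky), "after", state["calls"], "calls")
```

Two properties matter here: the number of attempts is bounded, and the waits are randomized. Unbounded, synchronized retries are what turn a dependency outage into a second outage of your own making.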

6. The Decision Belongs to the Business, the Clarity to the Architect

The difference between "having an opinion" and "having influence" lies in your ability to translate technical complexity into strategic clarity. Your job isn't to scare the board with jargon. It's to give them the visibility they need to decide consciously.

Experience has taught me that a team's maturity can be measured precisely by its ability to ask the right question:

❓ Where Is Your Team's Maturity?

Immature teams focus on the tool:

They ask: "Which stack solves this?"
They ask: "Should we use K8s or Serverless?"
They ask: "What does Netflix do?"

Mature teams focus on risk and the business:

They ask: "What risk are we willing to accept for this cost?"
They ask: "What RTO/RPO does the end customer require from this service?"
They ask: "What does our business need to survive a disaster?"

The result is that two teams can use the exact same cloud: one scales predictably, the other lives in panic mode. The difference isn't the cloud. It's the level of understanding, documentation, and technical humility behind the decisions made.

The Common Trap: Who hasn't heard an executive say, "Technical decisions are up to the Architecture team"? What they're actually doing is transferring responsibility for defining business risk. Your team defines the HOW (the stack), but the Business defines the HOW MUCH (the acceptable RTO and RPO). It's your job to hand the question back, so the risk decision belongs to the business.

Translating Resilience Concepts for Leadership:

(After all, who hasn't heard: "Now translate that so I can understand it!")

1. Multi-Region Failover

   Translation: Insurance against catastrophe. It guarantees that a
                regional disaster won't take us offline for days,
                reducing revenue loss to a few hours.

   Question:    How many hours (or minutes) of downtime can the
                business accept for service X if an entire region goes down?
------------------------------------------------------------------
2. Active-Active Setup

   Translation: Maximum, uninterrupted availability. It lets us perform
                any maintenance or update without ever impacting the
                end customer.

   Question:    Does service X need to be 100% continuous? Can we afford
                a 15-minute maintenance window?
------------------------------------------------------------------
3. RTO / RPO

   Translation: Defining the limit of the damage. These are the numbers
                that tell us what we can lose, and for how long, before
                fines or reputation become unsustainable.

   Question:    How much data (RPO) can we lose, and how long (RTO) does
                the team have to restore the service before the business breaks?
------------------------------------------------------------------
4. SPOF (Single Point of Failure)

   Translation: The Achilles' heel of revenue. It's the weak point that,
                if broken, paralyzes the whole company. This is where the
                risk must be zero.

   Question:    If this component goes down, what's the financial loss
                in 1 hour?

7. Conclusion

There is no such thing as a failure-proof architecture.

But there is such a thing as an organization that is surprise-proof.

And it starts with conscious decisions, documentation (the ADRs), and the technical humility to accept that error and risk are part of the equation.

Teams that understand the "why" before diving into the "how" build systems that don't just scale. They survive, and grow predictably.

8. Essential References

For anyone who wants to go deeper on risk decisions and architectural patterns, these are the documents we use as a foundation for resilience on any cloud (with a focus on AWS):

AWS Well-Architected Framework (Reliability Pillar): the fundamental guide to understanding disaster recovery (DR) and high availability (HA) principles.
Disaster Recovery of Workloads on AWS: the key document for going deeper on RTO/RPO and choosing between patterns like Pilot Light and Active-Active.
DynamoDB Global Tables: an excellent practical case study of HA at the data layer, abstracting away multi-region complexity.
EventBridge Resilience Guide: essential for anyone working with serverless, focused on event-based resilience patterns.
AWS Post-Event Summary: Amazon DynamoDB Service Disruption in US-EAST-1 (Oct 19-20, 2025): the primary source on the outage referenced in this article, straight from AWS.

9. Essential Resilience Glossary

So everyone is on the same page, here are some key terms used in this article, explained simply:

High Availability (HA): the ability of a system to keep operating even when one or more of its components fail. Measured in "9s" (e.g., 99.99%).
Outage: an unplanned interruption of a service: the service goes down.
On-premises: infrastructure and data centers you own, physically located at the company (not in the cloud).
Multi-Region: using data centers in two or more different geographic cloud regions (e.g., US East and São Paulo) for maximum protection against regional disasters.
Multi-AZ (Multi-Availability Zone): using two or more Availability Zones (isolated, nearby data centers) within the same cloud region. This is the baseline HA pattern.
SLA (Service Level Agreement): a formal agreement defining the level of service a provider is expected to deliver (usually measured in uptime).
ROI (Return on Investment): a financial metric measuring the relationship between money earned (or saved) and money invested.
ADR (Architecture Decision Record): a short document recording an architectural decision, the reasoning, and the accepted risk at a specific point in time.
RTO (Recovery Time Objective): the maximum acceptable time a system can be down after a failure.
RPO (Recovery Point Objective): the amount of data (measured in time, e.g., 5 minutes) that can be lost during a disaster event.
Serverless: a cloud computing model where the provider manages all the infrastructure and the developer focuses only on the code, paying only for usage.
Circuit Breaker: a software pattern that, when a dependency starts failing repeatedly, "opens" the circuit to protect the rest of the application from cascading failures.
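The Circuit Breaker entry is the easiest of these terms to make concrete in code. A minimal single-threaded sketch follows; the class, its thresholds, and the injected clock are assumptions for illustration, and a production version would need locking and metrics.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, operation: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency is failing; call skipped")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            # A failed trial call re-opens the circuit; otherwise open at the threshold.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result
```

Wrapped around a call to a failing dependency, the breaker fails fast while the circuit is open, which protects both your own threads and the dependency that is trying to recover.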

Want to go deeper on resilience, orchestration, and the architect's strategic role in the serverless era?

I write about this kind of thing regularly. If this resonated, I'd like to hear it: how did your architecture hold up during the last regional outage, and, more importantly, which trade-offs had you already written down before it happened? Find me on LinkedIn and let's compare notes.

I'll be at ServerlessDays São Paulo on November 8th, at Cubo Itaú, discussing how to go beyond the stack and build systems that not only function but thrive in chaos.

Come join the conversation! 🚀