saif ur rahman

Posted on Jun 24

The Day DNS Broke Our Deployment: Solving a Serverless Framework S3 Resolution Failure on AWS

#aws #s3 #dns #serverless

Introduction

As engineers, we often spend hours optimizing code, improving prompts, and scaling infrastructure.

But sometimes the biggest production issues come from something much simpler.

A DNS lookup.

Recently, while deploying a serverless AI application to AWS, I encountered an error that completely blocked deployment.

The application hadn't changed.

AWS was healthy.

Permissions were correct.

The deployment package was valid.

Yet every deployment failed.

The error looked like this:

Error:
getaddrinfo EAI_AGAIN serverless-framework-deployments-eu-north-1-xxxxxxxx.s3.eu-north-1.amazonaws.com

At first glance, it looked like an AWS outage.

It wasn't.

This is the story of how a simple DNS resolution issue brought an entire deployment pipeline to a halt—and how we fixed it.

The Project

The application was an AI-powered Due Diligence Platform built with:

AWS Lambda
Amazon Bedrock
Amazon DynamoDB
Amazon SQS
Serverless Framework

Deployment flow:

Developer
     ↓
Serverless Framework
     ↓
S3 Deployment Bucket
     ↓
CloudFormation
     ↓
Lambda Functions

Every deployment package is first uploaded to an S3 bucket created by the Serverless Framework.

Only after the upload succeeds does CloudFormation update the stack.

The Error

During deployment, the terminal suddenly returned:

serverless deploy

✖ Error:
getaddrinfo EAI_AGAIN serverless-framework-deployments-eu-north-1-xxxxxxxx.s3.eu-north-1.amazonaws.com

The deployment stopped immediately.

No Lambda updates.

No CloudFormation changes.

Nothing.

First Assumption: AWS Was Down

The first thing I checked was AWS Service Health.

Everything was operational.

S3 was healthy
CloudFormation was healthy
Lambda was healthy

No incidents were reported.

Second Assumption: IAM Permissions

The next suspect was permissions.

I verified:

aws sts get-caller-identity

Response:

{
  "Account": "123456789012",
  "Arn": "arn:aws:iam::123456789012:user/developer"
}

Credentials were valid.

Permissions were correct.

Still failing.

Third Assumption: Serverless Framework Bug

I upgraded Serverless Framework.

npm install -g serverless

Deployment still failed.

The Real Problem

The key clue was:

EAI_AGAIN

This is not an AWS error.

It is a DNS resolution error.

Specifically:

EAI_AGAIN
=
Temporary DNS lookup failure

The operating system could not resolve the S3 endpoint hostname.

The request never reached AWS.

How We Confirmed It

I manually tested DNS resolution:

nslookup google.com

Intermittent failures appeared.

Then:

nslookup s3.eu-north-1.amazonaws.com

The same issue occurred.

This confirmed that the problem existed locally.

Not in AWS.

Root Cause

The machine was using an unstable DNS resolver.

Under heavy network usage, DNS lookups occasionally timed out.

When Serverless Framework attempted to upload artifacts to S3:

Serverless
      ↓
DNS Lookup
      ↓
Failure
      ↓
Deployment Stops

No connection to AWS was ever established.

The Fix

We switched to reliable public DNS servers.

Linux:

sudo nano /etc/resolv.conf

Added:

nameserver 8.8.8.8
nameserver 1.1.1.1

Then restarted networking:

sudo systemctl restart NetworkManager

Validation

After updating DNS:

nslookup s3.eu-north-1.amazonaws.com

Returned instantly.

Deployment succeeded:

serverless deploy

Output:

✔ Service deployed successfully

Additional Improvements

To avoid future issues, we added several safeguards.

Retry Logic

serverless deploy || serverless deploy

Useful for CI/CD jobs when transient network issues occur.

Connectivity Check

Before deployment:

curl https://s3.eu-north-1.amazonaws.com

If connectivity fails:

Stop deployment

This prevents wasting build minutes on doomed deployments.

AWS Credential Validation

Added:

aws sts get-caller-identity

to deployment pipelines.

This immediately detects expired or invalid credentials.

Lessons Learned

The biggest lesson was simple:

Not every AWS deployment error is actually an AWS problem.

Sometimes:

DNS fails
Local networking fails
VPNs interfere
Corporate firewalls interfere

And the cloud gets blamed.

Production Debugging Framework

When deployment issues occur, I now follow this order:

Step 1 — Validate AWS Credentials

aws sts get-caller-identity

Step 2 — Validate Internet Connectivity

ping google.com

Step 3 — Validate DNS

nslookup s3.eu-north-1.amazonaws.com

Step 4 — Validate AWS Services

aws s3 ls

Step 5 — Run Deployment

serverless deploy

This process has saved hours of troubleshooting.

Final Thoughts

As engineers, we often expect complex problems to have complex causes.

This incident reminded me that some of the most disruptive failures originate from the most basic layers of infrastructure.

A single DNS lookup failure stopped an entire deployment pipeline.

The code was correct.