close

DEV Community

Cover image for The Day DNS Broke Our Deployment: Solving a Serverless Framework S3 Resolution Failure on AWS
saif ur rahman
saif ur rahman

Posted on

The Day DNS Broke Our Deployment: Solving a Serverless Framework S3 Resolution Failure on AWS

Introduction

As engineers, we often spend hours optimizing code, improving prompts, and scaling infrastructure.

But sometimes the biggest production issues come from something much simpler.

A DNS lookup.

Recently, while deploying a serverless AI application to AWS, I encountered an error that completely blocked deployment.

The application hadn't changed.

AWS was healthy.

Permissions were correct.

The deployment package was valid.

Yet every deployment failed.

The error looked like this:

Error:
getaddrinfo EAI_AGAIN serverless-framework-deployments-eu-north-1-xxxxxxxx.s3.eu-north-1.amazonaws.com
Enter fullscreen mode Exit fullscreen mode

At first glance, it looked like an AWS outage.

It wasn't.

This is the story of how a simple DNS resolution issue brought an entire deployment pipeline to a halt—and how we fixed it.

The Project

The application was an AI-powered Due Diligence Platform built with:

  • AWS Lambda
  • Amazon Bedrock
  • Amazon DynamoDB
  • Amazon SQS
  • Serverless Framework

Deployment flow:

Developer
     ↓
Serverless Framework
     ↓
S3 Deployment Bucket
     ↓
CloudFormation
     ↓
Lambda Functions
Enter fullscreen mode Exit fullscreen mode

Every deployment package is first uploaded to an S3 bucket created by the Serverless Framework.

Only after the upload succeeds does CloudFormation update the stack.

The Error

During deployment, the terminal suddenly returned:

serverless deploy

✖ Error:
getaddrinfo EAI_AGAIN serverless-framework-deployments-eu-north-1-xxxxxxxx.s3.eu-north-1.amazonaws.com
Enter fullscreen mode Exit fullscreen mode

The deployment stopped immediately.

No Lambda updates.

No CloudFormation changes.

Nothing.

First Assumption: AWS Was Down

The first thing I checked was AWS Service Health.

Everything was operational.

  • S3 was healthy
  • CloudFormation was healthy
  • Lambda was healthy

No incidents were reported.

Second Assumption: IAM Permissions

The next suspect was permissions.

I verified:

aws sts get-caller-identity
Enter fullscreen mode Exit fullscreen mode

Response:

{
  "Account": "123456789012",
  "Arn": "arn:aws:iam::123456789012:user/developer"
}
Enter fullscreen mode Exit fullscreen mode

Credentials were valid.

Permissions were correct.

Still failing.

Third Assumption: Serverless Framework Bug

I upgraded Serverless Framework.

npm install -g serverless
Enter fullscreen mode Exit fullscreen mode

Deployment still failed.

The Real Problem

The key clue was:

EAI_AGAIN
Enter fullscreen mode Exit fullscreen mode

This is not an AWS error.

It is a DNS resolution error.

Specifically:

EAI_AGAIN
=
Temporary DNS lookup failure
Enter fullscreen mode Exit fullscreen mode

The operating system could not resolve the S3 endpoint hostname.

The request never reached AWS.

How We Confirmed It

I manually tested DNS resolution:

nslookup google.com
Enter fullscreen mode Exit fullscreen mode

Intermittent failures appeared.

Then:

nslookup s3.eu-north-1.amazonaws.com
Enter fullscreen mode Exit fullscreen mode

The same issue occurred.

This confirmed that the problem existed locally.

Not in AWS.

Root Cause

The machine was using an unstable DNS resolver.

Under heavy network usage, DNS lookups occasionally timed out.

When Serverless Framework attempted to upload artifacts to S3:

Serverless
      ↓
DNS Lookup
      ↓
Failure
      ↓
Deployment Stops
Enter fullscreen mode Exit fullscreen mode

No connection to AWS was ever established.

The Fix

We switched to reliable public DNS servers.

Linux:

sudo nano /etc/resolv.conf
Enter fullscreen mode Exit fullscreen mode

Added:

nameserver 8.8.8.8
nameserver 1.1.1.1
Enter fullscreen mode Exit fullscreen mode

Then restarted networking:

sudo systemctl restart NetworkManager
Enter fullscreen mode Exit fullscreen mode

Validation

After updating DNS:

nslookup s3.eu-north-1.amazonaws.com
Enter fullscreen mode Exit fullscreen mode

Returned instantly.

Deployment succeeded:

serverless deploy
Enter fullscreen mode Exit fullscreen mode

Output:

✔ Service deployed successfully
Enter fullscreen mode Exit fullscreen mode

Additional Improvements

To avoid future issues, we added several safeguards.

Retry Logic

serverless deploy || serverless deploy
Enter fullscreen mode Exit fullscreen mode

Useful for CI/CD jobs when transient network issues occur.

Connectivity Check

Before deployment:

curl https://s3.eu-north-1.amazonaws.com
Enter fullscreen mode Exit fullscreen mode

If connectivity fails:

Stop deployment
Enter fullscreen mode Exit fullscreen mode

This prevents wasting build minutes on doomed deployments.

AWS Credential Validation

Added:

aws sts get-caller-identity
Enter fullscreen mode Exit fullscreen mode

to deployment pipelines.

This immediately detects expired or invalid credentials.

Lessons Learned

The biggest lesson was simple:

Not every AWS deployment error is actually an AWS problem.

Sometimes:

  • DNS fails
  • Local networking fails
  • VPNs interfere
  • Corporate firewalls interfere

And the cloud gets blamed.

Production Debugging Framework

When deployment issues occur, I now follow this order:

Step 1 — Validate AWS Credentials

aws sts get-caller-identity
Enter fullscreen mode Exit fullscreen mode

Step 2 — Validate Internet Connectivity

ping google.com
Enter fullscreen mode Exit fullscreen mode

Step 3 — Validate DNS

nslookup s3.eu-north-1.amazonaws.com
Enter fullscreen mode Exit fullscreen mode

Step 4 — Validate AWS Services

aws s3 ls
Enter fullscreen mode Exit fullscreen mode

Step 5 — Run Deployment

serverless deploy
Enter fullscreen mode Exit fullscreen mode

This process has saved hours of troubleshooting.

Final Thoughts

As engineers, we often expect complex problems to have complex causes.

This incident reminded me that some of the most disruptive failures originate from the most basic layers of infrastructure.

A single DNS lookup failure stopped an entire deployment pipeline.

The code was correct.

AWS was healthy.

The architecture was sound.

But none of that mattered until the network could resolve a hostname.

Sometimes the fastest fix isn't changing code.

It's understanding where the request actually fails.

Key Takeaway

Before blaming AWS:

  1. Check credentials
  2. Check connectivity
  3. Check DNS
  4. Check local networking
  5. Then investigate cloud services

You'll save yourself hours of debugging.

Top comments (0)