+------------------------------------------------------------------------------+
|                                                                              |
|  Dhananjaya D R                   @/Blog   @/Software   @/Resume   @/Contact |
|                                                                              |
+------------------------------------------------------------------------------+


AWS Networking Costs Case Study
________________________________________________________________________________

I have been using AWS long enough to know the basics. But every time I think I
have it figured out, there is something new to learn. Sometimes those lessons
come with a big price tag.

A couple of weeks ago, a peer asked me to audit their AWS costs. Things were
looking good after the standard optimizations. Then I watched a TV show where
the protagonist was told to "follow the money." That phrase stuck with me. I
logged into their AWS Console and clicked through:

Billing and Cost Management -> Billing and Payments -> Bills. 

That's when I saw it:

> $0.045 per GB Data Processed by NAT Gateways: 87,657.984 GB — USD 3,944.61

This scratched a part of my brain. The team uses Terraform for infra
automation. When they bring up a machine, they attach it to a subnet that still
carries a default legacy configuration. I traced back the route table for that
subnet and saw the route: 0.0.0.0/0 -> nat-gateway (10.0.4.xxx).
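
In Terraform terms, that legacy wiring boils down to something like the sketch
below. This is my reconstruction, not their actual code; the resource names are
made up.

    # Hypothetical reconstruction of the legacy subnet wiring. Every instance
    # lands in this subnet, so every outbound byte follows the 0.0.0.0/0 route
    # into the NAT Gateway and gets billed at $0.045 per GB processed.
    resource "aws_route_table" "private" {
      vpc_id = aws_vpc.main.id
    }

    resource "aws_route" "default_via_nat" {
      route_table_id         = aws_route_table.private.id
      destination_cidr_block = "0.0.0.0/0"
      nat_gateway_id         = aws_nat_gateway.main.id
    }

    resource "aws_route_table_association" "private" {
      subnet_id      = aws_subnet.private.id
      route_table_id = aws_route_table.private.id
    }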

To confirm my suspicion, I logged into one of the EC2 instances they had spun up
and ran a traceroute to S3:

+------------------------------------------------------------------------------+
| ubuntu@ip-10-0-2x-x5:~$ sudo traceroute -n -T -p 443 s3.us-east-1.amaz....   |
|                                                                              |
| traceroute to s3.us-east-1.amazonaws.com (16.15.xxx.xxx), 30 hops max, 60 x  |
| 1  10.0.4.xxx     0.890 ms  1.173 ms  0.914 ms                               |
| 2  * * *                                                                     |
| 3  16.15.xxx.xxx  3.808 ms  3.723 ms  3.583 ms                               |
+------------------------------------------------------------------------------+

You see, on the first hop (10.0.4.xxx) the traffic was going to the NAT
Gateway. Even if you are making requests to S3 (an AWS service in the same
region), if you are in a private subnet without the right configuration, that
traffic is routed out through your NAT Gateway and back in, and every gigabyte
the gateway processes is metered.

They were essentially paying money to talk to a service that lives next door.
    
 +----------+       +---------+       +----------+       +--------+
 |   EC2    |       |   NAT   |       |          |       |        |
 |          |------>|         |------>| INTERNET |------>|   S3   |
 | INSTANCE |       | GATEWAY |       |          |       |        |
 +----------+       +---------+       +----------+       +--------+

The solution? VPC Endpoints for S3, specifically what AWS calls a 
"Gateway Endpoint." A Gateway Endpoint is a special type of VPC endpoint that 
allows you to privately route traffic to S3 without going through your NAT 
Gateway or Internet Gateway. It's essentially a direct pipe from the VPC to S3.
Even better, Gateway Endpoints for S3 are completely free. No hourly charges, 
no data transfer charges. Nothing.

 +----------+       +----------------------+       +--------+
 |   EC2    |       |                      |       |        |
 |          |------>| VPC GATEWAY ENDPOINT |------>|   S3   |
 | INSTANCE |       |                      |       |        |
 +----------+       +----------------------+       +--------+
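
In Terraform, wiring this up is only a few lines. A sketch, reusing the
hypothetical route table name from earlier; the service name matches their
region (us-east-1):

    # Gateway endpoint for S3: associates the private route table so S3-bound
    # traffic takes a direct route instead of going through the NAT Gateway.
    resource "aws_vpc_endpoint" "s3" {
      vpc_id            = aws_vpc.main.id
      service_name      = "com.amazonaws.us-east-1.s3"
      vpc_endpoint_type = "Gateway"
      route_table_ids   = [aws_route_table.private.id]
    }

Once the endpoint is associated, AWS adds a route for the S3 prefix list to
that route table, and the more specific prefix-list route wins over the
0.0.0.0/0 route to the NAT Gateway.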

However, I didn't stop there. The S3 traffic alone couldn't account for nearly
88 TB of data. I looked at their AMIs. Surprise, surprise: they were 8 months
old. Every time they brought up a machine, the boot script would run "apt
update" and "apt upgrade" to pull the latest security patches. Because the base
image hadn't been refreshed, every single new instance was downloading hundreds
of MBs of updates. And because the Gateway Endpoint wasn't in place yet, all of
those repo calls were also going out through the NAT Gateway.
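
The durable fix is to bake fresher base images on a schedule. Even then it only
helps if Terraform actually picks them up, so instead of pinning an
eight-month-old AMI ID, let it resolve the newest matching image. A rough
sketch, using a made-up naming convention:

    # Always resolve the newest baked image matching the name pattern, so new
    # instances boot with recent patches already installed.
    data "aws_ami" "base" {
      most_recent = true
      owners      = ["self"]        # assumes the team publishes its own AMIs

      filter {
        name   = "name"
        values = ["app-base-*"]     # hypothetical naming convention
      }
    }

    resource "aws_instance" "app" {
      ami           = data.aws_ami.base.id
      instance_type = "t3.medium"   # placeholder
      subnet_id     = aws_subnet.private.id
    }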


                                            ,,,, 
                                      ,;) .';;;;',
                          ;;,,_,-.-.,;;'_,|I\;;;/),,_
                           `';;/:|:);{ ;;;|| \;/ /;;;\__
                               L;/-';/ \;;\',/;\/;;;.') \
                               .:`''` - \;;'.__/;;;/  . _'-._ 
                             .'/   \     \;;;;;;/.'_7:.  '). \_
                           .''/     | '._ );}{;//.'    '-:  '.,L
With great cloud power   .'. /       \  ( |;;;/_/         \._./;\   _,
comes great AWS bills!    . /        |\ ( /;;/_/             ';;;\,;;_,
                         . /         )__(/;;/_/                (;;'''''
                          /        _;:':;;;;:';-._             );
                         /        /   \  `'`   --.'-._         \/
                                .'     '.  ,'         '-,
                               /    /   r--,..__       '.\
                             .'    '  .'        '--._     ]
                             (     :.(;>        _ .' '- ;/
                             |      /:;(    ,_.';(   __.'
                              '- -'"|;:/    (;;;;-'--'
                                    |;/      ;;(
                                    ''      /;;|
                                            \;;|
                                             \/



The cost was one thing, but they were also facing performance issues. During
capacity tests, they kept hitting connection timeouts. There were plenty of
theories about the root cause, but none of them held up. Let's go back to that
Terraform automation and the default subnet configuration. Because every
machine they brought up landed in the same subnet, all outbound traffic was
funneled through a single NAT Gateway with a single primary IP address.

+------------------------------------------------------------------------------+
| Pvt-RT: 0.0.0.0/0 -> NAT              Pub-RT: 0.0.0.0/0 -> IGW               |
|          |                                      |                            |
| +--------|------------------VPC (AZ-a)----------|--------------------------+ |
| | +------v----------------+        +------------v-------------+            | |
| | | Private Subnet        |        | Public Subnet            |            | |
| | |                       |        |                          |            | |
| | |  [ASG 1]    [ASG 2]   |        |       +--------+        +-------+     | |
| | |   (EC2)      (EC2)----|--------|------>| NAT-GW |------->|  IGW  |---> | |
| | |                       |        |       +--------+        +-------+     | |
| | +-----------------------+        |          |               |            | |
| |                                  +----------|---------------+            | |
| +---------------------------------------------|----------------------------+ |
|                                               v                   Public Svr |
|                                     EIPs: 54.17...                192.0.2.29 |
+------------------------------------------------------------------------------+

I checked the CloudWatch metrics for the duration of one of their tests.

[*] ActiveConnectionCount: 773k (Limit is 55k per destination tuple)
[*] ErrorPortAllocation: 345k
[*] ConnectionAttemptCount: 226k

AWS NAT Gateways have a hard limit on concurrent connections: roughly 55k
simultaneous connections to each unique destination (destination IP,
destination port, and protocol). They were smashing straight into SNAT port
exhaustion. More than half of the connections were being dropped simply because
the NAT Gateway had run out of source ports to map the private IPs onto its
single public IP.
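
If your fleet sits behind a NAT Gateway, this is worth alarming on before a
load test finds it for you. A minimal sketch, assuming the NAT Gateway is
managed in the same Terraform state:

    # Alarm when the NAT Gateway fails to allocate source ports, i.e. when
    # SNAT port exhaustion starts dropping new connections.
    resource "aws_cloudwatch_metric_alarm" "nat_port_exhaustion" {
      alarm_name          = "nat-gw-port-allocation-errors"
      namespace           = "AWS/NATGateway"
      metric_name         = "ErrorPortAllocation"
      statistic           = "Sum"
      period              = 300
      evaluation_periods  = 1
      threshold           = 0
      comparison_operator = "GreaterThanThreshold"

      dimensions = {
        NatGatewayId = aws_nat_gateway.main.id
      }
    }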

Follow the money. It usually leads you to something.
________________________________________________________________________________