Cloud · October 10, 2025 · 5 min read

How We Cut Our AWS Bill by 40%: The Full Breakdown

A retail company running servers 24/7 for a business that closes at midnight. The biggest saving wasn't sophisticated — it was just turning things off. Plus NAT Gateways, right-sizing, and VPN cleanup.

Tags: AWS, Cost Optimization, Cloud, FinOps



Cloud cost visibility — knowing where every dollar goes



We were a hard discount retailer. Physical shops, distribution centres, offices — no e-commerce, no online orders, just the kind of business where your operational hours are completely predictable. Servers needed to be up from around 5am when the first distribution shifts started, through to midnight when the last store reporting and end-of-day processing finished.


Notice anything interesting about that window? It leaves roughly five hours every night when the platform is, for all practical purposes, idle.


Our AWS bill did not reflect this. It was growing steadily month on month, and when we finally sat down to understand why, the answer was equal parts obvious and embarrassing. This is the story of what we found and what we did about it.


The Audit: Start with Tagging


Before you can cut costs, you need to see them clearly. Our first problem was that about a third of our EC2 instances had incomplete or missing tags. We could not attribute their cost to a specific team, service, or environment. We were paying for infrastructure we couldn't identify.


We enforced tagging across the board — environment, team, service, cost-center — and made it mandatory going forward via AWS Config rules that flag untagged resources automatically. Only once we could see the full cost breakdown clearly could we make real decisions.
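
The enforcement itself lives in AWS Config, but the policy reduces to a simple question: does every resource carry all four keys? A local sketch of that check (a hypothetical `check_tags` helper, not the actual Config rule):

```shell
# The four tag keys we made mandatory
required_tags="environment team service cost-center"

# Succeed only if a resource's comma-separated tag keys cover every required key
check_tags() {
  local resource=$1 tags=$2 key
  for key in $required_tags; do
    if ! echo "$tags" | tr ',' '\n' | grep -qx "$key"; then
      echo "$resource: missing tag '$key'"
      return 1
    fi
  done
  echo "$resource: ok"
}

check_tags i-0abc123 "environment,team,service,cost-center"
check_tags i-0def456 "environment,team" || true   # flagged: first missing key reported
```

In the real setup, AWS Config's managed required-tags rule does this evaluation continuously and feeds the notification pipeline.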


The rough picture that emerged: EC2 was the dominant cost as expected, followed by data transfer, NAT Gateway charges that were suspiciously high, and VPN-related network costs that deserved a closer look.


Strategy 1: Scheduled Shutdowns — The Obvious Win


This one still makes me laugh a little, because it was so simple.


We had non-essential EC2 instances — development environments, internal tooling, staging servers, batch processing workers — running 24 hours a day, seven days a week. For a business that operates 5am to midnight. That's five hours every night and a chunk of every weekend being paid for with essentially nothing to show for it.
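
The back-of-the-envelope arithmetic on that idle window, before even counting weekends:

```shell
# Five idle hours out of every 24
awk 'BEGIN { printf "%.1f%% of every day paid for while idle\n", 5 / 24 * 100 }'
```

Weekends push the real figure higher, which lines up with the roughly 20% saving this strategy ended up delivering.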


We implemented scheduled start/stop using EventBridge rules and AWS Systems Manager. Non-critical instances stop at 1am, start at 4:30am. Critical services — production databases, monitoring, inventory systems, POS connectivity — stay up. Everything else goes to sleep.


```bash
# EventBridge rule to stop non-essential instances at 1am
aws events put-rule \
  --name "NightlyShutdown" \
  --schedule-expression "cron(0 1 * * ? *)" \
  --state ENABLED

# Tag instances you want included in the shutdown;
# instances tagged with scheduled-shutdown=true are targeted by the SSM automation
aws ec2 create-tags \
  --resources i-0abc123 i-0def456 \
  --tags Key=scheduled-shutdown,Value=true
```

The SSM Automation document handles the actual start/stop against instances with the right tag. You define the tag, the schedule, and which environments are in scope. Development, staging, and internal tooling are in. Production core services are out.


Saving: approximately 20% of the monthly bill. The largest single improvement we made, and it took about a day to implement properly.


Strategy 2: Right-Sizing EC2


Once we could see our instances clearly, the over-provisioning was hard to look at.


We pulled 30 days of CloudWatch CPU and memory metrics across the fleet. A significant portion of instances were averaging CPU utilization in the single digits. We had instances provisioned for anticipated future load that never arrived, and nobody had revisited the sizing since.


The process:

  • Export 30-day average CPU, memory, and network metrics per instance from CloudWatch
  • Flag instances averaging below 20% CPU as candidates
  • Select a target type at roughly 50-60% of current capacity
  • Validate in staging, then resize production during a low-traffic window
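
The flagging step in that list is easy to sketch locally. Assuming the export has been reduced to a simple CSV of instance ID and 30-day average CPU (an illustrative layout, not the raw CloudWatch output):

```shell
# Hypothetical export: instance_id,avg_cpu_percent over 30 days
cat > cpu_export.csv <<'EOF'
i-0abc123,4.2
i-0def456,61.0
i-0ghi789,12.8
EOF

# Flag anything averaging below 20% CPU as a right-sizing candidate
candidates=$(awk -F, '$2 < 20 { print $1 }' cpu_export.csv)
echo "$candidates"
```

Every candidate still gets a human look before resizing; sustained low CPU with high memory or network use is not a downsizing case.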

Production application servers moved down to appropriately sized instances. Staging moved to smaller types entirely. For services with predictable traffic patterns, we added Auto Scaling with CPU-based target tracking — so they scale out during business hours and back in overnight rather than sitting at peak-provisioned capacity around the clock.


Saving: roughly 10-12% of the overall bill.


Strategy 3: The NAT Gateway Surprise


This one was not obvious until we looked at the numbers.


A NAT Gateway charges on two dimensions: hourly (modest) and per GB of data processed (not modest). Our data processing charges were far higher than they should have been, so we enabled VPC Flow Logs and used Athena to query the traffic patterns.
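
Athena did the aggregation for us, but the shape of the analysis is just a group-by on destination address and bytes. A rough local sketch over raw flow log lines, assuming the default log format (destination address in field 5, bytes in field 10; the records below are invented):

```shell
# Sample VPC Flow Log records in the default format
cat > flowlogs.txt <<'EOF'
2 123456789012 eni-0a1 10.0.1.5 52.216.0.10 44321 443 6 120 9000000 1600000000 1600000060 ACCEPT OK
2 123456789012 eni-0a1 10.0.1.6 52.216.0.10 44500 443 6 80 6000000 1600000000 1600000060 ACCEPT OK
2 123456789012 eni-0a2 10.0.1.5 93.184.216.34 44600 443 6 10 50000 1600000000 1600000060 ACCEPT OK
EOF

# Sum bytes per destination; the heavy endpoints jump out immediately
top_dest=$(awk '{ bytes[$5] += $10 } END { for (d in bytes) print bytes[d], d }' flowlogs.txt \
  | sort -rn | head -1 | awk '{ print $2 }')
echo "$top_dest"
```

In our case the top destinations resolved to S3, which pointed straight at the missing gateway endpoints.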


The culprit: EC2 instances were routing all traffic to S3 and DynamoDB through the NAT Gateway. Every API call to S3, every database read from DynamoDB — all of it flowing through NAT and being charged per GB.


The fix is free: VPC Gateway Endpoints for S3 and DynamoDB route that traffic directly within the AWS network, bypassing the NAT Gateway entirely.


```bash
# VPC Endpoint for S3 — free, immediate impact
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0abc123def456789 \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-0abc123 rtb-0def456 rtb-0ghi789

# VPC Endpoint for DynamoDB
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0abc123def456789 \
  --service-name com.amazonaws.us-east-1.dynamodb \
  --route-table-ids rtb-0abc123 rtb-0def456 rtb-0ghi789
```

NAT Gateway data processing charges dropped significantly within the first week. The remaining traffic through NAT is legitimate internet-bound traffic, which is expected.


Saving: roughly 8% of the overall bill.


Strategy 4: Site-to-Site VPN Optimization


We had Site-to-Site VPN connections between AWS and two types of on-premises systems: distribution centre infrastructure for inventory and supply chain sync, and POS systems across 200 store locations. The connections themselves weren't expensive. The traffic routing was.


Two hundred stores all phoning home to AWS means a non-trivial volume of constant traffic — end-of-day sales reporting, inventory updates, price sync, shift data. The problem was that our routing tables were too broad. Traffic that could have stayed within AWS was taking unnecessary detours through the VPN tunnels and back, and a good portion of store-to-store or store-to-warehouse communication was being routed via AWS when it didn't need to be.


We audited every VPN route and tightened the tables to only send traffic that genuinely needed to traverse the tunnel. AWS-internal traffic stayed internal. We also enabled CloudWatch monitoring on all VPN tunnels to track throughput and data-out bytes per connection — previously this was completely invisible, which was a big part of why it had gone unnoticed.


For high-volume, latency-tolerant sync jobs — end-of-day store reports, bulk inventory reconciliation — we moved from continuous streaming to scheduled batch jobs with compression. The data volume going over the tunnels dropped significantly, and the stores didn't notice the difference.
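
The compression half of that change is nothing exotic; conceptually, the nightly job just gzips the report before it crosses the tunnel (file name and contents here are illustrative):

```shell
# Generate a sample end-of-day report: repetitive CSV, like real store data
seq 1 1000 | awk '{ print "store-042,sku-" $1 ",qty," $1 % 7 }' > eod_report.csv

# Compress before transfer
gzip -c eod_report.csv > eod_report.csv.gz

# Compare sizes; repetitive CSV compresses heavily
orig=$(wc -c < eod_report.csv)
comp=$(wc -c < eod_report.csv.gz)
echo "original: $orig bytes, compressed: $comp bytes"
```

Structured retail data like this routinely compresses several-fold, which translates directly into fewer GB over the tunnels.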


Saving: meaningful reduction in data transfer costs across a large number of connections, plus visibility we should have had from day one.


Strategy 5: Tagging, Monitoring, and Staying Honest


Cost optimization is not a project with an end date. Without governance, costs drift back up within months as new resources get provisioned without discipline.


What we put in place:


Mandatory tagging enforcement: AWS Config rules flag any untagged resource within minutes of creation and notify the owning team. New resources without proper tags get escalated before the next billing cycle.


Cost Anomaly Detection: AWS Cost Anomaly Detection monitors per-service spend and alerts via SNS — and from there to Slack — when any category increases more than 20% in a rolling 7-day window. We caught a misconfigured data transfer pattern this way within four days of it starting, rather than discovering it at month end.
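
The comparison behind that alert is easy to reason about. A local sketch of the 20%-over-rolling-7-day check (the real alerting uses AWS Cost Anomaly Detection, not a script like this; the spend figures are invented):

```shell
# 14 days of per-service daily spend, oldest first (invented figures)
spend="100 102 99 101 100 98 100 125 130 128 126 131 129 127"

# Average the prior 7 days and the most recent 7 days
prior=$(echo $spend | awk '{ for (i = 1; i <= 7; i++) s += $i; print s / 7 }')
current=$(echo $spend | awk '{ for (i = 8; i <= 14; i++) s += $i; print s / 7 }')

# Alert when the rolling average rises more than 20%
alert=$(awk -v p="$prior" -v c="$current" 'BEGIN { print (c > p * 1.2) ? "ALERT" : "ok" }')
echo "$alert: prior avg $prior, current avg $current"
```

The point of the rolling window is that a single expensive day gets absorbed, but a sustained shift in spend trips the alert within days.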


Throughput and bandwidth dashboards: CloudWatch dashboards covering NAT Gateway throughput, VPN data-out, and inter-AZ traffic. These numbers had been invisible before. Making them visible means abnormal patterns get noticed quickly.


The Result


No single strategy got us to 40%. It was the combination:


  • Scheduled shutdowns: ~20% — the standout
  • EC2 right-sizing + Auto Scaling: ~12%
  • NAT Gateway fix: ~8%
  • VPN route optimization + data transfer: meaningful but harder to measure precisely
  • Tagging and governance: not a direct saving, but prevents the bill from growing back

The biggest lesson: the most impactful change was also the simplest. We were a retail business paying for servers to run overnight while nobody was shopping. Turning them off on a schedule required no architectural changes, no long migration project, no budget approval for new tooling. It required someone asking the question nobody had asked: why are these on right now?


Ask that question about your infrastructure. The answer might surprise you.