Writing

Adventures in S3 Cost Optimization

I recently worked on a project to dial down the burn on an AWS bill. There was the standard stuff like purchasing savings plans/RIs and getting rid of orphaned EBS volumes (all those lonely EBS orphans wandering the streets), but S3 was where a big chunk of savings lived.

There were three main levers: finding a bunch of uncompressed logs, setting up storage tiering for things that could roll off of the Standard storage class, and adding an S3 gateway endpoint to the VPC.

.gz was a lie

The biggest bucket was a log bucket — many terabytes. Spot-checking the log files, I noticed they all had .gz extensions but were larger than I’d expect given the logging config (dozens of MB vs. single digits). When I pulled one down to open it, gzip choked on it.

Long story short: I looked at the encoding and figured out it was actually raw JSON. Womp womp. The compress gzip setting was missing from the application’s fluentd config, but .gz was specified in the file name template.

The fix? Caveman brain said “we could run a local script to loop through all the files and zip them,” but my wiser angels knew there were millions of files and that would take forever. S3 Batch Operations to the rescue. I Lambda-ized the script I would have written to run locally and had the Batch Operation process the bucket: find unzipped files, gzip them, and write new objects; a day later, a lifecycle rule expired the originals.
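Roughly what that Lambda looks like (a minimal sketch, not the exact function I ran; the destination prefix is illustrative, and a real version needs better error handling):

import gzip
import urllib.parse

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # S3 Batch Operations invokes the function with one or more tasks,
    # each pointing at an object from the job's manifest.
    results = []
    for task in event["tasks"]:
        bucket = task["s3BucketArn"].split(":::")[-1]
        key = urllib.parse.unquote_plus(task["s3Key"])
        try:
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            if body[:2] != b"\x1f\x8b":  # not actually gzipped, despite the .gz name
                s3.put_object(
                    Bucket=bucket,
                    Key="compressed/" + key,  # illustrative; the originals expire via lifecycle
                    Body=gzip.compress(body),
                    ContentEncoding="gzip",
                )
            code, msg = "Succeeded", ""
        except Exception as exc:
            code, msg = "PermanentFailure", str(exc)
        results.append({"taskId": task["taskId"], "resultCode": code, "resultString": msg})

    # Batch Operations expects this response shape back.
    return {
        "invocationSchemaVersion": event["invocationSchemaVersion"],
        "treatMissingKeysAs": "PermanentFailure",
        "invocationId": event["invocationId"],
        "results": results,
    }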

Result: ~$10k in annual savings

What do we use this stuff for?

The next thing I looked at was Lifecycle rules and storage tiering. Everything (logs, FE assets, customer uploads) was in the Standard tier. I looked at turning on Intelligent-Tiering, but given the number of objects in the buckets, its per-object monitoring fee would have eaten a big chunk of the cost savings I’d just found. So I did a one-time Storage Class Analysis instead and waited a day for the results.
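Turning the analysis on is one call per bucket. A boto3 sketch (the bucket names and config ID are placeholders):

import boto3

s3 = boto3.client("s3")

s3.put_bucket_analytics_configuration(
    Bucket="example-assets-bucket",  # placeholder
    Id="whole-bucket-analysis",
    AnalyticsConfiguration={
        "Id": "whole-bucket-analysis",
        # No Filter means the whole bucket gets analyzed.
        "StorageClassAnalysis": {
            "DataExport": {
                "OutputSchemaVersion": "V_1",
                "Destination": {
                    "S3BucketDestination": {
                        "Format": "CSV",
                        "Bucket": "arn:aws:s3:::example-analytics-exports",  # placeholder
                        "Prefix": "storage-class-analysis/",
                    }
                },
            }
        },
    },
)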

As I expected, the log bucket was never accessed (or we’d have known the logs weren’t gzipped), and the other large buckets had a predictable drop-off: after 180 days, access fell to almost zero for most objects.

I knew we were going to want to work with the logs more, so I didn’t want to toss them into Standard-IA (Infrequent Access) by default, even though actual use would still likely be infrequent. I opted for a hedge: everything older than a year goes to IA.

In writing the Lifecycle rule to handle rotation, I came across a little tidbit I hadn’t known about before: the IA tiers have a 128 KB minimum billable object size, so anything smaller is charged as if it were 128 KB, and buckets with lots of small objects get hit hardest. For the log buckets, this was fine; they were all big boys. But it didn’t make sense for the other buckets.

For the non-log buckets, I implemented a Lifecycle rule that rotated objects to IA after 180 days, but only if they were larger than 128 KB; otherwise, in most cases it was cheaper for them to stay in Standard tier.
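Expressed as a Lifecycle configuration, it looks roughly like this (a boto3 sketch; the bucket name and rule ID are placeholders, and 131072 bytes is 128 KB):

import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-assets-bucket",  # placeholder
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "standard-ia-after-180-days",
                "Status": "Enabled",
                # Only transition objects big enough to clear the 128 KB minimum.
                "Filter": {"ObjectSizeGreaterThan": 131072},
                "Transitions": [{"Days": 180, "StorageClass": "STANDARD_IA"}],
            }
        ]
    },
)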

Result: Another ~$15k in annual savings

Finally, I dug into my least favorite AWS billing line item: EC2-Other. It’s just a grab bag of often nearly invisible costs that can sometimes add up to quite a lot. No one ever really owns it and the line item descriptions tend to be… not helpful.

Hmm… why is Amazon Elastic Compute Cloud NatGateway nearly $2k a month? That seemed high relative to the overall compute bill. I suspected most of it was S3 traffic, so I used VPC Flow Logs via Athena to confirm. If you’ve got flow logs in S3, you can point Athena at them (or use “Create table from Flow Logs” in the VPC console) and run something like this to see where the NAT bytes went, filtering on your NAT gateway’s ENI as interface_id:

SELECT dstaddr, SUM(bytes) AS total_bytes
FROM vpc_flow_logs
WHERE interface_id = 'eni-xxxxxxxx'   -- NAT gateway ENI from the VPC console
  AND date BETWEEN '2025-01-01' AND '2025-12-31'
GROUP BY dstaddr
ORDER BY total_bytes DESC
LIMIT 20;

That gives you the top destinations by volume; cross-reference the IPs against the published AWS IP ranges (or just look up the big ones) and you’ll see the S3/ECR prefixes.
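If you don’t want to eyeball IPs, AWS publishes its ranges as JSON. A quick sketch of matching a destination against them (the example IP is illustrative):

import json
import urllib.request
from ipaddress import ip_address, ip_network

RANGES_URL = "https://ip-ranges.amazonaws.com/ip-ranges.json"

def aws_services_for(addr):
    # Return the AWS service labels whose published IPv4 prefixes contain this IP.
    prefixes = json.load(urllib.request.urlopen(RANGES_URL))["prefixes"]
    ip = ip_address(addr)
    return sorted({p["service"] for p in prefixes if ip in ip_network(p["ip_prefix"])})

print(aws_services_for("52.216.0.1"))  # illustrative IP; prints something like ['AMAZON', 'S3']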

Sure enough, most of the NAT traffic was hairpinning right back into AWS for S3 and ECR.

The fix here is simple: VPC endpoints. I added an S3 gateway endpoint and ECR interface endpoints and associated them with the route tables (S3) or subnets (ECR) for the app. The ECR proportion was only 10% of the overall NAT traffic, but at that volume the interface endpoint paid for itself. (Gateway endpoints are free.)
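Creating them is a couple of API calls. A sketch (the region and all the IDs are placeholders; ECR wants both the api and dkr endpoints):

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region

# S3 gateway endpoint: attaches to route tables, no hourly charge.
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-east-1.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],
)

# ECR interface endpoints: ecr.api for the ECR API, ecr.dkr for Docker image pulls.
for service in ("com.amazonaws.us-east-1.ecr.api", "com.amazonaws.us-east-1.ecr.dkr"):
    ec2.create_vpc_endpoint(
        VpcEndpointType="Interface",
        VpcId="vpc-0123456789abcdef0",
        ServiceName=service,
        SubnetIds=["subnet-0123456789abcdef0"],
        SecurityGroupIds=["sg-0123456789abcdef0"],
        PrivateDnsEnabled=True,
    )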

Result: ~$18k in annual savings

The tally

$43k in savings isn’t bad for a few hours’ work spread across a couple of days. I think that’s part of the reason I’ve always enjoyed cost optimization. The impact is usually really quick and meaningful. There’s often a decent amount of low-hanging (if somewhat obscure) fruit that is satisfying to pluck. As Bluey’s mum says: sometimes boring stuff is important too.


Fish Stick: A Stateless Incident Management Bot for Slack

I just released Fish Stick, a stateless incident management bot for Slack. I’ve built this bot six+ times at different jobs, so I figured it was time to stop recreating it and share it with the world.

The pitch: Most incident management tools are either too simple (basic Slack workflows) or too complex (enterprise platforms with a million knobs). Fish Stick sits in the middle—more powerful than workflows, simpler than enterprise tools.

Fish Stick Demo

Key features:

  • Random channel names (incident_furious_chicken, incident_brave_penguin)
  • Timeline logging with timestamps
  • Threaded team updates to stakeholders
  • Incident commander tracking and handoffs
  • Auto-generated timeline reports from channel history
  • Private incident support
  • Test mode for game days

The interesting part: it’s completely stateless. No database. No web interface. No OAuth flow. For this use case and this niche, Slack as DB is good enough.

All incident data lives in Slack:

  • Incident metadata → channel properties and pinned messages
  • Timeline → messages in the incident channel
  • Summary → pinned message

You can restart the bot anytime without losing anything. This keeps deployment dead simple.

Built it in TypeScript with the Slack Bolt framework. Supports Socket Mode for local dev (no public URL needed) and HTTP mode for production. Takes about 5 minutes to set up if you use the app manifest.

I’ve pared down the features quite a bit from what I have in other versions of the bot to keep things simple, but will be adding back some things like webhooks over time.

It’s MIT licensed. If you run incidents in Slack and want something between “too basic” and “too much,” check it out: github.com/chrisdodds/fishstick


AI Layoffs

Paycom, OKC’s biggest “tech” company, announced layoffs today that it blamed on AI. Lots of other companies have done the same recently (Microsoft, Salesforce, etc.).

The cynical (but true!) counter-narrative is that AI is just an excuse. Companies have always limped along with pointless inefficiencies.

After a couple years of LLM work, my read is that most “AI optimization” is just someone finally asking obvious questions about bad processes. It’s in-house consulting dressed up as innovation. AI mostly works as a permission structure to kill dumb workflows that could’ve been fixed 20 years ago.

Example: many moons ago I worked at a company where field offices entered truck weights in a spreadsheet, printed and faxed them, and someone at corporate re-typed the numbers into Access. One hour of questioning turned that job into linked Excel sheets. I guess we could’ve slapped an “AI transformation” label on it.

That feels like what’s happening now. “Rub some AI on everything” creates the cover story, but the real driver is people finally saying: “Why are we doing this?”

Using LLMs for meaningful work in a consistent/deterministic way is hard. These companies didn’t all become AI experts overnight.

None of this makes the layoffs less brutal for the people caught up in them, but it does punch some more holes in the AI-is-eating-the-world story.


Monitor Available IPs with Lambda and CloudWatch

I ran into a situation where I needed to keep track of available IPs related to an AWS EKS cluster and couldn’t find any off-the-shelf tooling in AWS or otherwise to do so.

Tangential gripe: the reason I needed the monitor is that EKS doesn’t support adding subnets to a cluster without re-creating it, and the initial subnets were a little too small due to reasons. I wanted a sense of how much runway I had pending AWS fixing the gap or me implementing a workaround.

So, I cobbled together a Lambda function to pull the info and pipe it into CloudWatch.

Gist here

I’m using tags to scope the subnets I want to track, rather than piping in everything – since CloudWatch custom metrics cost money. But you could use whatever filters you wanted.
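The shape of the approach, if you don’t want to click through to the gist (the tag key, namespace, and metric name here are placeholders, not necessarily what the gist uses):

import boto3

ec2 = boto3.client("ec2")
cloudwatch = boto3.client("cloudwatch")

def handler(event, context):
    # Only look at subnets explicitly tagged for monitoring,
    # since each custom metric costs money.
    subnets = ec2.describe_subnets(
        Filters=[{"Name": "tag:ip-monitor", "Values": ["true"]}]  # placeholder tag
    )["Subnets"]

    cloudwatch.put_metric_data(
        Namespace="Custom/VPC",  # placeholder namespace
        MetricData=[
            {
                "MetricName": "AvailableIpAddresses",
                "Dimensions": [{"Name": "SubnetId", "Value": s["SubnetId"]}],
                "Value": s["AvailableIpAddressCount"],
                "Unit": "Count",
            }
            for s in subnets
        ],
    )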

After getting the data into CloudWatch, I was quickly reminded that you can’t alarm on multiple metrics directly, so I used a metric math expression (MIN) to group them instead. This works for up to 10 metrics (this post should really be titled “The numerous, random limitations of AWS”), which luckily was enough in this case.

Then I set up an alarm for the threshold I wanted and tested it – it worked. Fun times.
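For reference, here’s roughly what the grouped alarm ends up looking like in boto3, assuming two tagged subnets and the placeholder namespace from the sketch above; the subnet IDs and threshold are made up:

import boto3

cloudwatch = boto3.client("cloudwatch")

def subnet_metric(metric_id, subnet_id):
    # One entry per subnet; ReturnData=False because only the MIN expression alarms.
    return {
        "Id": metric_id,
        "ReturnData": False,
        "MetricStat": {
            "Metric": {
                "Namespace": "Custom/VPC",
                "MetricName": "AvailableIpAddresses",
                "Dimensions": [{"Name": "SubnetId", "Value": subnet_id}],
            },
            "Period": 300,
            "Stat": "Minimum",
        },
    }

cloudwatch.put_metric_alarm(
    AlarmName="eks-subnets-available-ips-low",  # placeholder
    Metrics=[
        subnet_metric("m1", "subnet-0123456789abcdef0"),
        subnet_metric("m2", "subnet-0fedcba9876543210"),
        {"Id": "e1", "Expression": "MIN([m1, m2])", "Label": "Lowest available IPs", "ReturnData": True},
    ],
    ComparisonOperator="LessThanThreshold",
    Threshold=50,  # placeholder runway threshold
    EvaluationPeriods=1,
)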


How to Live-Rotate PostgreSQL Credentials

OK, I didn’t actually learn this today, but it wasn’t that long ago.

Postgres creds rotation is straightforward, with the exception of the PG maintainers deciding that words don’t mean anything when designing their identity model. “Users” and “Groups” used to exist in PG, but they were replaced in version 8.1 with the “Role” construct.

In Postgres, everything is a “Role.” A user is a role. A group is a role. A role is a role. If you’re familiar with literally any other identity system, just mentally translate “Role” to whatever makes sense in context.

Now that we’ve established this nonsense, here’s a way of handling live creds rotation.

CREATE ROLE user_group; -- create a group role, give it the appropriate grants.

CREATE ROLE user_blue WITH LOGIN ENCRYPTED PASSWORD 'REPLACE ME' IN ROLE user_group;

-- This one isn't being used yet, so login stays disabled.
CREATE ROLE user_green WITH ENCRYPTED PASSWORD 'REPLACE ME AS WELL' IN ROLE user_group NOLOGIN;

That gets you prepped. When you’re ready to flip things:

ALTER USER user_green WITH PASSWORD 'new_password' LOGIN;

Update the creds wherever else they need updating, restart processes, and confirm everything is using the new credentials.
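A quick way to check for stragglers before you cut over (a sketch; assumes clients connect directly as user_blue rather than through a pooler):

SELECT usename, count(*)
FROM pg_stat_activity
WHERE usename = 'user_blue'
GROUP BY usename;

Once that comes back empty, disable the old role: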

ALTER USER user_blue WITH PASSWORD 'new_password_2' NOLOGIN;

Easy, peasy.