Charity is the co-founder and CTO of Honeycomb. She pioneered the concept of modern observability, drawing on her years of experience building and managing massive distributed systems at Parse (acquired by Facebook), Facebook, and Linden Lab, building Second Life. She is the co-author of Observability Engineering and Database Reliability Engineering (O'Reilly). She loves free speech, free software, and single malt scotch.
Shipping code should be a boring and regular occurrence, says Charity Majors in this episode of the HangarDX podcast, where she doubles down on what she has been preaching for years: continuous deployment (not delivery!) should be the ultimate goal of every software team. Because when the feedback loop between writing code and using it is as short as possible, that's where the magic happens.
Charity and Ankit also discuss deploying on Fridays, the root cause of what she calls the software engineering death spiral, code reviews as knowledge sharing, the difference between ownership and blame, and the role of observability.
I have a sticker that says the fear of deployment is the largest source of technical debt in almost every organization. And it's not deployments, it's fear of deployment. The fear comes from breaking things. It comes from shipping code and then the pit of your stomach falling out as you watch all the numbers go red and everybody in the office turns to look at you. That builds scar tissue.
I come from the ops side of software, and for years, for decades, the operations profession was protecting production with crossed swords. But recently we’ve been building technology, technical stacks, and tools that are more resilient, safer, faster, and more friendly. We shifted from thinking of production as our glass castle to thinking of it more like a playground.
OK, you can’t do continuous deployment to mobile devices. But mobile developers can set up a sort of CI/CD thing that gets the mobile app from the IDE through running tests to a place where they can download things internally. No one thinks you’re a bad developer if you don’t get your app from your IDE to God's ear in 30 seconds.
It’s about cognitive carrying costs. The reason to keep the interval between when you're writing the thing and when you're using it short is that you will never know more about the change than you do right in that moment. You know what you're trying to do, how you did it, what you tried, what worked, what didn't, what the variables are called, what the functions are.
That is your best chance of finding the problems in your code. If you don't find it in that moment, it's probably not going to be you who finds it. It's probably going to be a user or someone weeks, months, or years down the line.
It's not about saying people should deploy on Fridays. It's not actually about Fridays. What it is about is that when you merge your code, you should expect it to go out immediately. It should happen by default. It shouldn't be something that only happens rarely, once in a while.
There's this great saying that shipping is the heartbeat of your company. It should be regular. It should happen consistently. It should be boring. It should happen so often that it's just the normal state of things. That is the safest state for any company to be in.
I don’t like blanket statements, so I won’t say everyone should deploy on Fridays. Some organizations may have looked at the pros and cons and decided it’s not good for them to deploy on Fridays. But I don't think people should pat themselves on the back for it and say they don’t ship on Fridays because they care about people. That's exactly backwards! If you care about your people, then you make sure that they're not getting woken up and called after hours at all times!
Have you ever used a cool mobile app, looked up the company behind it, and realized they have 800 developers? That’s what happens when deploys are slow and flaky, and I call it a software engineering death spiral. It’s when everything takes forever, people have to context switch a lot, and spend a lot of time waiting on each other. Things are backing up, so they can't move very fast, so they have to hire more people - SREs, release engineers, PMs, and more managers. More and more people are doing the same amount of work. And everyone's still stressed. Because they can’t hire their way out of that problem. In fact, the more people they add, the worse it's going to get. And the only way you can fix it is to start fixing the systems themselves.
I work for an observability company so I'm biased, but the truth is that you can't fix what you can't see.
Bring a lot of observability, spend some time training and upleveling people's skills when it comes to navigating production in general, add feature flags, add guardrails around your deployment process, and do little things like trying to close feedback loops by deploying several times a day. Anytime you can shorten those feedback loops, the whole org benefits.
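As an illustration of the feature-flag part of that advice, here is a minimal sketch of a percentage rollout. It is not taken from the episode: the flag table `FLAGS`, the flag name `new_checkout`, and the functions `is_enabled` and `checkout` are all hypothetical names chosen for the example, and a real team would more likely use a dedicated flag service.

```python
import hashlib

# Hypothetical in-memory flag table: flag name -> percentage of users enabled.
# This stands in for whatever flag service a team actually uses.
FLAGS = {"new_checkout": 25}


def is_enabled(flag: str, user_id: str) -> bool:
    """Return True if the flag is on for this user.

    Hashing flag + user gives each user a stable bucket in [0, 100),
    so the same user always gets the same answer for a given percentage.
    """
    percent = FLAGS.get(flag, 0)  # unknown flags default to off
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent


def checkout(user_id: str) -> str:
    # The new code path is deployed but dark; the flag decides who runs it.
    if is_enabled("new_checkout", user_id):
        return "new"
    return "old"
```

The point of the design is that deploying and releasing become separate steps: the code can go out on merge, and turning the percentage back to 0 acts as the guardrail if the numbers go red.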
Ownership is an important part of software engineering. Just because someone isn't pushing the button doesn't mean a developer shouldn't be aware. You don't merge and walk out the door.
We've come so far over the past few years in terms of expecting developers to own their code and production. And it doesn’t mean that we don’t need SREs anymore. The SRE profession is changing. SREs are experts in extremely complex systems. It's not the old kind of SREs where SREs carry the pager and the software engineers don't.
With blame, there is some amount of shame involved. The undertone is that you should have known better. Or you could have a conversation with the developer and tell them that they need to write more tests. You can acknowledge reality without naming and shaming people. And I think that is something where the tone is set at the top. If the person at the top is looking for scapegoats, there's very little that the rank and file can do to overcome that.
The main point of code review is to generate a shared understanding, to ensure that at least one other person really knows and has a stake in what's going on. So far — and AI may change all this — the real product of any software engineering team is shared ownership and shared understanding of a corpus of work.
Also, just the act of having to talk through what you're doing, the choices that you've made, typically makes you find a lot of problems in your own code, that you otherwise might not have if you were just like, done, merge it.
I think you can't have good observability for AI unless it's embedded in good software observability. You can't just understand your model in isolation. From the moment that you gather your inputs to execute the software, calls out to various APIs, data storage systems, calls to your model, all the way to the point where your user gives you feedback on the read. That is one big trace-shaped problem. Client cardinality matters more than ever. Context matters more than ever.
Instrumentation should also become easier because we should be able to say, "Hey, Claude, get me traces." AI is good at ingestion, cost-saving, and anomaly detection.
We're going to move away from static dashboards and much more towards workflows. I also think that the entire industry is realizing that tight feedback loops are a necessity for building good AI software.
00:00 Introduction to Fearless Deployments
01:37 Understanding the Fear of Deployment
07:13 Shipping Should Be Boring
09:36 Continuous Deployment and Trunk-Based Development
10:38 Prerequisites for Continuous Deployment
12:27 Managing Production Breaks
16:41 The Importance of Blameless Culture
23:43 Code Reviews and Shared Ownership
29:39 Identifying Deployment Problems
32:24 Strategies for Fixing Delivery Issues
37:24 The Future of AI in Observability and Deployments