Heroics Shouldn't Be Necessary

If they are, something is wrong.

Feb 19, 2024

Tl;dr

Existence of heroes is great
Situations that call for them should be rare
If you celebrate heroes, you encourage heroics - careful what you wish for
If you require heroics, something else more fundamental is wrong with your company
Hold your leaders accountable for results
Create a culture that enables heroes, but create the conditions that make heroics unnecessary

We All Need A Hero Sometimes

Everyone loves a good hero. They swoop in when times are bleak and save the day. They are then celebrated for their unfailing dedication to their team and the people around them, and appreciated for all the sacrifices they made.

Heroes are people who, against all odds, prevail and win the day. That is, no matter the challenges placed before them, they endure through hardship, toil, and loss. Heroes are great people to have in your company and on your team. Ideally anyone on your team should be capable of stepping up, if the situation calls for it, to be a hero. Your team should be confident, efficient, and able to deal with a good disaster, be it self-inflicted, externally caused, or just plain bad luck.

In this article we’ll look at a couple of different types of heroics and some of their causes (and how to deal with them). After that we’ll dig into the long term effects of heroics, and lastly cover some basic strategies and practical steps you can take to mitigate the necessity for heroics.

But first…let’s be very clear about one thing.

Heroics Mean Something Went Wrong

When a hero emerges, it means something has gone wrong and we shouldn’t ignore that just because we love a good hero. In the tech world this is usually in response to some incident or a long haul, overtime slog.

When heroics happen you MUST question why they occurred. I’ve referenced post mortems elsewhere (I gave a talk on it at Upside!). One of the key byproducts of a post mortem or retrospective is acknowledgement and identification of what necessitated the heroics. The documentation you generate should be searchable and easily discoverable. As post mortems accumulate, the leadership team should ask themselves whether any patterns are emerging.

These retrospective habits must be exercised regularly so that the team can identify why heroics were necessary.

Most types of heroics fall into two buckets: Incident Heroics and Slog Heroics.

Incident Heroics

These sudden problems could manifest as app outages, system disruptions, or myriad other issues that can plague a tech company. The technical root causes are varied but the cultural and business causes are more easily categorized. The symptom is that an unexpected issue has occurred that a team must scramble to address. The longer it goes on the greater the impact to the business. These are relatively quick, highly visible across the company, and extremely high stress. There are plenty of articles out there on how to do incident management (I will add my thoughts to the pile one day as well) and you should do a search and establish guidelines BEFORE these happen.

The Slog

Slogs, or “death marches”, are generally caused by poor leadership with unrealistic expectations. These are usually instigated by business pressure manifested into a deadline without consultation from engineering. They are depressingly common. They get worse when engineering is consulted and then ignored without concession. A project plan may be developed as an attempt to rationalize the effort and give executives/leadership the signal that it will happen on time and on budget.

Once a slog has started the pressure builds slowly as the deadline looms closer and closer. More experienced, seasoned ICs (individual contributors) will see the writing on the wall. ICs are eventually asked to work extra hours to keep project milestones. That “ask” shifts over to a requirement. Milestone deadlines may be missed but the final deadline stays the same. Eventually everyone on the team is working overtime for weeks on end and all the bad stuff begins to build up. We’ll discuss some of this in our “Aftermath” section below.

Causes of Heroics

We’ve described two types of heroics, the short term (Incidents) and the long term (Slogs). The causes of heroics can also be categorized, and we classify those into three different buckets (it’s buckets all the way down): The External Event, Tech Debt, and Poor Leadership and Planning.

The External Event

These particular root causes are ones that are truly beyond your control. Maybe an AWS data center gets hit by a small meteor and your service goes down for a while until AWS response teams take action. These incidents usually result in a response by heroes who then twiddle their thumbs as they realize they can’t do much except provide status updates to the rest of the company. If you manage an incident and this is the situation, you should stop the thumb twiddling, ask questions about what recovery looks like, and prepare recovery steps. As another example, a sudden influx of traffic can inadvertently turn into a self-inflicted DoS.

What should you do then?

Well, as the stoics say, you can’t control external events but you can control how you respond. Once the root cause is identified as an event beyond the control of the team, while everyone is spun up on the problem, it’s worth asking the team how we could build something that would be more resilient to the kind of external event that just happened. For the above meteor example, maybe it’s ensuring your services are deployed and ready in another availability zone so that one particular data center really doesn’t matter that much. Resilience of your systems to external events should be up for consideration as a priority.

Afterwards, heroes should be spun down and refocused as much as possible back to their regular work.

Tech Debt

Tech debt based heroics are usually the result of a slow, incremental build up of sludge. The direct cause is a series of business and technical decisions that make heroics more and more necessary. Tech debt can be the primary cause of both incident heroics and slog heroics. A tech debt based incident may be the sudden outage of a service as it tips over. Tech debt may also exacerbate an incident by slowing response time or increasing the required complexity of the fix.

Tech debt, as sludge, is the sort of stuff that increases the likelihood of project heroics being a requirement to execution. Perhaps there is one person on your team that knows how to navigate the swamp and they have to do it over and over again for a project, product, or feature to ship. Each time they bring a crew with them but they don’t have time to show them the path, just to get them to the other side.

What should you do then?

The standard response is to segment some percent of time out of product priorities and work on it in that small carve out. This, in my experience, is necessary but wholly insufficient. Product management needs to understand the issues tech debt is causing and invest their own time and effort into the remedy. Tech debt is not an issue that should be relegated as an engineering only concern. It is a business concern and should be labeled, tracked, and prioritized as such.

If it doesn’t already, engineering needs to be able to speak product/business language to get its concerns prioritized. That means gathering data and converting it to business impact. This may mean tracking KPIs, OKRs, SLAs, or any combination of metrics methodology that will allow engineering and product to converse and prioritize business concerns together. You may also be surprised to realize that the debt that is causing tons of noise actually isn’t so bad, especially when compared to a business opportunity.

When you work with product management to solve these issues, broader solutions are much easier. Product management will have the ties into operations or other parts of the business to get the long term maintenance of solutions off the backs of engineering. This is just one small reason to involve product management with the long term solution to tech debt.

Poor Leadership and Project Planning

This is a combination of two big issues but they tend to manifest in a similar manner. Oftentimes, when one or both of these are a problem, unrealistic deadlines or business outcomes are set. Those deadlines are then baked into company estimates and become hard locked with company OKRs. They become inviolable and the death march begins. Usually there is some degree of push back on deadlines or the actual output of the systems, but it is overridden with a combination of “do or die” rallying cries or other forms of bullying. This push back eventually stops as psychological safety exits the leadership team, and that loss of safety can become endemic across an entire company.

Another manifestation of this (which also contributes to tech debt!) is the shiny object problem. Leadership does not allow sufficient time for their product teams to finish, deploy, and polish their features or products and demands that they move on. This leaves a trail of partially finished efforts that linger on in proof of concept phases and create a massive maintenance burden.

What should you do then?

Long term, solving these sorts of issues requires a full culture shift. You cannot tackle that in one big bite; you need to think short term and make incremental steps towards better. Demonstrate the value of the new approach until it becomes undeniable.

Short term you can identify key points of ownership and stand your ground on them. I did this by firmly stating that engineering owns delivery dates and all other dates should be deleted. It took a long time and a ton of pain, but as more and more EMs stood their ground, product actually saw the benefit. Engineering leadership rallied around this concept and we were able to wrest that one field in a planning spreadsheet from everyone else. Engineering suddenly owned planning and scoping, and could be an active partner in these pieces. Projects started getting delivered on time. Pretty quickly it became the norm and the top level culture shift began.

The Aftermath

So some series of heroics have occurred. What’s next? There is the obvious potential “destruction” and build up of tech debt that might wind up looking a bit like the Ruined City Image listed in the References.

While that is an image of what New York looks like after the Avengers “rescue” it, the metaphor stands. Beyond the physical destruction wrought, there is damage to emotions, to morale, and to others caught in the blast radius of these sorts of heroics (another excellent book, Hench, captures this nicely).

If you don’t get control of the situation, three things follow: Burnout, Loss of Trust, and eventually Apathy. There are the more obvious cleanup steps and business impact, but these three can turn into morale black holes from which you cannot escape.

Burnout

As heroes are called on again and again, the most pervasive danger is that of permanent burnout for the heroes themselves. If you ignore the limits of the heroes and keep calling upon them, they will hit those limits. Heroes are, after all, human. At a certain point people will hit their breaking point, but even before that their performance is impacted. They begin the process of burning out. There have been deep studies on burnout which illuminate the medical and psychological effects. The root cause, however, lands on how the employees are treated by their managers and leaders. The end result is the same: attrition.

Loss of Trust

Beyond just the immediate blast radius of the individuals involved in heroics are all the employees who observe. People talk and converse and heroics are a favorite subject at the metaphorical watercooler. They are full of drama, have twists and turns, high stakes, and contain everything that gets us to tune into any story. What’s distinct from the stories that we may watch in an evening binge is that these stories directly impact the employees. They may ask any of a series of questions such as:

Am I next?
Why don’t we fix this?
Do those teams suck at their jobs?
How did we get into this situation?
Why are we working on this project if we can’t even keep our core systems running?

These sorts of questions and challenges are wholly reasonable and sensible for individuals to ask. If you have a strong frontline management team they will be able to address them and buy leadership time to resolve any endemic issues. If you run out of time or do not have a strong frontline team trust begins to erode. You will also notice that some of the loss of trust here is not pointed squarely at leadership but can be directed at teams or divisions. These sorts of team to team losses of trust are a step in erosion. Once this erosion begins it is difficult to stop and it will compound upon itself until trust is completely gone.

Apathy Setting In

Wiser people have pointed out that the opposite of love is not hate, it’s apathy. Apathy is usually the result of burnout but can also be a result of watching others engage in heroics while no real solutions are pursued. In short, apathy sets in when there is no accountability. Lessons need to be learned and the company may need to change. The worst thing about heroics being successful every time is that it’s possible no lessons are learned. As your business (hopefully) succeeds, the stakes increase: your user counts, the money you manage, and the contracts you sign all contribute to bigger consequences when there inevitably is a failure. Large companies can get by with people who don’t have passion, who are happy being cogs. Smaller companies and especially startups cannot afford apathy. It will kill them.

This is less likely to happen for incident response heroics and more likely to occur for death march or last minute project implementation. If the team is rallied around a problem and heroes step up to make everything happen, it is critical that the outcome matters. Business outcomes must be representative of the effort that was just poured into the project. If they aren’t, there is a huge loss of trust that may sweep across the organization. This wave will NOT be limited to just the heroes but will potentially reach everyone that observed the heroics. If the business outcomes are not achieved there must be accountability for the individuals who led the team into the battlefield.

Best Intentions and Bad Outcomes

I recently listened to a LinkedIn livestream with Charity Majors and she made the great point that the praise you give is equivalent to encouragement for that behavior. When you praise heroics, you implicitly encourage them. It’s part of human nature. When people with fancy titles publicly laud a group of people for heroics, that group will do it again. This sort of praise can have a pernicious effect when it comes to promotion and compensation time. Rather, we should celebrate big things done quietly. The engineering feats that go off without a hitch should be those that are lauded.

If heroics are standard, it’s indicative of big problems.

Address The Problem, Not The Symptom - Diagnosis and Changes

Diagnose the Problem

If and when a pattern is discovered, it is cause for a deep hard look at the current priorities of the company. If patterns begin to emerge and change is not made, you risk heroics becoming normal operating procedure and bad things begin to manifest…

Opportunity Cost

Heroics take resources. Those resources are most easily calculated in human hours spent on the task at hand. Simply put, there is an opportunity cost that gets appended to every heroic action. Your frequent heroes are usually your best people. When they are tasked to perform heroics, what could they have been doing instead of saving the day? If Iron Man wasn’t fighting aliens maybe he could have built a couple of schools.

Change the Situation

One of Upside’s founders, Tom Vaughan, had a great saying that made its way into engineering culture. Don’t Be On Fire. It sounds simple but is actually quite difficult. What it means is that you need to manage your alarms and reduce the noise. Prioritize putting out fires, fixing bugs, do not let yourself be on fire! We also labeled this our “Rule 0 OKR” and prioritized it for engineering above everything else.

We compiled data on all of the alerts that were firing and got a sense of how much engineering time was being lost to false positives as well as real alarms. We were able to break this down into how many engineers didn’t do product work due to alarms. When we presented this to our product management team (we spoke their language, not engineering’s!) they immediately agreed that we needed to prioritize it.
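The kind of tally described above can be sketched in a few lines. This is only an illustration, not the analysis Upside actually ran: the `Alert` record, its fields, the sample numbers, and the 40 hour week are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    """One fired alert (hypothetical record shape, not from the article)."""
    name: str
    minutes_spent: float   # engineer time spent responding
    false_positive: bool   # True if there was nothing to fix


def summarize_alert_cost(alerts, hours_per_engineer_week=40.0):
    """Convert raw alert handling time into numbers product can prioritize."""
    total_hours = sum(a.minutes_spent for a in alerts) / 60
    false_hours = sum(a.minutes_spent for a in alerts if a.false_positive) / 60
    return {
        "total_hours": total_hours,
        "false_positive_hours": false_hours,
        # Assumes a 40 hour week; adjust to your own planning unit.
        "engineer_weeks_lost": total_hours / hours_per_engineer_week,
    }


# Made-up sample data for illustration only.
alerts = [
    Alert("disk-usage", 90, True),
    Alert("api-latency", 240, False),
    Alert("queue-depth", 150, True),
]
summary = summarize_alert_cost(alerts)
print(summary)
```

The point of the last field is the translation step: "engineer weeks not spent on product work" is a number a product manager can weigh against a roadmap item, where "number of pages" is not.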

Every team was then given a goal to reduce alarm fatigue and guidance on how to address each one of these alarms. The engineering teams were given space to make a concerted effort to deal with the alarms, and after two quarters we had eliminated the vast majority of false positives, fixed numerous long standing bugs, and made alarms based on anomaly detection vs arbitrary numbers.
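To make the difference between "arbitrary numbers" and anomaly detection concrete, here is a minimal sketch. The article does not say which technique was used; the z-score check below, the function names, the limits, and the sample data are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def fixed_threshold_alarm(value, limit=100.0):
    """Arbitrary-number alarm: fires whenever the value exceeds a fixed limit."""
    return value > limit


def anomaly_alarm(history, value, z_limit=3.0):
    """Fires only when the value is far from this metric's own recent behavior.

    Uses a simple z-score against recent history (an assumed technique).
    """
    if len(history) < 2:
        return False  # not enough data to judge what is normal
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_limit


# Made-up sample data: one service that normally runs hot, one that runs quiet.
busy_service = [118, 122, 120, 119, 121, 120]
quiet_service = [10, 11, 9, 10, 10, 11]

# A normal reading on the busy service trips the fixed limit (a false positive)...
print(fixed_threshold_alarm(121), anomaly_alarm(busy_service, 121))
# ...while a real spike on the quiet service slips under it.
print(fixed_threshold_alarm(40), anomaly_alarm(quiet_service, 40))
```

A fixed limit is either noisy for services that normally run hot or blind for services that normally run quiet. Judging each metric against its own history is one way to cut the false positives that feed alarm fatigue.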

This was a relatively simple situation to measure and assess, but we managed to make an enduring difference to the culture of engineering as a result. We spoke product’s language, we made it business relevant, and we had a clear strategy on how to address it. We then followed up and held individuals accountable for executing their part.

While this example is very specific to Upside’s situation as it was several years ago, it was a key win for engineering. It allowed us to see what was possible when we worked with our product and business partners to improve not just engineering best practices but also our overall output as an organization. Find your partners, get them on board, seize control of the situation and do it, not just for engineering, but for the health of your company.

References

This Is Fine web comic - https://gunshowcomic.com/648

Hench - https://en.wikipedia.org/wiki/Hench_(novel)

Ruined City Image - https://scifi.stackexchange.com/questions/65080/has-new-york-been-repaired-to-its-state-before-the-avengers

Nathan Coleman

I enjoyed this. I appreciated how it was long-form instead of a more cursory survey of the topic.

What struck me was "Heroes are people who, against all odds, prevail and win the day." "Against all odds" means that it will often fail. If the first and last line of defense is essentially a probabilistic and unlikely "solution," then it's simply going to fail a chunk of the time.

I also find "hero" to be a patronizing and manipulative term in a lot of corporate contexts. Working overtime to keep up the backup generator for a hospital, sure. Missing your daughter's recital to keep widgets.io selling a subscription service to ai.tech, not so much. It creates a framing that makes the dissolution of important boundaries more palatable.

I will admit that with a lot of tech roles, it's essentially priced in—we have a lot of privilege, and occasionally getting ramped up to keep a service running is a fair enough tradeoff. I think the "hero" mentality is significantly more harmful in the service industry, without reasonable compensation, e.g., a lot of the talk of heroes and frontline workers in the pandemic without more meaningful, structural support.

Excellent post and I look forward to seeing more from you!
