Production support in agile / scrum

A common question I see asked is “how do we do production support in agile / scrum?”. This is a harder problem than it might at first seem, because it complicates the job of prioritising work. I see three stages to solving it properly.

What actually is production support?

Production support or maintenance means maintaining a system once it has gone to production, i.e. it has been released to customers. This generally takes the form of fixing production incidents. Something goes wrong and you need to do a production change. We can divide these incidents into two camps:

  • Something has suddenly gone wrong and a service is no longer working (or no longer working to an acceptable level)
  • A bug has been discovered and is affecting customers.

Two types of production issues

The first type is generally caused by some kind of temporary infrastructure problem: a server falls over, or runs out of memory, a message queue gets blocked, or a network switch gets overloaded.

They are usually non-deterministic (sometimes I try and log in and I get an error and sometimes I don’t), come quickly and go away quickly (restart the server, recycle the application pool). They don’t involve bugs in the code so they don’t require a code fix.

The second type is more serious: there’s a bug in the code and it requires a code fix. These are usually deterministic (it happens every time you try and reproduce it) and they don’t go away quickly (you need to prepare, test and deploy a code fix). How easy the second type is to fix will depend on the Continuous Delivery maturity of your application.

Who does these fixes?

The first type is usually fixed by an Operations team that maintains the infrastructure on which your application runs, although this depends on the degree to which your organisation has embraced DevOps.

The second type is fixed by a software team. But which one? And how? There are a few approaches.

You could use a dedicated maintenance team

I actually don’t recommend this, for a few reasons.

Firstly, the best people to fix something are the people that built the thing. Nobody knows it as well. They won’t need to go frantically looking through documentation to find out what has gone wrong.

Secondly, nobody wants to be on the team. You either farm the work out to junior people (who you probably don’t want fixing mission-critical software), or you rotate the role. But that is disruptive to team cohesion and morale.

Thirdly, if you know you won’t be supporting something, it can encourage sloppy work and technical debt. And we don’t want that.

You build it, you fix it

I recommend the “you build it, you fix it” rule. A team has to own what it built, because it knows the system better than anyone.

But how are you supposed to get and stay on top of these issues? I see it as a three-stage process.

Production support in agile phase 1: Track and prioritise

The first thing you need to do to tackle the problem is to clearly identify the problem. You need to log incoming production issues in some sort of bug-tracking system, with appropriate details, priority, etc. You need to make them visible (to the team and to stakeholders).

Your product owner also needs to prioritise these alongside the work the team is currently doing. It can be easy to lose sight of the plan if you keep jumping on every issue that comes in without triaging it and making a clear priority decision. Scrum involves continuously reprioritising work to maximise the long-term value and ROI of the product, and production issues are part of that work.

Of course, if your team is doing pure maintenance, there is no other work the team is doing to prioritise against. But this is not a good plan in the long term.

If you are clearly tracking and prioritising your incidents, you’re on the right track. Make sure that the product owner is regularly reviewing this list and prioritising the incidents not just against each other, but against other in-flight work (“fixing this defect is more important than getting that user story completed”).

Should you use two backlogs?

Some teams have two product backlogs: one for feature development, and one for maintenance. I don’t recommend doing this. A product should have one backlog, even a big product. Having multiple backlogs confuses things, makes it harder to make prioritisation decisions, and can ruin some of your metrics.
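To make the single-backlog idea concrete, here is a minimal sketch in Python. The `BacklogItem` class, its fields and the example items are all hypothetical; the only point is that stories and defects sit in one list and are ordered by one priority value set by the product owner.

```python
from dataclasses import dataclass


@dataclass
class BacklogItem:
    title: str
    kind: str      # "story" or "defect" (hypothetical labels)
    priority: int  # lower number = more important, set by the product owner


def ordered_backlog(items):
    """Return one ordered list, regardless of whether an item is a story or a defect."""
    return sorted(items, key=lambda item: item.priority)


backlog = [
    BacklogItem("Export report as PDF", "story", 3),
    BacklogItem("Checkout fails for saved cards", "defect", 1),
    BacklogItem("Add dark mode", "story", 4),
    BacklogItem("Typo on invoice footer", "defect", 5),
    BacklogItem("Single sign-on", "story", 2),
]

for item in ordered_backlog(backlog):
    print(item.priority, item.kind, item.title)
```

With a single list like this, a critical defect naturally lands above a story, and a cosmetic defect lands below one, without anyone having to reconcile two separate queues.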

How should you action the issues?

Production defects and incidents should generally go into a sprint as part of sprint planning, but sometimes you should jump on them as soon as they arise (if they are very nasty). Make sure you do not estimate defects or earn points for them. I explained why in this article on estimating defects.

Another unusual approach is to not create new issues, but add the defects as acceptance criteria on the user stories that were done previously. I don’t advise you do this.

Once a story has been completed, it is done. It should not be brought back from the dead. Any changes to that feature or component should be done as a new product backlog item.

What about a maintenance sprint?

Some teams just defer all the issues and do a “maintenance sprint” later on. This is a terrible idea. All issues should be triaged, and if they are critical, they should be done right away. Also, if you are doing Scrum, the team should be building a new product increment each sprint. Fixing bugs from months ago does not count as building a new product increment.

Production support phase 2: Own the stack

The next stage is to “own the stack”, which means getting your team to be responsible for as much of the stack as you can. To ensure you can resolve issues quickly and the team can build up complete knowledge of their own system, you want a fully cross-functional agile team.

If you need to hand defects over to a DBA team or an Integration Services team or a Security Services team, your cycle time is going to rise and you’re going to suffer from a lack of control and ownership. Build up the skills, knowledge and ownership in the team until you control the entire slice of the application. Eventually, you want your team’s remit to include not just design / test / build but also run.

This is part of (but not all of) the journey to DevOps: the people who build the application also run the application. Doing this will involve overcoming technical and organisational hurdles (mainly organisational), but it is necessary. It will further reduce handovers, documents and ambiguity.

You also need to make sure your team are performing proper RCA (Root Cause Analysis) on these issues, and doing PCAs (Permanent Corrective Actions), rather than just quick hacks to fix symptoms.

Production support phase 3: Optimise the stack

Once your team has ownership of the stack, you need to ruthlessly optimise it, by putting in extensive amounts of automated tests, both functional and non-functional. The key principles are:

  • The existing feature set is covered by a full set of automated regression tests
  • No new features get in without full automated test coverage
  • Every time a defect is fixed, at least one test is put in place for it
  • The team aggressively pays down technical debt.
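As an illustration of the “at least one test per fixed defect” principle, here is a small sketch in Python. The `apply_discount` function and the defect it describes are invented for this example; the test functions follow the `test_` naming convention, so a runner such as pytest would collect them.

```python
def apply_discount(total, discount):
    """Hypothetical function that once had a production defect:
    a discount larger than the total produced a negative amount."""
    # The fix: never let the amount payable drop below zero.
    return max(total - discount, 0)


def test_discount_larger_than_total_is_clamped_to_zero():
    # Reproduces the (hypothetical) production defect so it cannot silently return.
    assert apply_discount(10, 15) == 0


def test_normal_discount_still_works():
    # Guards the existing behaviour that the fix must not break.
    assert apply_discount(100, 15) == 85
```

The first test is written from the incident itself: it fails against the old code and passes against the fix, which is what makes it a regression test rather than just another unit test.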

If you do this properly, you will achieve two important benefits:

  • You will be able to push changes out to production very quickly with low cost and risk
  • Your defect/production incident counts will approach zero.

To clarify: your incidents from code bugs (the second type) should approach zero. You might still have the first type of incidents, resulting from weak or faulty infrastructure. Solving that is a story for another day!

Agile Production Support Tips and FAQs

Why shouldn’t we estimate the defects / incidents? Because they’re not user stories, they don’t represent value, and you can’t really estimate them anyway.

How do we prioritise the defects? That’s up to your product owner. Some will be higher priority than some stories, some will be lower. It is decided case by case.

How do these issues fit in with Kanban? They are generally treated like any other work item. You may want to have specific WIP (Work In Progress) limits for them, however, to ensure teams are not taking on too many at a time.
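A WIP limit per work item type could be checked with something as simple as the sketch below. The limits, the type names and the `can_start` function are assumptions made for illustration, not part of any particular Kanban tool.

```python
from collections import Counter

# Hypothetical limits: at most 2 defects and 4 stories in progress at once.
WIP_LIMITS = {"defect": 2, "story": 4}


def can_start(kind, in_progress):
    """Return True if starting one more item of this kind stays within its WIP limit."""
    current = Counter(in_progress)[kind]
    return current < WIP_LIMITS[kind]


print(can_start("defect", ["defect", "story"]))            # True
print(can_start("defect", ["defect", "defect", "story"]))  # False
```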

When do we prioritise defects and incidents? You should do this at sprint planning. You might sometimes get an emergency incident that requires an immediate response, however.

What is the difference between a defect and an incident? An incident is whenever something breaks in production. A defect is when there is a bug in the code. Most incidents are caused by defects. That is, there is a bug in the code, and it made an application behave in a way that it was not supposed to. But you can have incidents that are not caused by bugs, e.g. a server has a bad day and falls over. You can also have defects that do not correspond to production incidents. You might have found the bug before it went to production, or maybe it went out to production but nobody has encountered it yet.

How can we do sprint planning when we have defects with no estimates? People get confused about this, but it’s not hard. Look at how many points you completed last sprint, taking into account how many defect fixes that sprint included and how many you have this sprint. Assume for simplicity’s sake that each defect is as big / hard as any other. Say last sprint you did 60 points and had four defects included in the sprint plan. This sprint you have two defects, so you plan for 65 or 70 points instead of 60.
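The arithmetic in that example can be sketched as follows. The function name and the `points_per_defect` parameter are assumptions for illustration; the values 2.5 and 5 are simply what the 65 to 70 range implies when two fewer defects are planned.

```python
def planned_points(last_points, last_defects, this_defects, points_per_defect):
    """Adjust last sprint's completed points for the change in defect count.

    points_per_defect is a rough, assumed average of the capacity one defect
    consumes; defects themselves are still not estimated individually.
    """
    return last_points + (last_defects - this_defects) * points_per_defect


print(planned_points(60, 4, 2, 2.5))  # 65.0
print(planned_points(60, 4, 2, 5))    # 70
```

If this sprint had more defects than the last one, the same calculation would lower the plan instead of raising it.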
