Software Alerting Maturity Levels

Over the years I have been part of several incident responses in production and pre-production environments for Software as a Service (SaaS) products. I have observed significant variation among teams in their alerting setup and processes. After some thought, I came up with the alerting maturity levels below, ordered from lowest to highest.

1. The team doesn't have any monitoring dashboards for its services in production or pre-production environments. Essentially, monitoring is considered low priority because the team is "too busy" building new features. I have seen this trade-off in startups still trying to find product-market fit. Another reason for this situation is that the team has "luckily" not had any production incidents yet, perhaps due to low usage or less complex software. Once a company starts to have real users, it's time to invest in setting up monitoring dashboards.

2. The team has monitoring dashboards for the production environment only, but not for pre-production or internal environments. This leads to a complete lack of visibility into the state of internal environments, so issues are found much later, in production instead of pre-production. The easy fix is to set up the same monitoring dashboards across all pre-production and production environments and keep them as consistent as possible. A best practice is to define dashboards in YAML, JSON, or some other configuration format so they can be applied consistently across environments. Essentially, this is monitoring as code: changes to dashboards are peer reviewed just like product or infrastructure code, as in the sketch below.
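As a hypothetical illustration, assuming Grafana is the dashboarding tool, a small provisioning file like the one below can load peer-reviewed dashboard JSON from the repository so the same dashboards are deployed to every environment. The file path, provider name, and mount location are made up for this sketch.

```yaml
# grafana/provisioning/dashboards/dashboards.yaml (hypothetical repo layout)
# Loads dashboard JSON files that are checked into version control and
# peer reviewed, so pre-production and production render identical dashboards.
apiVersion: 1
providers:
  - name: service-dashboards        # arbitrary provider name for this example
    type: file                      # read dashboards from files on disk
    disableDeletion: true           # prevent dashboards from being deleted in the UI
    updateIntervalSeconds: 30       # re-read the files periodically to pick up changes
    options:
      path: /etc/grafana/provisioning/dashboards   # where the dashboard JSON is mounted
```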

3. The team has monitoring dashboards but no alerting. This is like eating a half-baked cake - you know the taste :) . No human can or should be expected to continually watch dashboards. Proper alerting ensures that when metrics deviate, the team gets notified and acts on the alert. As a start, a team can set up alerts for when error rate and duration exceed a certain threshold (for example, a value greater than 3 sigma) over a certain time range (e.g. the last 5 minutes), as in the sketch below.
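For instance, assuming a Prometheus-style metrics and alerting stack, rules along the following lines fire when the error rate or p99 latency exceeds a threshold over the last 5 minutes. The metric names, job label, and thresholds are placeholders for this sketch; a team could equally derive the thresholds from a 3-sigma baseline as described above.

```yaml
groups:
  - name: example-service-alerts         # hypothetical service name
    rules:
      - alert: HighErrorRate
        # Fraction of 5xx responses over the last 5 minutes exceeds 5%.
        expr: |
          sum(rate(http_requests_total{job="example-service", status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total{job="example-service"}[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% for the last 5 minutes"
      - alert: HighLatency
        # 99th percentile request duration over the last 5 minutes exceeds 500 ms.
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{job="example-service"}[5m])) by (le)
          ) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency above 500 ms for the last 5 minutes"
```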

4. Alerting is set up but not working as expected. This happens when alerting rules are configured incorrectly. Some scenarios include using the wrong units, alerting on the wrong metric, or the alert firing correctly but notifications never reaching the reporting tools (Slack, PagerDuty, etc.). Good code review practices and exercising the end-to-end alerting workflow generally catch such cases; a sketch of the notification wiring follows.
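To make the delivery side concrete, here is a minimal sketch of an Alertmanager configuration (assuming the same Prometheus/Alertmanager stack as the previous sketch) that forwards alerts to Slack and PagerDuty; the webhook URL, channel name, and routing key are placeholders. Exercising the path end to end, for example by firing a harmless test alert and confirming it arrives in both tools, is what catches the "alert fired but nothing showed up" class of bugs.

```yaml
# alertmanager.yml (minimal sketch; secrets shown as placeholders)
route:
  receiver: team-notifications       # default receiver for all alerts

receivers:
  - name: team-notifications
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE/ME   # placeholder webhook URL
        channel: '#team-alerts'                                # hypothetical channel
        send_resolved: true          # also notify when the alert clears
    pagerduty_configs:
      - routing_key: REPLACE_WITH_EVENTS_API_V2_KEY            # placeholder key
```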

5. Over-alerting. The team has set up too many alerts in an attempt to detect as many problems as possible. This typically happens when the team is inexperienced with alerting, and it leads to team members tuning out the alerting channels entirely. Over time, teams learn to fine-tune alerts to an optimal level, for example by grouping and rate-limiting notifications as sketched below.
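One common tuning lever, again assuming Alertmanager, is to group related alerts and rate-limit repeats so a single incident produces one notification instead of dozens. The values below are illustrative, not recommendations.

```yaml
# Part of alertmanager.yml: group and rate-limit notifications to cut noise.
route:
  receiver: team-notifications
  group_by: ['alertname', 'service']  # one notification per alert name and service
  group_wait: 30s                     # wait before sending the first notification for a group
  group_interval: 5m                  # wait before sending updates about the same group
  repeat_interval: 4h                 # re-notify about unresolved alerts at most every 4 hours
```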

6. Alerting is triggered but no one is watching the alerts channel. The on-call engineer could be in meetings, on vacation without a backup assigned, or, even worse, may not have been trained to keep an eye on the alerting channel.

7. An alert is triggered and seen, but for some reason it is disregarded. Team members confuse it with another, lower-priority issue and don't investigate further. This happens when the on-call engineer is also burdened with regular sprint duties. Another reason is that 24x5 on-call support is not set up or clearly defined, leading to delayed action on alerts.

8. An alert is triggered and the team takes quick action. Finally, we are getting to where teams need to be. The on-call engineer investigates the alert and takes the necessary action to mitigate and resolve the issue as appropriate. There is a proper escalation chain so that more specialized team members can be involved as needed; a sketch of severity-based routing follows.
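As a sketch of part of such a setup, still assuming Alertmanager, alerts can be routed by severity so that only critical alerts page the on-call engineer while everything else lands in Slack. The receiver names are placeholders assumed to be defined in the receivers section, and further escalation to backup or specialist responders would typically be configured in the paging tool's own escalation policy.

```yaml
# Part of alertmanager.yml: route critical alerts to the pager, the rest to Slack.
route:
  receiver: team-slack                # default: non-critical alerts go to Slack
  routes:
    - matchers:
        - severity="critical"
      receiver: team-pagerduty        # critical alerts page the on-call engineer
```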

9. The team evaluates alerting as part of each new feature and each escaped bug. Alerting setup and process have become part of the team's culture. The team considers alerting during the design phase of new features and has a good retrospective process for escaped bugs. This ensures that alerting is always kept up to date and optimal, and helps deliver the high reliability that customers have come to expect. It also keeps Mean Time to Identification (MTTI) and Mean Time to Resolution (MTTR) as low as possible.

In conclusion, effective and efficient alerting is a journey. As a software team or company grows and supports more users, alerting becomes more important. You can use the levels above to assess where you are and where you want to be in your alerting journey, and then make the corresponding investments in alerting tools, infrastructure, process, and culture.
