Monitoring SLOs
Most Application Performance Monitoring (APM) tools offer SLO dashboards out of the box; where they do not, API teams can create custom dashboards to visualize SLO compliance.
Datadog, shown below, has a built-in SLO feature that lets teams set 7-, 30-, and 90-day windows and track one or more SLIs within them. It also calculates an error budget.
Error Budgets
An error budget can help a team balance feature development and service quality. Quality should always be a fundamental requirement for new features, and feature development should not come at the expense of quality.
The error budget shows how much quality has been sacrificed in a given time window and how much budget is left. This data helps the team align priorities. If the API is close to exhausting the error budget, engineering efforts should focus on improving reliability and performance, with correspondingly less emphasis on feature development.
Conversely, having plenty of headroom in an error budget does not mean the team should de-prioritize quality and rush to implement new features. Instead, it indicates that the team has successfully managed quality alongside feature development.
Calculating Error Budgets
If tooling such as Datadog is unavailable, teams can manually calculate their error budgets.
If the availability target is 99.9% over a 30-day window, the allowed unavailability is 1 - 0.999 = 0.001 of that window, and the error budget can be calculated as:
Total time (minutes) = 30 days × 24 hours/day × 60 minutes/hour = 43,200 minutes
Error Budget (minutes) = 43,200 minutes × 0.001 = 43.2 minutes ≈ 43 minutes
In this case, the rounded error budget would be 43 minutes. Teams can then track the downtime that has already occurred in the measurement window to determine how much of the budget has been spent and how quickly it is being consumed (the burn rate).
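The same arithmetic can be sketched in a few lines of Python. This is only an illustration: the function name `error_budget_minutes` and its parameters are hypothetical, and it assumes the SLO is expressed as a fraction (0.999 for 99.9%) over a window measured in whole days.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Return the allowed downtime, in minutes, for an availability SLO.

    slo_target is a fraction such as 0.999 (hypothetical parameter name).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    if window_days <= 0:
        raise ValueError("window_days must be positive")

    # Total minutes in the window: days x 24 hours x 60 minutes.
    total_minutes = window_days * 24 * 60

    # The budget is the share of the window that may be unavailable.
    return total_minutes * (1.0 - slo_target)


# The 99.9% / 30-day example from the text.
budget = error_budget_minutes(0.999, 30)
print(f"Error budget: {budget:.1f} minutes")  # prints "Error budget: 43.2 minutes"
```

Calling the same function with 7 or 90 days gives the budget for the other window lengths mentioned earlier.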
The error budget left for the example above, where Spent_Budget is the downtime already accrued in the window, would be:
Error_Budget_Left = Error_Budget - Spent_Budget
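As a rough sketch, the remaining budget and a burn rate could be tracked as follows. The function name, its parameters, and the sample figures (12 minutes of downtime, 10 days elapsed) are hypothetical. The burn rate here uses one common definition, which is an assumption rather than something the text specifies: the fraction of budget spent divided by the fraction of the window elapsed, so a value above 1.0 means the budget would run out before the window ends.

```python
def budget_status(
    error_budget: float,
    spent_budget: float,
    elapsed_days: float,
    window_days: float,
) -> tuple[float, float, float]:
    """Return (minutes left, fraction of budget spent, burn rate)."""
    if error_budget <= 0 or window_days <= 0 or elapsed_days <= 0:
        raise ValueError("budget, window, and elapsed time must be positive")

    # Error_Budget_Left = Error_Budget - Spent_Budget
    budget_left = error_budget - spent_budget

    # Share of the budget already consumed.
    spent_fraction = spent_budget / error_budget

    # Assumed definition: budget consumption relative to time elapsed.
    burn_rate = spent_fraction / (elapsed_days / window_days)

    return budget_left, spent_fraction, burn_rate


# Hypothetical figures: 12 minutes of downtime, 10 days into a 30-day window.
left, spent, burn = budget_status(43.2, 12.0, elapsed_days=10, window_days=30)
print(f"Left: {left:.1f} min, spent: {spent:.0%}, burn rate: {burn:.2f}")
```

A negative `budget_left` indicates the budget has been overspent, which is the signal to shift effort toward reliability work.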