Latency Variability
When measuring latency, acknowledge that not all requests are equal: response times can vary widely depending on several factors, including load, the complexity of the request, and the number and performance of the sub-systems a request must traverse. As a result, an average response time metric (the sum of all response times divided by the number of responses) can be extremely misleading because outliers skew it.
Therefore, VA recommends tracking latency using percentiles. A percentile reports the value at or below which a given percentage of responses fall, which makes the data easier to interpret in the presence of variability and outliers.
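To see why, here is a minimal sketch (the sample values are made up) comparing the average against percentiles. A few slow outliers dominate the mean, while the percentiles still describe what most callers experienced:

```python
import math
import statistics

# Hypothetical response times in milliseconds; a few slow outliers
# pull the average far above what most callers actually experience.
response_times_ms = [120, 135, 140, 150, 155, 160, 170, 180, 4500, 9000]

def percentile(samples, pct):
    """Nearest-rank percentile: the value at or below which pct% of samples fall."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

print("mean:", statistics.mean(response_times_ms))  # ~1471 ms, skewed by two outliers
print("p50:", percentile(response_times_ms, 50))    # 155 ms, the typical experience
print("p90:", percentile(response_times_ms, 90))    # 4500 ms
print("p99:", percentile(response_times_ms, 99))    # 9000 ms
```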
Guidance
- Track latency using percentiles.
- Use capabilities from the Application Performance Monitoring (APM) tools available to measure latency.
- Be aware of timeouts within the infrastructure the API runs in.
- Be cautious of alert fatigue.
- Monitor each endpoint within the API separately, but have a single Service Level Objective (SLO) for the API.
- Note latency behavior in the API's documentation.
Tracking P50 (50th Percentile), P90, and P99 latency metrics can provide a comprehensive view of an API’s performance from median use to edge cases:
- P50 Latency: Represents the median response time and provides a benchmark for the typical user experience. 50% of requests completed within the given time target and 50% took longer. For example, an SLO might be: 50% of requests must complete in under 500 ms.
- P90 Latency: Identifies performance outliers that are not as extreme as the worst-case scenarios but could still negatively impact a significant portion of users. Here, 90% of requests completed within the given time target and 10% took longer. For example, an SLO might be: 90% of requests must complete in under 1000 ms.
- P99 Latency: Reveals the upper bound of latency experienced by users. Here, 99% of requests completed within the given time target and 1% took longer. For example, an SLO might be: 99% of requests must complete in under 5000 ms. The remaining 1% may indicate an anomaly or unusual traffic worth inspecting.
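As an illustration, the sketch below reuses the example targets above (the thresholds and names are illustrative, not an official configuration) and checks observed latencies against P50/P90/P99 targets using the same nearest-rank approach:

```python
import math

# Example targets taken from the SLO examples above (illustrative only).
SLO_TARGETS_MS = {50: 500, 90: 1000, 99: 5000}

def slo_report(latencies_ms, targets=SLO_TARGETS_MS):
    """Compare observed nearest-rank percentiles against per-percentile targets."""
    ordered = sorted(latencies_ms)
    report = {}
    for pct, limit_ms in targets.items():
        rank = max(1, math.ceil(pct / 100 * len(ordered)))
        observed = ordered[rank - 1]
        report[f"p{pct}"] = {"observed_ms": observed,
                             "target_ms": limit_ms,
                             "met": observed <= limit_ms}
    return report

# Example: slo_report([210, 230, 240, 260, 900, 1200, 4800]) shows which targets hold.
```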
Application Performance Monitoring (APM) tools often provide these metrics out of the box. Use these tools if possible and don’t reinvent the wheel.
When setting the SLO for these metrics, be aware of timeouts set by the gateways, proxies, database connections, or even the application server the API runs within. Some components time out after 10 seconds, others after 15. Additionally, some timeouts within cloud services cannot be changed; for example, AWS API Gateway enforces an integration timeout of roughly 30 seconds, and AWS Lambda caps execution time at a hard maximum regardless of configuration.
Knowledge of these timeouts is important when setting the P99 SLO. Take care not to let the SLO push past the boundaries of the API's infrastructure, which would make it impossible to violate. For example, if the target is that 99% of requests complete in under 10 seconds but a piece of infrastructure, such as the gateway, times out after 7 seconds, the metric is uninformative: no measured request can ever exceed 7 seconds.
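A simple sanity check can catch this early. The sketch below uses made-up component timeouts (none of these values come from actual VA infrastructure) and rejects a P99 target that the tightest timeout would mask:

```python
# Hypothetical timeouts (in seconds) for the components a request traverses.
INFRA_TIMEOUTS_S = {"api_gateway": 30, "reverse_proxy": 15, "app_server": 10, "database": 7}

def validate_p99_target(p99_target_s, timeouts=INFRA_TIMEOUTS_S):
    """Reject a P99 target that can never be breached because a timeout fires first."""
    component, tightest_s = min(timeouts.items(), key=lambda item: item[1])
    if p99_target_s >= tightest_s:
        raise ValueError(
            f"P99 target of {p99_target_s}s is unreachable: "
            f"{component} times out after {tightest_s}s"
        )

validate_p99_target(5)     # fine: 5 s is below the tightest (7 s) timeout
# validate_p99_target(10)  # would raise: the 7 s timeout hides anything slower
```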
To understand whether an API is meeting its expected level of service, set the P99 SLO value so that the vast majority of requests complete within that time frame. For example, if a request can occasionally take longer than 5000 ms but is faster than 5000 ms 99 percent of the time, set the P99 target at 5000 ms. This isolates the outliers, helps identify where improvements can be made, and provides timely feedback if the trend unexpectedly worsens.
When setting SLOs and expectations at the P50 end of the spectrum, choose a value that avoids alert fatigue while still triggering alerts when abnormal behavior begins. For example, suppose the API responds to requests in 350 ms on average. An initial baseline SLO for this API could be:
Within a rolling 30-day period in production: 50% of requests must be under 500 ms, 90% must be under 1000 ms, and 99% must be under 5000 ms.
- 7-day window: Used for active monitoring with stricter, aspirational targets than the SLO defined above, e.g., P50 @ 350 ms, P90 @ 750 ms, and P99 @ 2500 ms. This window helps identify problems early.
- 30-day window: This window tracks compliance with the SLO defined above and is the period used within the SLO.
- 90-day window: Tracks trends over time in the same manner as the availability 90-day window.
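Putting the windows together, a sketch such as the following evaluates each rolling window against its own targets. The targets are copied from the example baseline and aspirational values above; the timestamps and evaluation logic are illustrative, and an APM tool would normally do this for you:

```python
import math
from datetime import datetime, timedelta, timezone

# Per-window targets from the examples above: the 7-day window is aspirational,
# the 30-day window is the SLO of record, and the 90-day window tracks trends.
WINDOW_TARGETS_MS = {
    7:  {50: 350, 90: 750,  99: 2500},
    30: {50: 500, 90: 1000, 99: 5000},
    90: {50: 500, 90: 1000, 99: 5000},
}

def check_windows(samples, now=None, targets=WINDOW_TARGETS_MS):
    """samples: list of (utc_timestamp, latency_ms) pairs; returns pass/fail per window."""
    now = now or datetime.now(timezone.utc)
    results = {}
    for days, percentile_targets in targets.items():
        window = sorted(ms for ts, ms in samples if ts >= now - timedelta(days=days))
        if not window:
            continue
        results[days] = {
            f"p{pct}": window[max(1, math.ceil(pct / 100 * len(window))) - 1] <= limit_ms
            for pct, limit_ms in percentile_targets.items()
        }
    return results
```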
Of course, an API often has one or more outlier endpoints. For more information on handling slow-responding API endpoints, see Handling Latency Outliers.
VA recommends monitoring latency for each endpoint separately so that issues with a particular endpoint can be pinpointed. However, define a single SLO that covers all endpoints in the API rather than one SLO per endpoint.
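One way to reconcile the two is to record samples per endpoint but compute the SLO percentile across the combined set. The sketch below is a minimal illustration; the class and method names are hypothetical, and in practice the APM tooling mentioned above would provide this:

```python
import math
from collections import defaultdict

class LatencyRecorder:
    """Record latency per endpoint, but evaluate the SLO over the whole API."""

    def __init__(self):
        self.samples_by_endpoint = defaultdict(list)

    def record(self, endpoint, latency_ms):
        self.samples_by_endpoint[endpoint].append(latency_ms)

    def endpoint_p99(self, endpoint):
        # Per-endpoint view: useful when troubleshooting a single slow endpoint.
        return self._nearest_rank(sorted(self.samples_by_endpoint[endpoint]), 99)

    def api_p99(self):
        # API-wide view: the single value compared against the SLO.
        combined = sorted(ms for samples in self.samples_by_endpoint.values() for ms in samples)
        return self._nearest_rank(combined, 99)

    @staticmethod
    def _nearest_rank(ordered, pct):
        return ordered[max(1, math.ceil(pct / 100 * len(ordered))) - 1]
```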