What to Measure

SLIs, as defined in the introduction, should be indicators that provide insight into how our customers perceive the reliability of VA APIs. An SLI defines what is measured to meet an SLO.

What makes a good SLI?

While all SLIs are metrics, not all metrics qualify as good SLIs. For example, CPU utilization is a metric but not a good SLI. DevOps may have alerts for CPU utilization, memory usage, etc., as they can be early warning signs that reliability is about to suffer. However, they do not make good SLIs because the customer does not directly experience them.

Availability, latency, and error rate, by contrast, are customer-facing metrics that the customer directly experiences. Individual APIs may also have feature-level SLIs, such as “claims processed”. However, availability, latency, and error rate are general metrics that should be tracked for all VA APIs.

Guidance

  • Availability, latency, and 5xx error rate are recommended metrics to be tracked for all VA APIs.

To track SLIs, turn them into a ratio of events being measured over total events. Then, measure that ratio over time.
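As a rough sketch of that pattern (the function name and example counts below are illustrative assumptions, not VA tooling), an SLI reduces to a good-events-over-total-events percentage computed for each measurement window:

```python
def sli_percentage(good_events: int, total_events: int) -> float:
    """Express an SLI as the percentage of good events out of all events in a window."""
    if total_events == 0:
        # No traffic in the window; treating this as fully compliant is one
        # possible convention (an assumption, not a VA requirement).
        return 100.0
    return 100.0 * good_events / total_events

# Illustrative window: 9,950 successful requests out of 10,000 total
print(f"{sli_percentage(9_950, 10_000):.2f}%")  # 99.50%
```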

The first metric, availability, is the percentage of time a service is usable. It is usually expressed as a ratio of the hours the service is available to the total hours in the measurement window.
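For instance, a 30-day availability calculation from downtime hours might look like the following sketch (the 5-hour outage is an illustrative figure, echoed in the example table below):

```python
WINDOW_DAYS = 30
total_hours = WINDOW_DAYS * 24            # 720 hours in the window
downtime_hours = 5                        # e.g., a 5-hour maintenance upgrade
uptime_hours = total_hours - downtime_hours

availability = 100.0 * uptime_hours / total_hours
print(f"{availability:.1f}% available")   # 99.3% available
```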

The second metric, latency, measures how long the service takes to respond to a request. Compliance is measured as the percentage of requests with response times under a target value within a given time interval. For example, an SLO might be, “latency for 90% of all requests must be 1000ms or below”. Some API endpoints may be slower than others due to system complexity, data transformations, and the nature of the operation (e.g., writes, complex reads, form processing, and file uploads). For more information, see Handling Latency Outliers.
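As a sketch of how latency compliance against a 1000ms target could be computed from collected response times (the sample data below is made up for illustration):

```python
TARGET_MS = 1000

# Response times in milliseconds observed during the measurement interval (illustrative)
response_times_ms = [120, 340, 980, 1500, 450, 2100, 610, 95, 870, 990]

within_target = sum(1 for t in response_times_ms if t <= TARGET_MS)
latency_sli = 100.0 * within_target / len(response_times_ms)

print(f"{latency_sli:.0f}% of requests responded in {TARGET_MS}ms or less")  # 80%
```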

Finally, error rate measures how often an error occurs, represented as a ratio of all requests received (number of requests resulting in errors / total number of requests) and typically displayed as a percentage. VA recommends calculating two error rates, one for server-side errors (5xx errors) and one for consumer-supplied data errors (4xx errors), to avoid skewed results from a single total error rate calculation.
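A minimal sketch of the two-ratio approach, using the illustrative counts from the example table below and keeping the 4xx and 5xx calculations separate:

```python
total_responses = 10_000
responses_4xx = 40    # consumer-supplied data errors
responses_5xx = 50    # server-side failures

error_rate_4xx = 100.0 * responses_4xx / total_responses
error_rate_5xx = 100.0 * responses_5xx / total_responses

print(f"4xx error rate: {error_rate_4xx:.1f}%")  # 0.4%
print(f"5xx error rate: {error_rate_5xx:.1f}%")  # 0.5%
```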

Example Calculations

| Metric | Ratio | Ratio Example | Percentage Example |
|---|---|---|---|
| Availability (30-day window) | Uptime / Total time | 715/720 hours (720 = 30 × 24; down for 5 hours for a maintenance upgrade) | 99.3% available |
| Latency (1000ms target) | Requests with response times < 1000ms / Total requests | 9500/10000 | 95% of requests are faster than 1000ms |
| 4xx Error Rate | 4xx responses / Total responses | 40/10000 | 0.4% error rate |
| 5xx Error Rate | 5xx responses / Total responses | 50/10000 | 0.5% error rate |