On-Call Procedure
This section outlines the steps to take when an alert occurs while you are on call. There are two sources of alerts: AlertManager and Firebase. Firebase alerts originate from the app. AlertManager alerts originate from Datadog and are caused by errors or latency from the back-end API.
On-Call Rotation
Each week a backend engineer is on call. Their on-call hours are the same as their business hours, and a Slack reminder will show up in va-mobile-app-alerts each Monday tagging whoever is on for that week.
Handling Backend Alerts
- First, use the tools described above to track down the source of an issue.
- Services in Datadog give a good overview of the health of our endpoints and are a great starting point for diving deeper into various issues.
- Logs in Datadog can help you find more data or trace the requests that preceded the error.
- Datadog's Application Performance Management (APM) tool is also configured for vets-api. It breaks down the Ruby, database, and upstream calls so you can determine the source of the latency. APM also provides p50 and p99 latency data, which tells us how the slowest 50% and 1% of calls are performing (see the short sketch after this list).
- The Lighthouse API Status Page is helpful for finding out whether Lighthouse errors were expected.
- If you've determined that the source of the issue is an upstream service, contact the relevant party.
- If you believe a forward proxy is down or is having trouble connecting to the service, contact the Operations team via DSVA Slack's #vfs-platform-support channel. To open a support ticket, type /support. This will open a modal window with a form rather than posting a Slack message. For the 'I need help from' field, select 'Operations Team', then add the details in the 'Summary of request' field. Additionally, if you are unsure of who to contact, you can make a support request.
- Finally, if the error is not from the API, a forward proxy connection to an upstream service, or an upstream service itself, but rather an issue with infrastructure that we (and VSP/VFS) control, then a SNOW ticket should be opened. Only a DSVA team member can do this. Reach out to a stakeholder and have them open a SNOW ticket for you.
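For context on the p50 and p99 numbers mentioned above, here is a minimal sketch (not part of our tooling; the sample latencies are made up) of how a percentile is read off a set of request timings:

```python
# Hypothetical request latencies in milliseconds for one endpoint.
latencies_ms = [120, 95, 410, 87, 102, 98, 1500, 110, 105, 99]

def percentile(samples, pct):
    """Return the value at the given percentile using the nearest-rank method."""
    ordered = sorted(samples)
    rank = max(1, round(pct / 100 * len(ordered)))
    return ordered[rank - 1]

p50 = percentile(latencies_ms, 50)  # half of requests finish at or under this value
p99 = percentile(latencies_ms, 99)  # all but the slowest 1% finish at or under this value
print(f"p50={p50}ms p99={p99}ms")
```

Datadog computes these values for you; the takeaway is that p50 describes the typical request while p99 surfaces the slowest tail, so a healthy p50 with a spiking p99 usually points at a subset of slow calls rather than a general slowdown.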
Handling New Issues
The on-call engineer is responsible for monitoring the mobile-new-issue-alerts channel, looking into each new issue, and determining whether the issue warrants a ticket for further investigation or remediation. The engineer should add a comment to the issue in the channel with either a link to the investigation ticket or an explanation of why it does not warrant a ticket. See the new issue monitor documentation for more information about new issues.
Other Slack Channels to Monitor
There are other Slack channels that the on-call engineer should pay attention to in case there are updates to maintenance windows and/or other urgent changes. For these channels, the on-call engineer should only need to pay attention to @here and @channel messages.