Resend is committed to providing reliable and consistent service to our customers. However, we recognize that incidents might occur. This document outlines our approach to incident management, including detection, response, communication, and post-incident review.
An incident is declared if there is customer impact or degraded performance. An internal incident can also be declared if we anticipate customer impact or degraded performance. This allows us to be proactive in our response and minimize the impact on our customers.
We believe in transparency. If an incident has customer impact, it should be declared. No incident is ignored for being too small. This is the only way we can build trust with our customers.
We also do not believe in shifting blame. In the unfortunate case that our providers are experiencing issues, we declare an incident and work with them to resolve it. Uptime is our responsibility, whether the issue lies in our own service or with one of our providers.
If an incident is declared, we will communicate it to our customers via our status page. In situations where a single customer is impacted and we have a direct relationship with them, we work with them via Slack.
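The declaration and communication rules above can be sketched as two small functions. This is only an illustration: the names `Signal`, `Declaration`, `declare`, and `channel` are hypothetical and do not correspond to any tool we use.

```python
from dataclasses import dataclass
from enum import Enum


class Declaration(Enum):
    NONE = "none"
    INTERNAL = "internal"  # impact is anticipated but has not happened yet
    INCIDENT = "incident"  # customers are impacted or performance is degraded


@dataclass(frozen=True)
class Signal:
    """Hypothetical summary of what we currently know about a problem."""
    customer_impact: bool = False
    degraded_performance: bool = False
    impact_anticipated: bool = False


def declare(signal: Signal) -> Declaration:
    # Any customer impact or degradation is declared, however small.
    if signal.customer_impact or signal.degraded_performance:
        return Declaration.INCIDENT
    # Anticipated impact is declared internally so we can act early.
    if signal.impact_anticipated:
        return Declaration.INTERNAL
    return Declaration.NONE


def channel(affected_customers: int, direct_relationship: bool) -> str:
    # A single customer we already talk to directly is handled via Slack;
    # everything else goes to the status page.
    if affected_customers == 1 and direct_relationship:
        return "slack"
    return "status_page"
```

Note that the sketch takes no size threshold as input: the only question is whether impact exists or is expected.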
We have three severity levels for incidents:
At Resend, we have a weekly rotating role across the engineering team known as The Fixer. For additional information on how we manage the rotation, please refer to How we handle on-call rotations.
It is the responsibility of The Fixer to work on outstanding customer requests and new issues, and to be on call for incidents.
Incidents are managed through incident.io. The Fixer will monitor our systems, mainly through Datadog monitors that send alerts to Slack and incident.io. The Fixer will be paged by incident.io, and a Slack channel will automatically be created for the incident.
Incidents can also be created by anyone in the Resend team on Slack using the incident.io Slack integration.
We recognize that The Fixer will do their best to be the first responder to incidents, but may not always be available. In these situations, the secondary on-call (the previous week's Fixer) will be paged. Retry intervals and additional escalation policies ensure that incidents are acknowledged in a timely manner.
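As a rough sketch of the rotation and escalation described above, the following computes who is primary and secondary in a given week and the order in which they would be paged. The roster, rotation start date, number of attempts, and retry interval are all assumptions made for illustration; the real values live in our incident.io configuration and are not stated here.

```python
from datetime import date

# Hypothetical roster and rotation start; neither comes from our real setup.
ROSTER = ["ana", "bo", "chi", "dev"]
ROTATION_START = date(2024, 1, 1)  # a Monday


def on_call(day: date, roster: list[str] = ROSTER) -> tuple[str, str]:
    """Return (fixer, secondary) for the week containing `day`."""
    week = (day - ROTATION_START).days // 7
    fixer = roster[week % len(roster)]
    secondary = roster[(week - 1) % len(roster)]  # the previous week's Fixer
    return fixer, secondary


def page_schedule(fixer: str, secondary: str,
                  attempts_each: int = 2,
                  retry_minutes: int = 5) -> list[tuple[int, str]]:
    """List (minutes after alert, person paged) until someone acknowledges.

    attempts_each and retry_minutes are assumed values.
    """
    schedule = []
    minute = 0
    for responder in (fixer, secondary):
        for _ in range(attempts_each):
            schedule.append((minute, responder))
            minute += retry_minutes
    return schedule
```

Counting weeks from a fixed start date, rather than using the calendar week number, keeps the rotation continuous across year boundaries.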
Once the incident is acknowledged, it will be triaged and either accepted or declined.
The first responder, usually The Fixer, will request more engineers to help if the incident is accepted. The Fixer will assign the Incident Lead (IL) and Communication Lead (CL) roles.
The Fixer is usually the Incident Lead, but can assign someone else if they are better suited to coordinate the incident response. The Communication Lead is responsible for updating the public (and internal) status page and communicating with customers. The Incident Lead can also work on resolving the incident, but will rely on the rest of the team to help resolve it quickly.
It is common for the team to jump on a Google Meet call to coordinate the incident response. If the incident is caused by a code regression, a rollback will be performed to restore service quickly. Once the service is stable, the team can work on a fix and redeploy.
The Communication Lead will provide status page updates as information becomes available, generally within 30–60 minutes for major and critical incidents.
The team will work together to resolve the incident as quickly as possible. Once the incident is resolved, the Incident Lead will announce the resolution and the Communication Lead will update the status page.
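The response flow above (acknowledge, triage, accept or decline, resolve) can be modelled as a small state machine. This is a minimal sketch under assumptions: the status names and the `Incident` class are hypothetical and do not mirror incident.io's data model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    TRIGGERED = "triggered"
    ACKNOWLEDGED = "acknowledged"
    ACCEPTED = "accepted"
    DECLINED = "declined"
    RESOLVED = "resolved"


# Allowed transitions; declined and resolved are terminal.
ALLOWED = {
    Status.TRIGGERED: {Status.ACKNOWLEDGED},
    Status.ACKNOWLEDGED: {Status.ACCEPTED, Status.DECLINED},
    Status.ACCEPTED: {Status.RESOLVED},
    Status.DECLINED: set(),
    Status.RESOLVED: set(),
}


@dataclass
class Incident:
    fixer: str
    status: Status = Status.TRIGGERED
    incident_lead: Optional[str] = None
    communication_lead: Optional[str] = None

    def move_to(self, new_status: Status) -> None:
        if new_status not in ALLOWED[self.status]:
            raise ValueError(
                f"cannot go from {self.status.value} to {new_status.value}"
            )
        self.status = new_status

    def accept(self, communication_lead: str,
               incident_lead: Optional[str] = None) -> None:
        """Accept after triage and assign both roles.

        The Fixer becomes Incident Lead unless someone else is named.
        """
        self.move_to(Status.ACCEPTED)
        self.incident_lead = incident_lead or self.fixer
        self.communication_lead = communication_lead
```

For example, `Incident(fixer="bo")` followed by `move_to(Status.ACKNOWLEDGED)` and `accept(communication_lead="chi")` leaves "bo" as Incident Lead, matching the default described above.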
We treat every incident as an opportunity to learn, grow, and take ownership. Getting to the root cause helps us avoid repeat incidents and continuously improve on product quality and reliability.
For every incident, the IL prepares the incident report, which includes a timeline of events, impact assessment, contributing factors, resolution steps, and follow-up actions.
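One way to picture the report is as a record with one field per section listed above, plus a check for sections that are still empty. The `IncidentReport` class and `missing_sections` helper are hypothetical, shown only to make the structure concrete.

```python
from dataclasses import dataclass, field, fields


@dataclass
class IncidentReport:
    # One field per section of the report.
    timeline: list[str] = field(default_factory=list)
    impact_assessment: str = ""
    contributing_factors: list[str] = field(default_factory=list)
    resolution_steps: list[str] = field(default_factory=list)
    follow_up_actions: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Names of sections that are still empty, in declaration order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]
```

A check like this could be run before the report is shared, so that no section is left blank by accident.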
Once completed, the incident report is shared with the team and discussed in a blameless post-mortem meeting. The focus is on learning and improvement, not blame. Follow-up actions are created in Linear, assigned to teams, and tracked with deadlines based on the severity of the incident.
For a more detailed breakdown of our post-incident process, please refer to How we handle post incident reviews.