Elevated API failures

Incident Report for Demandbase

Postmortem

On Friday, May 5, starting at approximately 2:15 PM PDT, Demandbase experienced a brief API outage, during which customers would have seen an increase in errors when making API calls. The total incident time was 10 minutes, though some customers may have seen errors for slightly longer depending on their location. We take the reliability and availability of our API seriously. The post-mortem below describes what occurred and what the team will do to prevent the issue from happening again.

Background

The team has a regularly scheduled update of a non-customer-facing application that occurs twice a week. During these updates, the team also makes modifications to the database.

Incident Details

During our regularly scheduled release on Friday, starting at approximately 2:10 PM PDT, the team deployed new code that included an update to the database. The database update was replicated to all slave databases, including those that support the API endpoints. When each slave database received the replicated update, the database and its associated tables were locked while the update was applied. The API endpoints reside behind a load balancer whose application health check verifies both the health of the application and the availability of the database. Because the locked tables caused the health check to fail, the load balancer removed the existing servers and brought new ones online.

The combination of the locked tables in the database and the load balancer requesting new servers caused the incident to last 10 minutes.

Outage Timeline (all times PDT)

2:10 PM - Deployment starts
2:11 PM - Deployment complete; database update starts and replication occurs automatically
2:15 PM - Alert received regarding API errors/failures
2:18 PM - Replication complete; load balancers start adding new servers and removing ones deemed unhealthy
2:23 PM - All regions have new servers
2:25 PM - All alerts cleared
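
The intervals in the timeline above can be checked with a few lines of Python. This is only an illustrative sketch: the helper name minutes_between is hypothetical, and the times are copied from the timeline in 24-hour form.

```python
from datetime import datetime

# Times taken from the outage timeline (PDT), written in 24-hour form.
DEPLOYMENT_START = "14:10"
REPLICATION_START = "14:11"
FIRST_ALERT = "14:15"
REPLICATION_COMPLETE = "14:18"
ALERTS_CLEARED = "14:25"


def minutes_between(start: str, end: str) -> int:
    """Return the whole number of minutes from start to end (same day)."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Customer-visible incident window: first alert until all alerts cleared.
incident_minutes = minutes_between(FIRST_ALERT, ALERTS_CLEARED)

# Time the replicated update took to apply on the slave databases.
replication_minutes = minutes_between(REPLICATION_START, REPLICATION_COMPLETE)

# Full span from the start of the deployment until all alerts cleared.
deployment_to_clear_minutes = minutes_between(DEPLOYMENT_START, ALERTS_CLEARED)

print(incident_minutes, replication_minutes, deployment_to_clear_minutes)
```

The first value corresponds to the 10-minute incident time reported in this post-mortem.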

Preventative Measures

The team has identified areas for improvement following this incident:

Health check modification - the health check will be modified so that endpoints are not removed from the load balancer when the database is unavailable.
Database - the team will implement a materialized view configuration for the database to avoid locks on the tables used by the API.
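
As a rough illustration of the health check modification, the sketch below separates application health from database availability, so that only an application failure takes a server out of rotation. It is not the actual Demandbase health check: the function name evaluate_health, the two probe callables, and the convention that the load balancer treats HTTP 200 as healthy and 503 as unhealthy are all assumptions made for this example.

```python
import json
from typing import Callable, Tuple


def evaluate_health(
    app_check: Callable[[], bool],
    db_check: Callable[[], bool],
) -> Tuple[int, str]:
    """Return (HTTP status, JSON body) for a load balancer health probe.

    Hypothetical sketch: app_check and db_check are caller-supplied probes.
    A probe that raises an exception is treated as a failed probe.
    """
    try:
        app_ok = bool(app_check())
    except Exception:
        app_ok = False

    try:
        db_ok = bool(db_check())
    except Exception:
        db_ok = False

    # Only an application failure marks the server unhealthy. A database
    # problem is reported in the body but does not change the status code,
    # so the load balancer keeps the server in rotation.
    status = 200 if app_ok else 503
    body = json.dumps(
        {
            "application": "ok" if app_ok else "failing",
            "database": "ok" if db_ok else "unavailable",
        },
        sort_keys=True,
    )
    return status, body


def locked_database() -> bool:
    """Simulate a database probe that times out on a locked table."""
    raise TimeoutError("table locked during replication")


status, body = evaluate_health(lambda: True, locked_database)
print(status, body)
```

With a design like this, a database lock would still be visible to monitoring through the response body, while replacing servers (which cannot fix a locked database) would be avoided.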

Summary

The Demandbase team apologizes for the issue. We take these incidents seriously and are working to ensure a reliable and highly available system.

Posted May 08, 2017 - 13:47 PDT

Resolved

This incident has been resolved.

Posted May 05, 2017 - 15:00 PDT

Investigating

From approximately 2:15 to 2:30 PM PDT, the API experienced an increase in database errors, potentially resulting in API failures.

As part of our continuous improvement process, we have identified the root cause of the issue and we will be making modifications to ensure the same issue does not occur again.

Posted May 05, 2017 - 14:40 PDT