Postmortem 2023-01-26 — #1
- Incident Type: Security Risk
- Severity: Critical
- Impact: None — thankfully
It's not every day that the President of the United States calls the CEO of your customer. The risk of a nation-state attack went from a hypothetical tabletop exercise to highly likely.
- Proactive incident management — thanks to the Heartbleed vulnerability earlier in 2014, we had set up a more robust incident response process. Instead of having two people working around the clock, once the incident was declared based on the change in risk posture, 50 engineers were pulled into a structured organization and worked in time-limited shifts to secure as much as we could before the Christmas Day release.
- Don't let a good incident go to waste — there was a backlog of known security tasks that engineers viewed as nice-to-haves. In an instant, they became P0/P1 tasks, which accelerated our security roadmap by years over a two-week period.
Postmortem 2023-01-26 — #2
- Incident Type: Production
- Severity: Major
- Impact: Affected top 10% customers
Customer A was seeing Customer B's data, and Customer C was seeing both A's and B's data, in their data warehouse. It ended up being a false alarm caused by a web crawler.
- Check the data — the incident could have been resolved sooner if, rather than taking a macro, zoomed-out approach, we had looked at a couple of examples of the data to see what they had in common. It turned out that all the events were being submitted by a beta build of Chrome for Windows.
- It's not always our fault, but it's usually our problem — the incident should have been escalated much sooner, but because of the time of year, support couldn't get hold of anyone in engineering to take the customer reports seriously.
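
The "check the data" lesson above can be made concrete with a small sketch: take a handful of suspicious events and list the field values they all share. This is only an illustration, not the tooling used during the incident. The function name `common_attributes`, the event field names (`customer`, `browser`, `os`), and the sample values are all hypothetical.

```python
from collections import Counter


def common_attributes(events, min_share=1.0):
    """Return sorted (field, value) pairs present in at least
    min_share of the given events (1.0 means every event)."""
    if not events:
        return []
    counts = Counter()
    for event in events:
        for field, value in event.items():
            counts[(field, value)] += 1
    threshold = min_share * len(events)
    shared = [pair for pair, n in counts.items() if n >= threshold]
    # Sort by field name, then by the string form of the value.
    return sorted(shared, key=lambda pair: (pair[0], str(pair[1])))


# Hypothetical sample of suspicious events; fields and values are made up.
sample = [
    {"customer": "A", "browser": "Chrome beta", "os": "Windows"},
    {"customer": "B", "browser": "Chrome beta", "os": "Windows"},
    {"customer": "C", "browser": "Chrome beta", "os": "Windows"},
]

for field, value in common_attributes(sample):
    print(f"{field} = {value}")
```

On a sample like this, the customer differs from event to event while the browser and operating system do not, which is the kind of shared trait that points at a single client rather than a data leak.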
Chatham House Rule