The Experience of On-call (Paging) for Amazon Software Engineers


From 2020 to 2021, I worked as a Software Development Engineer (SDE) at Amazon, in Amazon Web Services (AWS). Some say SDE stands for "Someone Does Everything," and that’s quite accurate. Every SDE at Amazon is responsible for their code from design to testing and must also handle on-call duties periodically. Most Amazon teams consist of 6-8 engineers and a Software Development Manager (SDM), adhering to the "Two Pizza Team" principle.

Each SDE typically takes an on-call shift every 6-7 weeks, covering a 24/7 schedule for an entire week. During this period, you're responsible for mitigating and fixing any operational issues. This requires installing pager software (Pong) on your phone to receive alarms related to your team's products. When an alarm goes off, you must acknowledge the ticket within 15 minutes. If you fail to do so, the issue escalates up the management chain, potentially reaching the CEO (Jeff Bezos), and a missed page could negatively impact your performance review.
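To make the escalation rule concrete, here is a minimal Python sketch. It is not how Pong actually works: only the 15-minute acknowledgement window comes from my experience above. The chain of roles and the idea that every level gets the same 15-minute window are assumptions made for illustration.

```python
from datetime import timedelta

# From the text: a page must be acknowledged within 15 minutes.
ACK_WINDOW = timedelta(minutes=15)

# Hypothetical escalation chain; the real chain and its length are assumptions.
ESCALATION_CHAIN = ["on-call SDE", "SDM", "senior manager", "director"]


def current_owner(unacknowledged_for: timedelta) -> str:
    """Return who holds the page after it has gone unacknowledged this long."""
    # Assumption: each level gets the same 15-minute window before the page
    # moves one step up the chain.
    level = unacknowledged_for // ACK_WINDOW
    # Stay at the top of the chain once it has been reached.
    return ESCALATION_CHAIN[min(level, len(ESCALATION_CHAIN) - 1)]
```

For example, `current_owner(timedelta(minutes=5))` returns `"on-call SDE"`, while `current_owner(timedelta(minutes=40))` returns `"senior manager"` under these assumptions.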

My team ran 2-week sprints (Agile development). The on-call engineer should prioritize on-call DevOps work, which is tracked as a sprint task with points; if little or no DevOps work is needed, he or she can pick up other development tasks. In the week after the shift, the on-call engineer needs to complete an on-call report and present it at the weekly DevOps meeting to many engineers and stakeholders. This was the most stressful part, as the engineer needs to investigate each issue, find the root cause, and apply the fix.

Amazon's philosophy is that you own your code. This means you must address issues even in the middle of the night. Your priority is to mitigate the problem first, then investigate and resolve the root cause later. For instance, if an issue arises at 3 AM, a temporary fix like rebooting the server is acceptable, and a detailed investigation can follow later. If you're paged at night, you can rest more the next day and skip the morning standups if necessary.

While on-call duties can be stressful and depressing, they effectively train Software Engineers in DevOps skills. New products often generate more alarms due to higher sensitivity settings. Amazon is proud to provide a 99.99% SLA (Service Level Agreement) thanks to the SDE on-call culture.
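To see why a 99.99% target keeps on-call engineers busy, it helps to convert it into an allowed-downtime budget. The short Python sketch below does that arithmetic; the 30-day and 365-day periods are my own choices for illustration, not the terms of any specific AWS SLA.

```python
def downtime_budget_minutes(availability_pct: float, period_days: float) -> float:
    """Minutes of downtime allowed in a period at the given availability target."""
    unavailable_fraction = 1 - availability_pct / 100
    return unavailable_fraction * period_days * 24 * 60


# Illustrative periods (assumptions): a 30-day month and a 365-day year.
print(round(downtime_budget_minutes(99.99, 30), 2))   # about 4.32 minutes
print(round(downtime_budget_minutes(99.99, 365), 2))  # about 52.56 minutes
```

With only a few minutes of budget per month, acknowledging and mitigating quickly matters far more than finding the root cause on the spot.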

Below is an example of a page I received during my last shift at Amazon in 2021. The alarm sound can be quite jarring, reminiscent of air raid sirens, though there are options to choose less intrusive or even happier ringtones.

--EOF (The Ultimate Computing & Technology Blog) --
