Senior Incident Manager (Remote - US)

Jobgether
United States
Remote
Full-time
Posted 11 days ago

Job Description

This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior Incident Manager in the United States.

This role offers a critical leadership opportunity in managing high-impact incidents for cloud-based services. You will coordinate cross-functional teams during major incidents, ensuring swift resolution while maintaining clear, accurate, and timely communication with stakeholders and customers. The position combines operational leadership, technical expertise, and strong communication skills to drive reliability, root cause analysis, and continuous improvement. You will mentor peers, improve incident response processes, and influence how complex distributed systems are monitored and maintained. This role is ideal for someone passionate about operational excellence, proactive problem solving, and driving confidence in technical systems during high-pressure events.

Accountabilities

  • Lead critical production incidents, coordinating multi-disciplinary response teams to mitigate impact and restore operations rapidly.
  • Drive root cause analysis and collaborate with engineering teams to implement long-term reliability improvements.
  • Summarize key learnings from incidents, communicate actionable items, and ensure follow-through of technical and procedural improvements.
  • Own incident communications, providing timely, accurate updates to internal stakeholders and empathetic notifications to customers.
  • Mentor and train colleagues in incident management, communication best practices, and technical response strategies to elevate overall team performance.
  • Continuously refine incident response processes, playbooks, and automation to improve efficiency and reduce downtime.

Requirements

  • 5+ years of experience in incident management, site reliability engineering, or production operations for large-scale, cloud-native systems.
  • Proven ability to lead high-severity incidents, identify impacts, isolate fault domains, and coordinate multi-team responses.
  • Strong knowledge of cloud infrastructure (AWS, Azure, or GCP) including compute, networking, storage, and observability.
  • Hands-on experience with log analysis, debugging, and observability systems (Datadog, Elasticsearch, Splunk, Prometheus, Grafana, OpenTelemetry, etc.).
  • Proficiency in at least one programming or scripting language (Python, Go, Bash) for diagnostics and automation.
  • Experience creating and maintaining incident playbooks and communication templates for consistent, high-quality updates.
  • Exceptional communication and writing skills to summarize complex technical situations for both technical and business audiences.
  • Bachelor’s, Master’s, or other advanced degree in Computer Science, Computer Engineering, or a related technical field.
