Job Description
This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Software Reliability Engineer in the United States.
As a Software Reliability Engineer, you’ll play a crucial role in ensuring system stability, rapid issue resolution, and platform resilience across large-scale distributed systems. This role goes beyond traditional DevOps or infrastructure tasks: you’ll work directly in live systems, diagnosing complex incidents, identifying root causes, and preventing recurrences. Working closely with cross-functional teams, you will strengthen reliability across applications that impact millions of users. This position offers a highly collaborative environment where your technical insight and problem-solving skills directly protect uptime, customer trust, and overall business continuity.
Accountabilities
· Lead incident response efforts to quickly diagnose and resolve issues in distributed production environments.
· Use observability and monitoring tools (such as Dynatrace and Azure Application Insights) to identify root causes and validate resolutions.
· Collaborate with engineers across APIs, microservices, and data layers to stabilize live systems and prevent future disruptions.
· Write and run targeted automated tests using tools like Jest, Cypress, or Playwright to confirm issue resolution and improve reliability.
· Communicate root causes and fixes effectively to both technical and non-technical stakeholders.
· Partner with platform and DevOps teams to enhance monitoring, alerting, and deployment workflows.
· Participate in on-call rotations for high-priority production incidents and contribute to continuous improvement of reliability practices.
Requirements
· Minimum 2 years of experience in software engineering, production support, or incident response.
· Strong proficiency in JavaScript/TypeScript with the ability to debug live applications and services.
· Solid understanding of SQL and NoSQL databases for tracing and troubleshooting data issues.
· Experience working within Azure or GCP cloud environments.
· Proven success stabilizing distributed or microservice-based architectures.
· Excellent communication and problem-solving skills, with the ability to clearly articulate findings.
· Preferred: experience managing P0/P1 incidents, knowledge of observability tools (Dynatrace, Datadog, or OpenTelemetry), and familiarity with event-driven architectures or message queues.
