
Site Reliability Engineer

Remote: Full Remote
Contract:
Experience: Mid-level (2-5 years)
Work from:

Offer summary

Qualifications:

  • Strong experience with application and web servers
  • Familiarity with infrastructure automation and CI/CD tools
  • Experience with multiple cloud platforms
  • 2+ years of incident resolution experience in operations
  • 3+ years of programming experience with various languages

Key responsibilities:

  • Define and refine service lifecycle from inception to retirement
  • Measure and monitor services for availability and latency
  • Design, build, and maintain logging and telemetry systems
  • Automate manual operational work via coding and testing
  • Troubleshoot priority incidents and support production systems
Catchpoint SME https://www.catchpoint.com/
201 - 500 Employees

Job description

Who monitors the monitoring system? A Site Reliability Engineer at Catchpoint is responsible for supporting the systems that run Catchpoint’s global monitoring platform. In this role, you will work directly with operations and development teams to build and automate infrastructure deployment at scale using infrastructure as code (IaC), then monitor that infrastructure to ensure Catchpoint delivers a scalable and highly reliable system to its customers.

 

What will success look like in this position? 

The role requires an operational mindset and a love of solving problems on a global scale with solutions that ensure high reliability and availability. You’ll explore and make sense of system telemetry, logs, and passive monitoring data, and use our own synthetic monitors to build automation that controls, rolls out, and maintains our platform.

 

Responsibilities 

  • Define and refine the whole service lifecycle - from inception and design, through deployment, operation and finally retirement.
  • Assess services once they are live by measuring and monitoring availability, latency and overall system health. Establish performance baselines, and define actions and automations based on data correlated from multiple sources.
  • Design, build, and maintain logging and telemetry systems that are used to manage all services.
  • Design, code, test, and deliver software to automate manual operational work.
  • Troubleshoot priority incidents, facilitate blameless post-mortems and ensure permanent closure of incidents.
  • Identify application patterns and analytics in support of better service level objectives.
  • Deploy and maintain systems that run on multiple cloud providers (AWS, GCP, Azure, Alibaba, Tencent, Oracle, IBM) and physical systems around the world.
  • Be part of an on-call rotation to support production systems. 

 

 Required Skills & Qualifications 

  • Strong experience with, or knowledge of, administering application servers, web servers, and databases.
  • Familiarity with infrastructure automation, configuration management and CI/CD tools (preferably Terraform).
  • Experience with multiple cloud platforms (AWS, GCP, Azure)
  • Good networking knowledge and experience with Internet Architecture (BGP, peering, DNS).
  • 2+ years of incident resolution experience in a large-scale operations environment.
  • Hands-on experience with cloud deployment, monitoring, and ops analysis tools such as Prometheus, Elasticsearch, Grafana, Kibana, Splunk, Terraform, Jenkins, etc.
  • 3+ years of programming experience with Python, Bash, PowerShell, C, etc.
  • Virtualization experience required. 
  • BS degree in Computer Science or related technical field involving coding or equivalent practical experience.
  • Appreciation of the value of diverse opinions.

 

 

Overview

Catchpoint is the Internet Resilience Company™. The world's top online retailers, Global 2000 companies, CDNs, cloud service providers, and xSPs rely on Catchpoint to increase their resilience by catching issues in the Internet stack before they impact their business. The Catchpoint platform offers synthetics, RUM, performance optimization, high-fidelity data and flexible visualizations with advanced analytics. It leverages thousands of global vantage points (including inside wireless networks, BGP, backbone, last mile, endpoint, enterprise, ISPs and more) to provide unparalleled observability into anything that impacts your customers, workforce, networks, website performance, applications and APIs.

Catchpoint is an equal opportunity employer that strongly prohibits discrimination and harassment of any kind. We celebrate diversity and are committed to creating an inclusive and engaging environment for all employees. We welcome applications from all candidates and look forward to receiving yours!


Required profile

Experience

Level of experience: Mid-level (2-5 years)
Spoken language(s): English

Other Skills

  • Troubleshooting (Problem Solving)
  • Problem Solving
  • Diversity Awareness
