Expert Site Reliability Engineer - GCP

Remote: Full Remote
Experience: Mid-level (2-5 years)

Offer summary

Qualifications:

  • Bachelor’s degree in Computer Science, Information Technology, or equivalent experience.
  • 8+ years of experience in cloud operations, reliability engineering, or infrastructure management.
  • Certifications such as GCP Professional Cloud Architect or GCP Professional DevOps Engineer are preferred.
  • Expertise in Google Cloud services, Infrastructure as Code tools like Terraform and Ansible, and strong knowledge of SRE principles.

Key responsibilities:

  • Manage and maintain GCP infrastructure to ensure high availability and reliability.
  • Monitor resource utilization and performance trends for capacity planning and cost optimization.
  • Design and implement resilient cloud architectures using Infrastructure as Code tools.
  • Collaborate with DevOps teams to streamline deployment processes and maintain documentation.

DeepSource GmbH Startup http://www.deepsource.ai
51 - 200 Employees

Job description

We are seeking an experienced Google Cloud Platform (GCP) Site Reliability Engineer (SRE) to manage daily operational workloads, ensuring the reliability, scalability, and cost efficiency of cloud infrastructure. The ideal candidate will have deep expertise in capacity planning, performance optimization, infrastructure design, and FinOps best practices to maintain an efficient and cost-effective GCP environment.

Key Responsibilities:

• Operations & Reliability: Manage and maintain GCP infrastructure, ensuring high availability, scalability, and system reliability.

• Capacity Planning & Optimization: Monitor and forecast resource utilization, performance trends, and infrastructure scaling needs to optimize cloud costs and efficiency.

• Infrastructure Design & Automation: Design and implement highly available, fault-tolerant, and resilient cloud architectures, leveraging Infrastructure as Code (IaC) tools such as Terraform and Ansible.

• Performance Monitoring & Incident Response: Utilize Google Cloud Monitoring, Cloud Logging, and third-party tools to proactively detect and resolve performance issues.

• FinOps & Cost Management: Analyze and optimize cloud spending, implement cost controls, recommend rightsizing strategies, and ensure efficient resource allocation.

• Security & Compliance: Implement best practices for IAM, network security, encryption, and compliance frameworks (SOC2, ISO 27001, NIST).

• CI/CD & DevOps Integration: Collaborate with DevOps teams to streamline deployment processes, automate workflows, and optimize application performance.

• Disaster Recovery & High Availability: Design and implement disaster recovery (DR) plans, backup strategies, and failover mechanisms to ensure business continuity.

• Documentation & Collaboration: Maintain comprehensive documentation of infrastructure, best practices, and optimization strategies while working closely with cross-functional teams.

Requirements

Qualifications:

• Education: Bachelor’s degree in Computer Science, Information Technology, or equivalent experience.

• Experience: 8+ years of experience in cloud operations, reliability engineering, or infrastructure management.

• Certifications: GCP Professional Cloud Architect, GCP Professional DevOps Engineer, or equivalent is preferred.

• Technical Proficiency:

  • Expertise in Google Cloud networking, Compute Engine, Kubernetes (GKE), Cloud Functions, and Cloud Storage.

  • Strong knowledge of Terraform, Ansible, or other Infrastructure as Code (IaC) tools.

  • Experience with Google Kubernetes Engine (GKE), microservices, and container orchestration.

  • Hands-on experience with FinOps tools and cost optimization strategies in cloud environments.

  • Familiarity with monitoring and logging solutions such as Google Cloud's operations suite (formerly Stackdriver), Prometheus, and Grafana.

  • Experience with CI/CD pipelines, automation, and GitOps best practices.

  • Strong understanding of SRE principles, SLAs, SLOs, and error budgets.

Preferred Qualifications:

• Experience with multi-cloud or hybrid cloud environments.

• Knowledge of serverless computing and cloud-native application design.

• Understanding of ITIL frameworks for incident, problem, and change management.

Required profile

Experience

Level of experience: Mid-level (2-5 years)
Spoken language(s):
English

Other Skills

  • Collaboration
