We are looking for a passionate and hardworking Site Reliability Engineering Lead to continue our focus on providing our customers with the highest-quality service experience. Our services have to scale globally, stay highly available, and "just work."
If you love designing, engineering, and running systems and infrastructure that will help thousands of customers, then this is the place for you.
Description:
The RAPToR platform is a distributed cloud-based platform that serves hundreds of geographically dispersed clients, and this comes with unique challenges. As an SRE professional you'll need to solve these problems using data, teamwork, and your expertise. Our SREs own the full RAPToR platform stack in the production environment, from resolving issues related to application functionality to addressing infrastructure disasters. Our responsibilities are both broad and deep.
The RAPToR platform runs on the Microsoft Azure cloud and follows a microservices architecture. We run a mix of open-source, vendor-licensed, and internally developed tools to perform functions such as provisioning, software deployment, logging, and monitoring. You'll learn these tools and have opportunities to improve them. Our team is collaborative; we work closely with the development teams we support to deliver the best results. We aim to balance the best solution with the need to get things done for each engineering challenge we face.
Responsibilities:
Lead and mentor a team of SREs to ensure the reliability, availability, and performance of our large distributed web platform.
Foster a collaborative and inclusive team environment, encouraging continuous learning and professional growth.
Set clear goals and expectations for the SRE team, providing regular feedback and performance evaluations.
Develop and implement automation strategies to streamline operations, reduce manual intervention, and improve overall system reliability.
Identify opportunities for automation across the infrastructure and application lifecycle, from deployment to monitoring and incident response.
Ensure that automation tools and scripts are well-documented, maintainable, and scalable.
Design and implement preventive infrastructure monitoring solutions, including synthetic tests, to proactively identify and address potential issues.
Develop and maintain monitoring dashboards and alerting systems to provide real-time visibility into system health and performance.
Continuously improve monitoring and alerting processes to reduce false positives and ensure timely detection of critical issues.
Collaborate with engineering teams to ensure that observability and resiliency requirements are met for all new and existing services.
Provide guidance on best practices for logging, monitoring, and alerting to ensure comprehensive observability.
Work closely with development teams to design and implement resilient architectures that can withstand failures and recover quickly.
Coordinate the support of code release and go-live activities, ensuring smooth and reliable deployments.
Conduct post-release reviews to identify areas for improvement and ensure that lessons learned are applied to future releases.
Conduct regular performance tuning exercises to optimize system performance and ensure efficient resource utilization.
Perform capacity planning to anticipate future growth and ensure that the infrastructure can scale to meet demand.
Plan and execute disaster recovery exercises to validate the effectiveness of backup and recovery procedures.
Stay up to date with industry trends and best practices in SRE, cloud computing, and automation.
Continuously evaluate new tools and technologies to enhance the reliability, scalability, and efficiency of the platform.
Share knowledge and insights with the team and the broader organization to promote a culture of continuous improvement and innovation.
Requirements:
Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience).
Strong communication and leadership skills, with the ability to work effectively in a collaborative team environment.
Proven experience as an SRE or in a similar role, with a focus on large distributed web platforms.
Strong expertise in Azure cloud services and infrastructure management.
Proficiency in Infrastructure as Code (IaC) tools such as AWS CloudFormation/CDK, Azure Bicep/ARM templates, Terraform, or similar.
Experience with container platforms and orchestrators such as Azure Container Apps and Kubernetes.
Familiarity with serverless computing frameworks such as Azure Functions or AWS Lambda.
Knowledge of Content Delivery Networks (CDNs) and their configuration and management.
Experience with maintenance, performance monitoring, and tuning of heavily loaded SQL Server instances.
Experience with messaging and streaming platforms such as Azure Service Bus, Azure Event Hubs, and Kafka.
Strong scripting and automation skills using languages such as Python, Bash, or PowerShell.
Experience with monitoring and observability tools such as Azure Monitor, AWS CloudWatch, Prometheus, and Grafana.
Excellent problem-solving skills and the ability to troubleshoot complex issues in a distributed environment.
Preferred Qualifications:
Experience with CI/CD pipelines and tools such as Jenkins, GitLab CI, or Azure DevOps.
Familiarity with the .NET ecosystem and the C# language.
Knowledge of security best practices and compliance requirements in cloud environments.
Familiarity with Agile and DevOps methodologies.