At Cequens, we’re building CPaaS and SaaS communication platforms, and we rely on our SRE team to hold those platforms to the highest standards of performance, reliability, availability, and stability. You’ll collaborate with cross-functional teams to develop solutions for managing our platforms. If you’re passionate about automation, software reliability, and continuous improvement, this role is perfect for you.
Key Roles and Responsibilities:
- Work with different teams to implement and improve SLIs, SLAs, and SLOs.
- Proactively monitor system health, performance, and reliability metrics.
- Design, implement, and maintain automation tools and infrastructure to streamline operational tasks.
- Conduct capacity planning and scalability assessments to accommodate growing demands.
- Collaborate with software development teams to improve system reliability, performance, and efficiency.
- Participate in incident response activities, diagnosing and resolving issues to minimize downtime and service disruptions.
- Contribute to the evolution of best practices and standards for reliability engineering within the organization.
- Own effective post-mortems and ensure action items are followed up.
Requirements:
- Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
- Awareness of scalability, reliability, and security guidelines.
- Proficiency in Linux, NGINX, HAProxy, and Git.
- Strong scripting skills in Bash and/or Python.
- Experience with configuration management tools like Ansible.
- Familiarity with cloud platforms like AWS, Azure, or Google Cloud.
- Understanding of networking principles and protocols (TCP/IP, HTTP, DNS, etc.).
- Knowledge of containerization and orchestration technologies (Docker, Kubernetes).