Site Reliability Engineer

Remote:

Full Remote

Work from:

South Africa

Offer summary

Qualifications:

Experience with monitoring and alerting tools.
Expert in cloud operations and automation.
Strong background in system performance tuning.
Familiarity with telecommunications technologies.

Key responsibilities:

Develop real-time monitoring and alert systems.
Identify anomalies to reduce downtime proactively.

ANYWHERE365® SME http://anywhere365.io

201 - 500 Employees

Job description

Founded in 2010 in the Netherlands, Anywhere365 is a global leader in Enterprise Dialogue Management, with a vision to ensure every employee and customer feels heard, understood, and valued. With around 240 employees working from 22 countries, we partner with over 2,000 leading enterprises, including Mazda, the UN International Organization for Migration, Adecco Group, and the University of Cape Town, to deliver exceptional customer experiences through the power of Microsoft Teams and AI-driven insights. Our commitment to innovation, customer focus, and accountability drives our success.

We are looking for a highly skilled and driven Site Reliability Engineer (SRE) to join our team, with a strong emphasis on communications technologies, cloud operations, and system performance. This role requires expertise in monitoring, alerting, anomaly detection, automation, security, and performance tuning across our critical communications platforms. You will be responsible for the reliability, availability, and performance of services such as SIP, Skype for Business, and Azure Communication Services (ACS). Your role will also focus on optimizing resource utilization, managing costs, and ensuring disaster recovery and business continuity (BCP/DR).

Main responsibilities:

Develop and maintain real-time monitoring and alerting systems using tools like Prometheus, Grafana, and the ELK stack to ensure system health and performance.
Identify and resolve anomalies and bottlenecks proactively, reducing downtime through automated detection and alert mechanisms.
Automate infrastructure provisioning, scaling, and patching using tools like Terraform and Azure DevOps across Kubernetes, Windows, and Linux environments.
Build self-healing systems and leverage Kubernetes operators, CI/CD pipelines, and event-driven automation to improve reliability.
Analyze and optimize system performance for latency-sensitive services, including VoIP, video, and messaging.
Implement cloud cost optimization strategies, such as using Reserved Instances, rightsizing virtual machines, and leveraging Azure Cost Management tools.
Strengthen system security by enforcing best practices for hardening, vulnerability patching, and incident management in collaboration with security teams.
Design and execute robust disaster recovery plans, ensuring fault-tolerant architectures and reliable backup and restore strategies.