Chaos Engineering Architect

Remote: Full Remote
Experience: Senior (5-10 years)

Offer summary

Qualifications:

  • Bachelor’s degree in Computer Science or Engineering
  • 5+ years of experience in software engineering or related fields
  • Proven chaos engineering experience in cloud environments
  • Familiarity with chaos engineering and observability tools
  • Strong knowledge of AWS, Azure, and GCP architectures

Key responsibilities:

  • Develop and execute chaos engineering strategies
  • Implement chaos experiments simulating failure scenarios
  • Collaborate with teams to integrate chaos practices into the CI/CD pipeline
  • Utilize observability tools to monitor system performance
  • Create documentation and conduct training on chaos methodologies
Scicom Infrastructure Services Information Technology & Services SME https://www.scicominfra.com/
11 - 50 Employees

Job description

Overview:

We are seeking a talented and experienced Chaos Engineering Architect to join our dynamic team. In this role, you will be responsible for designing and implementing chaos engineering practices to enhance the resilience and reliability of our cloud-based systems. You will work closely with cross-functional teams to create chaos engineering drills and ensure our observability tools provide meaningful insights into system performance and behavior under stress.

 

Key Responsibilities:

  • Design and Implementation: Develop and execute chaos engineering strategies, including chaos experiments and drills, to identify weaknesses in our cloud infrastructure and applications.
  • Cloud Environment Expertise: Leverage your experience with cloud platforms (AWS, Azure, GCP) to implement chaos experiments that simulate various failure scenarios, ensuring systems can withstand unexpected disruptions.
  • Collaboration: Partner with development, operations, and QA teams to integrate chaos engineering practices into the CI/CD pipeline, fostering a culture of reliability and resilience.
  • Observability Enhancements: Utilize observability tools and practices (e.g., Prometheus, Grafana, ELK Stack) to monitor and analyze system performance, helping teams understand the impact of chaos experiments.
  • Documentation and Training: Create comprehensive documentation for chaos engineering methodologies and conduct training sessions to upskill team members on best practices.
  • Continuous Improvement: Analyze results from chaos experiments to drive improvements in system design, architecture, and operational practices.
  • Incident Management: Collaborate with incident response teams to refine incident management processes and improve system recovery times based on findings from chaos experiments.

Qualifications:

  • Education: Bachelor’s degree in Computer Science, Engineering, or a related field; advanced degree preferred.
  • Experience:
    • 5+ years of experience in software engineering, systems architecture, or related fields.
    • Proven experience with chaos engineering principles and practices in cloud environments.
    • Familiarity with chaos engineering tools (e.g., Gremlin, Chaos Monkey, Litmus) and observability platforms.
  • Technical Skills:
    • Strong knowledge of cloud computing architectures (AWS, Azure, GCP).
    • Proficiency in programming/scripting languages (Python, Go, Java, etc.) for automation of chaos experiments.
    • Experience with observability tools (e.g., Prometheus, Grafana, Datadog) to derive insights from chaos tests.
  • Soft Skills:
    • Excellent problem-solving skills and ability to think critically under pressure.
    • Strong communication skills to effectively share insights and findings with technical and non-technical stakeholders.
    • Ability to work collaboratively in a fast-paced, agile environment.

Preferred Qualifications:

  • Experience with site reliability engineering (SRE) practices.
  • Familiarity with microservices architectures and container orchestration (e.g., Kubernetes).
  • Understanding of incident response and disaster recovery planning.

Required profile

Experience

Level of experience: Senior (5-10 years)
Industry: Information Technology & Services
Spoken language(s): English

Other Skills

  • Collaboration
  • Critical Thinking
  • Problem Solving
  • Training And Development
  • Communication
