Site Reliability Engineer

extra holidays - extra parental leave - fully flexible

Remote:

Full Remote

Contract:

Full time

Experience:

Senior (5-10 years)

Work from:

Gabon, Georgia (USA), United States

Offer summary

Qualifications:

5+ years in Site Reliability Engineering
Expertise in Datadog and Azure Application Insights
Experience with .NET, Node.js, React
Hands-on with Azure App Services and SQL databases
Knowledge of CI/CD pipelines in Azure DevOps

Key responsibilities:

Implement monitoring systems using Datadog and Azure Application Insights
Integrate monitoring tools in applications for performance oversight
Establish incident response plans to minimize downtime
Optimize Azure DevOps pipelines for efficiency and reliability
Maintain application performance and troubleshoot issues via SQL queries

Edible Arrangements Retail (Super / Hypermarket) Large https://www.edible.com/

1001 - 5000 Employees

Job description

Senior Site Reliability Engineer (SRE)

Who are we and what do we do?

Fruit was just the beginning. Since our founding in 1999, we’ve evolved over 25+ years into an industry leader and modern gifting destination for celebrating the moments that matter. In addition to a robust online e-commerce hub, our vast retail footprint includes nearly 1,000 locally owned and operated franchise locations globally.

With offerings that go beyond our iconic fresh fruit bouquets to include baked treats, fresh flowers, dessert boards, platters, and more, our collection of delicious treats and innovative gifts is perfect for treating yourself and others.

No matter the occasion or moment, there’s an edible® for that.

Through all our incredible years, we’ve remained committed to our 5Ps:

• Our promise – Experiences that WOW.

• Our products – Remarkably fresh.

• Our places – Interactive and creative.

• Our people – Create special memories.

• Our purpose – To celebrate what’s good in life.

Purpose:

As a Senior Site Reliability Engineer (SRE), you will be responsible for ensuring the resilience and reliability of our e-commerce applications through monitoring, automation, and proactive site maintenance. You will leverage Datadog, Azure Application Insights, and other industry-standard tools to develop robust monitoring systems that enhance site awareness, detect and respond to incidents, and maintain high availability. You will also drive collaboration across engineering teams to build a proactive approach to system health, site reliability, and incident management.

Responsibilities:

Develop, implement, and manage monitoring and alerting systems using Datadog, Azure Application Insights, and other related technologies to gain real-time awareness of system health and potential issues.
Ensure integration of Datadog with .NET, Node.js and React-based applications for comprehensive monitoring of application performance and health.
Establish proactive monitoring practices to reduce site outages, gain insight into system performance, and identify blockers within Azure DevOps pipelines.
Design and implement Standard Operating Procedures (SOPs) to effectively respond to and resolve incidents, minimizing downtime and ensuring prompt recovery.
Collaborate with engineering and product teams to establish and execute comprehensive incident response plans, focusing on improving the availability, performance, and reliability of e-commerce platforms.
Optimize Azure DevOps pipelines to ensure blockers, errors, and any build issues are proactively addressed, enhancing site deployment efficiency and reliability.
Maintain and improve application performance and resilience through enhancements in Azure App Services, Azure Front Door, and Azure Application Gateway.
Execute SQL queries to assess and troubleshoot database performance and availability issues related to the operational health of the site.
Work closely with developers to ensure that monitoring tools are embedded effectively into the development cycle and are aligned with the business needs.
Create detailed documentation, including SOPs, best practices, incident management guides, and monitoring configurations.
Stay current with emerging monitoring technologies and identify opportunities to apply them to enhance the platform's reliability and scalability.
Promote a culture of learning and proactive improvement through root cause analysis and post-incident reviews to prevent repeat occurrences.

Requirements:

5+ years of experience in Site Reliability Engineering, preferably within an e-commerce or high-traffic web application environment.
Strong expertise with Datadog, including setting up integrations, creating custom metrics, dashboards, and alerts, specifically in .NET, Node.js, and React applications.
Proven experience with Azure Application Insights, Azure DevOps, and the ability to implement monitoring and alerting solutions in cloud environments.
Hands-on experience managing and optimizing Azure App Services, Azure Front Door, Azure Application Gateway, and SQL databases from a resilience and performance standpoint.
Familiarity with SOP development for incident management, proactive monitoring, and site reliability.
Knowledge of CI/CD pipelines in Azure DevOps, and experience in identifying and resolving build blockers and pipeline issues.
Strong skills in writing SQL queries to diagnose and resolve issues.

Essential Competencies:

Excellent interpersonal skills, with an emphasis on collaboration, clear communication, and the ability to explain technical concepts to non-technical stakeholders.
Ability to work in a fast-paced environment, with strong analytical and problem-solving skills, and a proactive mindset towards automation and improvement.

What will set you apart:

Advanced certifications in Azure (e.g., Azure DevOps Engineer Expert, Azure Solutions Architect).
Extensive experience with high-traffic e-commerce applications and a track record of ensuring uptime and resilience.
Experience with other monitoring and observability tools (e.g., Grafana, Prometheus) is a plus.

What We Offer:

Onsite work environment with work-from-home flexibility, fostering collaboration and relationship building with peers, cross-functional partners and leadership.
The stability and resources of an industry-leading company successfully operating for 25 years, with the agility and innovation of a startup, allowing you to make a significant impact and shape our future.
Growth & Development – Each team member has a visible and immediate impact on the business, offering abundant opportunities for personal and professional growth as we scale in size and sophistication.
Benefits that include health/dental/vision insurance, a 401(k) plan, company-paid life insurance and short-term disability, flexible spending account options, and more.
Paid time off, including sick days & holidays to support work-life balance.

We are proud to be an EEO/AA employer. Applicants for employment are considered without regard to race, creed, color, religion, sex, sexual orientation, marital status, national origin, age, disability, status as a veteran or Vietnam Era Veteran, or membership in the Reserves or National Guard.