Overview:
This is an incredible opportunity to be part of a company that has been at the forefront of AI and high-performance data storage innovation for over two decades. DataDirect Networks (DDN) is a global market leader renowned for powering many of the world's most demanding AI data centers, in industries ranging from life sciences and healthcare to financial services, autonomous cars, government, academia, research, and manufacturing.
"DDN's A3I solutions are transforming the landscape of AI infrastructure." – IDC
“The real differentiator is DDN. I never hesitate to recommend DDN. DDN is the de facto name for AI Storage in high performance environments.” – Marc Hamilton, VP, Solutions Architecture & Engineering | NVIDIA
DDN is the global leader in AI and multi-cloud data management at scale. Our cutting-edge data intelligence platform is designed to accelerate AI workloads, enabling organizations to extract maximum value from their data. With a proven track record of performance, reliability, and scalability, DDN empowers businesses to tackle the most challenging AI and data-intensive workloads with confidence.
Our success is driven by our unwavering commitment to innovation, customer-centricity, and a team of passionate professionals who bring their expertise and dedication to every project. This role is a chance for a driven professional to make a lasting impact at a company that is shaping the future of AI, data management, and data storage.
Job Description:
As the Lustre and EXA Engineering DevOps team, we are passionate about innovation and automation. We are looking for a Sr. DevOps Engineer with demonstrated experience in improving and scaling infrastructure as well as automating workflows for build, test and deployment stages. We prioritize reliability, scalability, visibility and security to create an ecosystem that empowers our engineering teams to build and deliver faster. Your work will directly support the development of storage solutions that are making an impact across many industries and AI applications.
Responsibilities for this role include but are not limited to:
- Maintain and support infrastructure consisting of bare-metal servers and VMs, ensuring seamless functionality and performance.
- Maintain computer networks including switches, VPNs, routers and other physical hardware.
- Manage and optimize a suite of tools and applications for build, artifact hosting, testing, and reporting.
- Automate configuration and provisioning of infrastructure for development, build, and test environments.
- Build and maintain CI/CD pipelines, streamlining delivery in multiple environments from build through deployment.
- Develop solutions for log analysis and reporting.
- Create and deploy RPM and DEB packages, enabling consistent and efficient software distribution.
- Automate the provisioning of bare-metal servers, VMs, and containers for testing.
- Troubleshoot and resolve complex issues with infrastructure, builds, pipelines, and deployments, focusing on stability and throughput.
- Develop custom command-line tools to simplify infrastructure management and empower engineering teams.
- Respond promptly to engineering requests and resolve time-sensitive issues as they arise.
Required Skills & Experience:
- Bachelor's degree in CS or a related technical field, with 7+ years of relevant industry experience.
- Deep proficiency with Linux systems and command-line tools.
- Strong scripting and automation skills with Bash, Python, or Ruby.
- Programming experience with Go or Rust (Rust preferred).
- Experience with build automation (Make, CMake) and dependency management.
- Proficiency in creating RPM and DEB package specifications.
- Strong understanding of version control best practices (Git).
- Experience with developing and maintaining CI pipelines with GitHub Actions.
- Experience with PXE booting and tools such as Cobbler and Foreman.
- Experience with Infrastructure as Code (IaC) tools such as Terraform or Pulumi.
- Experience with configuration management tools (Chef or Ansible preferred).
- Experience with artifact repository management tools such as Artifactory or Nexus.
- Experience with monitoring tools such as Zabbix or Prometheus.
- Experience with log and data analysis and reporting tools such as Splunk, ELK stack, Grafana, etc.
- Security-focused mindset, with a deep understanding of security best practices for infrastructure, package management, and reporting.
- Understanding of Agile methodologies and the unique considerations for DevOps teams within Agile frameworks.
- Strong communication skills with the ability to convey technical information clearly and concisely to a variety of stakeholders.
Experience with the Lustre filesystem and InfiniBand networking is a plus.
DDN:
Our team is highly motivated and focused on engineering excellence.
We look for individuals who appreciate challenging themselves and thrive on curiosity.
Engineers are encouraged to work across multiple areas of the company.
We operate with a flat organizational structure.
All employees are expected to be hands-on and to contribute directly to the company’s mission.
Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important.
All engineers and researchers are expected to have strong communication skills.
They should be able to concisely and accurately share knowledge with their teammates.
Interview Process:
After submitting your application, the team reviews your CV and statement of exceptional work. If your application passes this stage, you will be invited to a 30-minute interview (“phone interview”) during which a member of our team will ask some basic questions. If you clear the initial phone interview, you will enter the main process, which consists of four technical interviews followed by a meet and greet:
- Coding assessment in a language of your choice.
- Systems design: Translate high-level requirements into a scalable, fault-tolerant service.
- Systems hands-on: Demonstrate practical skills in a live problem-solving session.
- Project deep-dive: Present your past exceptional work to a small audience.
- Meet and greet with the wider team.
Our goal is to finish the main process within one week.
We don’t rely on recruiters for assessments.
Every application is reviewed by a member of our technical team.
DataDirect Networks, Inc. is an Equal Opportunity/Affirmative Action employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, gender, gender identity, gender expression, transgender, sex stereotyping, sexual orientation, national origin, disability, protected Veteran Status, or any other characteristic protected by applicable federal, state, or local law.