
HPC System Engineer

extra holidays - extra parental leave
Remote: Full Remote

Offer summary

Qualifications:

Bachelor's or Master's degree in Computer Science, Computer Engineering, or a related field. Extensive experience in designing high-performance computing systems.

Key responsibilities:

  • Design and deploy a GPU-based HPC cluster with industry-standard components
  • Optimize cluster for AI workloads, manage cluster software and network performance
  • Troubleshoot and resolve cluster-related issues
Jan Startup https://jan.ai/
11 - 50 Employees

Job description

Homebrew is an AI R&D Lab. We train our own models and are the creators and maintainers of popular open-source AI tools:

  • Jan: Desktop Copilot (>1 million downloads)
  • Cortex: Local, open-source alternative to OpenAI Platform
  • Menlo: GPU Training Cluster

We are a fully remote company. In the long term, our objective is to train useful, safe AI that helps improve humanity.

 

Job Description

We are seeking an experienced HPC Engineer to design, deploy, and maintain a high-performance computing (HPC) cluster for our AI training workloads. The successful candidate will be responsible for setting up a GPU-based training cluster together with our Research team and ensuring that it works well with our model training algorithms.

 

Key Responsibilities:

  • Design and deploy a GPU-based HPC cluster using industry-standard components (e.g., NVIDIA DGX/HGX, or similar), including the design of nodes (e.g., NVLink, SXM).
  • Configure and optimize the cluster for high-performance computing, focusing on AI workloads (e.g., PyTorch, Torch or similar).
  • Implement and manage cluster management software (e.g., Kubeflow, Slurm or similar).
  • Design the cluster network for high-bandwidth, low-latency performance (InfiniBand, Ethernet RDMA, and/or RoCE), using scalable and efficient network topologies (Fat Tree, Dragonfly, and/or Torus).
  • Troubleshoot and resolve issues related to cluster performance, hardware failures, and software glitches.

 

Requirements

  • Bachelor's or Master's degree in Computer Science, Computer Engineering, or a related field
  • Extensive experience in designing, assembling, and configuring high-performance computing systems
  • Proficient in selecting and integrating HPC hardware components, including CPUs, GPUs, memory, storage, and interconnects
  • Strong knowledge of HPC software stacks, including operating systems, drivers, and specialized applications
  • Experience in designing and operating AI training clusters, including the selection and integration of the necessary hardware and software components
  • Expertise in conducting comprehensive benchmarking tests and analyzing performance data
  • [Plus] Strong networking knowledge, including experience with high-speed interconnects and RDMA technologies such as InfiniBand and RoCE
  • [Plus] Experience with setting up and managing NVIDIA multi-node training clusters for machine learning applications

 

Benefits

  • We pay an “all-in” salary, and you cover your own insurance/medical costs from that amount.
  • 14 days leave (and unlimited sick days)
  • Annual equipment budget (once the 2-month probation has been completed)

Required profile

Spoken language(s):
English
