Offer summary

Qualifications:

Bachelor's or Master's degree in Computer Science, Computer Engineering, or related field, Extensive experience in designing high-performance computing systems.

Key responsabilities:

Design and deploy GPU-based HPC cluster with industry-standard components

Optimize cluster for AI workloads, manage cluster software and network performance

Troubleshoot and resolve cluster-related issues

Job description

Homebrew is an AI R&D Lab. We train our own models, are the creators and maintainers of popular open-source AI tools:

Jan: Desktop Copilot (>1 million downloads)
Cortex: Local, open-source alternative to OpenAI Platform
Menlo: GPU Training Cluster

We are a fully remote company. In the long term, our objective is to train useful, safe AI that helps improve humanity.

Job Description

We are seeking an experienced HPC Engineer to design, deploy, and maintain a high-performance computing (HPC) cluster for our AI training workloads. The successful candidate will be responsible for setting up a GPU-based training cluster together with our Research team, and ensuring that works well with our Model Training Algorithms.

Key Responsibilities:

Design and deploy a GPU-based HPC cluster using industry-standard components (e.g., NVIDIA DGX/HGX, or similar), including the design of nodes (e.g. NVLink, SXM)
Configure and optimize the cluster for high-performance computing, focusing on AI workloads (e.g., PyTorch, Torch or similar).
Implement and manage cluster management software (e.g., Kubeflow, Slurm or similar).
Design cluster for high-bandwidth, low-latency network performance in GPU clusters (InfiniBand, Ethernet RDMA, and/or RoCE), using scalable and efficient network topologies (Fat Tree, Dragonfly, and/or Torus)
Troubleshoot and resolve issues related to cluster performance, hardware failures, and software glitches.

Requirement

Bachelor's or Master's degree in Computer Science, Computer Engineering, or a related field
Extensive experience in designing, assembling, and configuring high-performance computing systems
Proficient in selecting and integrating HPC hardware components, including CPUs, GPUs, memory, storage, and interconnects
Strong knowledge of HPC software stacks, including operating systems, drivers, and specialized applications
Experience in designing and operating AI training clusters, including the selection and integration of the necessary hardware and software components
Expertise in conducting comprehensive benchmarking tests and analyzing performance data
[Plus] Strong networking knowledge, including experience with high-speed interconnects such as Infiniband, RoCE Ethernet, and RDMA
[Plus] Experience with setting up and managing Nvidia multi-node training clusters for machine learning applications

Benefits

We pay an “all-in” pay and you will cover your own insurance/medical from the amount.
14 days leave (and unlimited sick days)
Annual equipment budget (once 2 month probation has been completed)

Required profile