Genmo

Senior Site Reliability Engineer — GPU Infrastructure

Posted 22 days ago; reposted 22 days ago
In-Office or Remote
Hiring Remotely in San Francisco, CA
Senior level
As a Senior Site Reliability Engineer, lead the operation of GPU clusters, manage Kubernetes, and implement Infrastructure-as-Code practices while optimizing performance and ensuring reliability.
The summary above was generated by AI

We are Genmo, a research lab dedicated to building open, state-of-the-art models for video generation towards unlocking the right brain of AGI. Join us in shaping the future of AI and pushing the boundaries of what's possible in video generation.

What You’ll Do
  • Own the design and day‑to‑day operation of GPU clusters that train and serve frontier generative models.

  • Lead production Kubernetes operations: GPU scheduling, cluster upgrades, multi‑cluster federation.

  • Define and implement Infrastructure‑as‑Code (Terraform, Helm, Ansible) and GitOps workflows with Argo CD or Flux.

  • Build CI/CD pipelines, automated testing, and rollout strategies for infra changes.

  • Develop an observability stack (Prometheus, Grafana, OpenTelemetry, eBPF) plus GPU telemetry with NVIDIA DCGM.

  • Optimize high‑performance networking (InfiniBand/RDMA) and debug performance bottlenecks.

  • Run and continuously improve the 24×7 on‑call rotation; lead post‑incident reviews.

  • Partner with researchers and engineers, communicate crisply, and ship with a high‑ownership mindset.

Minimum Qualifications
  • BS/MS/PhD in CS, EE, or related field.

  • 3+ years of SRE/DevOps experience in production; 2+ years managing large Kubernetes fleets.

  • Expert‑level Kubernetes experience.

  • Proficient in Python and Bash and IaC tools (Terraform, Helm, Ansible).

  • Track record of shipping and operating large‑scale infrastructure with high reliability and clear communication.

Nice to Have
  • Multi‑cluster / multi‑cloud (AWS, GCP, Azure, bare‑metal) production experience.

  • Hands‑on experience with containerized GPU stacks (nvidia‑container‑toolkit, GPU Operator).

  • GPU schedulers such as Slurm or Kueue.

  • Familiarity with CI/CD tooling (GitHub Actions, BuildKit).

  • Prior work with distributed training, model‑serving patterns, or other ML/GPU workloads.

Machine‑learning depth is a plus—not a prerequisite. We’ll help you level up if needed.

Genmo is an Equal Opportunity Employer. Candidates are evaluated without regard to age, race, color, religion, sex, disability, national origin, sexual orientation, veteran status, or any other characteristic protected by federal or state law. Genmo, Inc. is an E-Verify company and you may review the Notice of E-Verify Participation and the Right to Work posters in English and Spanish.

Top Skills

Ansible
Argo CD
Bash
CI/CD
eBPF
Flux
Grafana
Helm
InfiniBand
Kubernetes
NVIDIA DCGM
OpenTelemetry
Prometheus
Python
RDMA
Terraform
