Lambda

Staff Software Engineer - Managed Kubernetes

Hybrid
Bellevue, WA, USA
314K-465K Annually
Senior level
Join Lambda as a Staff Engineer to lead the development of a Managed Kubernetes platform for AI workloads, focusing on scalable orchestration, integration with NVIDIA tools, and technical leadership across infrastructure teams.
The summary above was generated by AI

Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serving tens of thousands of customers. Our customers range from AI researchers to enterprises and hyperscalers. Lambda's mission is to make compute as ubiquitous as electricity and give everyone the power of superintelligence. One person, one GPU.

If you'd like to build the world's best AI cloud, join us.

 

*Note: This position requires presence in our San Francisco, San Jose, or Bellevue office location 4 days per week; Lambda’s designated work from home day is currently Tuesday.

 

About the Role

Lambda is building the AI Cloud of the future. We are seeking a Staff Engineer to help drive the development of our Managed Kubernetes platform. Think GKE, but purpose-built for AI workloads and running on bare metal. This is a foundational technical leadership role where you will shape the infrastructure that powers the next generation of AI training and inference at scale.

As a Staff Engineer on our Orchestration team, you will collaborate to help drive the technical vision for Lambda's managed orchestration services, including Managed Kubernetes, Managed Slurm on Kubernetes, and higher-level platform services for inference and AIOps. You'll work at the intersection of distributed systems, GPU-accelerated computing, and Cloud Native infrastructure to build systems that are reliable, performant, and elegantly simple for our customers.

This is not a role for someone who just operates Kubernetes; it is a technical leadership role for an engineer who has synthesized the core domains of infrastructure (compute, network, storage, security) and can design holistic solutions across all of them. You'll be working closely with NVIDIA's open-source ecosystem, and partnering with internal teams across the stack to deliver a world-class managed platform.

 

What You'll Do:

Product Engineering

  • Drive technical vision for Lambda's Managed Kubernetes bare-metal platform, including control plane scalability, multi-tenancy, cluster lifecycle management, and high availability

  • Integrate and extend NVIDIA's open-source ecosystem: GPU Operator, Network Operator, DCGM, NCCL, and emerging projects like AICR and Topograph for topology-aware scheduling and placement

  • Design GPU-aware orchestration systems

  • Lead development of services that power our managed services

  • Inform and help shape networking solutions for AI workloads: CNI integration (Cilium, Multus), high-performance fabrics (InfiniBand, RoCE), RDMA, and GPUDirect. You will work closely with our Network team to define and drive requirements

  • Inform and help with storage architecture requirements for AI workloads. You will partner with Storage teams on what managed K8s, Slurm, and future services need

  • Build the foundation for Managed Slurm on Kubernetes, enabling traditional HPC workloads to run seamlessly alongside Kubernetes workloads

  • Design higher-level platform services for inference, including model serving infrastructure, autoscaling based on inference load, and multi-model deployment patterns

  • Design self-healing systems and automation for incident response, root cause analysis, and platform resilience

  • Lead chaos engineering efforts to validate system behavior under failure conditions at scale

  • Establish operational excellence for a managed service: upgrade automation, security patching, and zero-downtime maintenance

Cross-Functional Infrastructure Leadership

  • Serve as the technical bridge between Orchestration and other infrastructure teams (Network, Storage, Security), translating platform requirements into actionable specifications

  • Drive infrastructure-wide decisions that enable successful managed services. You’re someone who understands what's needed end-to-end, not just at the Kubernetes layer.

  • Provide input on bare-metal provisioning, network topology, and storage systems to ensure they meet the needs of the managed services being built by the Orchestration organization

  • Champion consistency and standardization across Lambda's infrastructure stack

  • Work directly with customers and internal teams to understand existing deployments and chart a path to the managed platform

Technical Leadership

  • Set technical direction for Kubernetes services across the Orchestration team, influencing roadmap and prioritization

  • Drive reviews and design sessions, ensuring we build systems that are scalable, maintainable, and aligned with customer needs

  • Mentor and grow engineers, establishing best practices for Kubernetes development, distributed systems, and Cloud Native engineering

  • Collaborate cross-functionally with Network, Storage, Security, and Customer Success teams

  • Engage with NVIDIA and the open-source community to stay current on GPU orchestration technologies and contribute back where appropriate

  • Represent Lambda externally through technical blog posts, conference talks, and strategic customer engagements

  • Shape our AIOps vision: design intelligent systems for automated capacity planning, anomaly detection, and predictive maintenance of cloud infrastructure

Who You Are

You are a creative, innovative engineer who operates at high velocity. You don't just solve problems. You find elegant solutions and ship them quickly. You embrace modern tools and AI-assisted development (like Claude Code) to accelerate your productivity and multiply your impact. You're energized by building new things, not maintaining the status quo.

Required Qualifications

  • 10+ years of experience in software engineering, platform engineering, or SRE, with at least 5 years focused on Kubernetes at scale

  • Expert-level understanding of Kubernetes internals: API machinery, controllers, schedulers, operators, CRDs, CSI, CNI, and the extension patterns that make Kubernetes powerful

  • Holistic infrastructure expertise: you've synthesized knowledge across compute, networking, storage, and security, not just Kubernetes in isolation. You can build solutions that span the full stack.

  • Strong software engineering skills in Go (required) and Python; you write production-quality code, not just scripts

  • Deep experience with GPU orchestration in Kubernetes: NVIDIA GPU Operator, device plugins, DCGM, MIG, time-slicing, and GPU-aware scheduling. Familiarity with NVIDIA Network Operator and GPUDirect is strongly preferred.

  • Proven track record of technical leadership: driving design decisions across teams, mentoring engineers, and influencing infrastructure direction beyond your immediate scope

  • Deep experience designing and operating managed services or multi-tenant platforms. You understand what it takes to run infrastructure for external customers

  • Strong understanding of distributed systems principles: consensus, fault tolerance, consistency models, and graceful degradation

  • Experience with observability at scale: Prometheus, Grafana, distributed tracing, and building actionable alerting systems

  • Solid knowledge of Linux systems and networking (L2-L7), including high-performance networking concepts (RDMA, InfiniBand, RoCE)

  • Experience with infrastructure-as-code and GitOps workflows

Preferred Qualifications

  • Experience building and operating managed Kubernetes services (GKE, EKS, AKS, or similar) or working on Kubernetes control plane components

  • Hands-on experience with NVIDIA's open-source ecosystem beyond GPU Operator: Network Operator, NCCL tuning, Topograph, AICR, or similar emerging projects

  • Familiarity with HPC and traditional job schedulers (Slurm) and Kubernetes-native batch scheduling (KAI, Volcano, Kueue)

  • Background in confidential computing

  • Experience migrating customers or workloads from legacy/bespoke infrastructure to standardized platforms

  • Contributions to CNCF projects, Kubernetes SIGs, or NVIDIA open-source projects

  • Familiarity with security and compliance in multi-tenant environments: RBAC, Pod Security Standards, network policies, workload isolation

  • Background in ML infrastructure: training clusters, inference serving, simulation

Why Lambda

Lambda is building the essential infrastructure for the AI era. We're not just another cloud provider: we're a company founded by ML practitioners, for ML practitioners. Our customers include leading AI research labs and enterprises pushing the boundaries of what's possible with artificial intelligence.

What makes this role special:

  • You'll be building core platform services the world’s largest AI companies will consume

  • NVIDIA partnership: Deep integration with NVIDIA's GPU and networking stack, working with cutting-edge open-source tooling

  • Real technical challenges: Massive scale GPU clusters and the unique demands of AI workloads

  • Cross-stack influence: Shape not just Kubernetes, but the network, storage, and compute infrastructure that supports it

  • Direct impact: Your work enables AI breakthroughs. Every model trained on Lambda benefits from systems you build

  • World-class team: Work alongside engineers with deep expertise in ML, systems, and infrastructure

Salary Range Information

The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.

 

About Lambda

  • Founded in 2012, with 500+ employees, and growing fast

  • Our investors notably include TWG Global, US Innovative Technology Fund (USIT), Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, Gradient Ventures, Mercato Partners, SVB, 1517, and Crescent Cove

  • We have research papers accepted at top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG

  • Our values are publicly available: https://lambda.ai/careers

  • We offer generous cash & equity compensation

  • Health, dental, and vision coverage for you and your dependents

  • Wellness and commuter stipends for select roles

  • 401k Plan with 2% company match (USA employees)

  • Flexible paid time off plan that we all actually use

A Final Note:

You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.

Equal Opportunity Employer

Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
