Akuity

Senior Site Reliability Engineer

Posted 2 Days Ago

In-Office or Remote

Hiring Remotely in Time, IL

Senior level

As a Senior Site Reliability Engineer, you will enhance platform reliability, own SLIs/SLOs, design observability systems, and lead incident response and improvements.

About Akuity

With the move to the cloud, Kubernetes has become widely adopted by DevOps and Platform Engineering teams, but it has also added complexity. While scaling Kubernetes at Intuit, the Akuity founders started building Argo CD to streamline Kubernetes adoption. Argo CD helps developers own, understand, and deploy their K8s applications via GitOps.

Today, Argo CD is the third most popular project in the CNCF (Cloud Native Computing Foundation) and is used by 70% of companies that run Kubernetes in production. Argo CD users include Intuit, BlackRock, Tesla, Major League Baseball, Peloton, and many more.

The team founded Akuity in 2021 to enable enterprises to ship software faster and more reliably with modern GitOps best practices. The Akuity Platform enables teams to manage development and deployment across hundreds, if not thousands, of Kubernetes clusters from a single control plane. Trusted by top companies around the globe, the Akuity Platform is the only end-to-end GitOps platform for enterprises.

Our mission is to simplify the software delivery process so that DevOps and Platform Engineering teams can move fast and deploy code effortlessly, without the fear of breaking things.

The Role

We are looking for a Senior SRE to help us keep the Akuity platform running at the level our enterprise customers expect. This is a high-ownership role; you won't just respond to incidents, you'll shape how we define and defend reliability across the entire platform. You'll work closely with engineering, infrastructure, and product to build the systems and culture that let us scale with confidence.

What You'll Own

Platform Reliability & SLAs

Own SLI/SLO/SLA definitions for the Akuity SaaS platform and drive continuous improvement against them
Design, instrument, and maintain observability systems (metrics, logs, traces) across multi-region AWS infrastructure
Identify reliability gaps, lead blameless post-mortems, and close the loop with permanent fixes
Partner with engineering teams to build reliability into new features before they ship to production

On-Call & Incident Response

Participate in an on-call rotation and act as incident commander for high-severity production events
Build and maintain runbooks, escalation paths, and incident playbooks that keep mean time to resolution low
Drive improvements to alerting fidelity: reduce noise, increase signal, eliminate toil
Lead post-incident reviews with clear timelines, root cause analysis, and follow-through on action items

What We're Looking For

Required

5+ years of SRE, platform engineering, or production operations experience in a SaaS environment
Deep hands-on Kubernetes expertise; you understand the scheduler, networking, storage, and autoscaling at a level where you can debug anything
Strong AWS fundamentals across compute (EC2, EKS), networking (VPC, NLB, Route53), storage (S3, RDS), and IAM
Experience defining and operating against SLOs in production; you've written error budgets, not just read about them
Proficiency with observability tooling (Prometheus, Grafana, OpenTelemetry, Datadog, or equivalent)
Solid scripting and automation skills in Go, Python, Bash, or similar; you automate what you touch
Strong written communication: clear runbooks, sharp incident reports, thoughtful post-mortems
Live within US time zones (Pacific through Eastern), including Canada and other regions

Strong Advantage

Experience with Argo CD, Kargo, or GitOps-based delivery workflows
Familiarity with multi-region, multi-cluster Kubernetes deployments
Experience with compliance-adjacent infrastructure (SOC 2, ISO 27001, HIPAA, or PCI DSS)
Background operating infrastructure for other platform or developer tooling companies

Our Stack

Kubernetes (EKS): multi-region, enterprise-grade clusters serving Argo CD and Kargo workloads
AWS: primary cloud provider across all production and DR environments
Argo CD & Kargo: GitOps delivery tools we build and run ourselves
Prometheus, Grafana, and OpenTelemetry for observability
Terraform and GitOps-driven infrastructure management

What We Offer

Competitive compensation, commensurate with experience
Equity participation in a well-funded, growing company
Fully remote: work from anywhere within US time zones (Pacific through Eastern), including Canada and other regions
Home office stipend and equipment budget
Flexible time off and a culture that respects it
Work directly with the engineers who built Argo CD and Kargo; you'll learn a lot here

US-based employees receive full benefits, including comprehensive health, dental, and vision coverage. Candidates based outside the US will be engaged as contractors.

Top Skills

Argo CD

AWS

Bash

Grafana

Kubernetes

OpenTelemetry

Prometheus

Python

Terraform
