Chess.com

Site Reliability Engineer

Posted 23 Days Ago
Remote
Hiring Remotely in USA
Senior level
The Site Reliability Engineer will manage infrastructure stability and scalability, lead cloud migrations, and optimize performance across systems while mentoring team members.
The summary above was generated by AI

About Us

Chess.com is one of the largest gaming sites in the world and the #1 platform for playing, learning, and enjoying chess.


We are a team of 600+ fully remote people in 60+ countries working hard to serve the global chess community. We support 250M+ chess players worldwide with the best possible product, content, and tools.


We are a tech company. A gaming company. A content company. And we do it all with passion and commitment to the game. Above all, we prize our mission-driven, flat, life-celebrating, no-corporate culture, and we look forward to meeting you and learning more about what you can bring to the team.



About the role

The Site Reliability Engineer will play a critical role in ensuring the stability, performance, and scalability of our global gaming platform infrastructure. This position exists to bridge the gap between development and operations, maintaining high availability for millions of concurrent users while supporting rapid feature development and deployment. The SRE will be instrumental in building resilient systems that can handle massive scale across multiple regions, directly impacting user experience and platform reliability.


As our platform continues to grow and serve a global community, this role will drive the technical infrastructure decisions that enable seamless gaming experiences. The position requires both deep technical expertise and collaborative leadership to work across engineering teams, ensuring our systems can scale efficiently while maintaining the performance standards our users expect.



What you'll do

  • Design and implement multi-regional resilient infrastructure capable of handling millions of concurrent sessions and transactions daily across global data centers
  • Lead the hybrid cloud migration strategy, integrating bare-metal datacenter resources with cloud services for optimal performance and cost efficiency
  • Own the on-call rotation and incident response procedures, ensuring rapid resolution of critical system issues and maintaining high availability SLAs
  • Architect monitoring and alerting systems using industry-standard tools to proactively identify and resolve performance bottlenecks before they impact users
  • Collaborate with development teams to implement infrastructure-as-code practices and establish deployment pipelines that support continuous integration and delivery
  • Optimize system performance through capacity planning, load testing, and resource allocation across distributed computing environments
  • Establish and maintain security protocols and risk assessment procedures for infrastructure components and data protection
  • Partner with engineering teams to design scalable solutions for high-traffic applications and real-time processing requirements
  • Drive automation initiatives to reduce manual operational overhead and improve system reliability through scripting and configuration management
  • Mentor team members on SRE best practices and contribute to the development of infrastructure standards and documentation



Preferred Skills

  • Bachelor's degree in Computer Science, Engineering, or related technical field, or equivalent practical experience
  • 5+ years of experience in site reliability engineering, DevOps, or infrastructure engineering roles
  • Experience managing bare-metal server infrastructure and datacenter operations
  • Strong proficiency with UNIX/Linux operating systems and command-line administration
  • Experience with cloud platforms (GCP, AWS, or Azure) and infrastructure-as-code tools (Terraform, CloudFormation, or similar)
  • Hands-on experience with configuration management systems (Ansible, Chef, Puppet, or similar)
  • Solid understanding of networking fundamentals, protocols (TCP/IP, HTTP/HTTPS, DNS), and network troubleshooting
  • Experience with containerization and orchestration technologies (Docker, Kubernetes, or similar)
  • Proficiency with monitoring and observability tools (Datadog, Prometheus, Grafana, ELK stack, or similar)
  • Experience with relational and NoSQL databases, including performance optimization and scaling strategies
  • Strong collaboration and communication skills for working effectively in a distributed team environment
  • Demonstrated sense of ownership and accountability for system reliability and performance



Nice to have

  • Advanced knowledge of content delivery networks (CDNs) and edge computing
  • Experience with server-side automation and scripting languages (Python, Go, Bash, or similar)
  • Background in high-availability architectures and disaster recovery planning
  • Familiarity with security frameworks and compliance requirements
  • Experience with game server infrastructure or real-time application hosting
  • Knowledge of database administration and optimization for high-concurrency applications
  • Understanding of CI/CD pipelines and deployment automation
  • Experience with capacity planning and performance testing tools
  • Previous experience in a fully remote, distributed work environment
  • Continuous learning mindset with interest in emerging infrastructure technologies



About the Opportunity

  • This is a full-time opportunity
  • We are 100% remote (work from anywhere!)

---

You can learn more about us here:

  • https://www.chess.com/article/view/how-chess-com-virtual-team-works-together
  • https://www.chess.com/about
