Site Reliability Engineer - II
Lead team initiatives to continuously refine our AWS deployment practices for improved reliability, repeatability and security. These high-visibility initiatives will help to increase service levels, lower costs, and deliver features more quickly.
Design effective monitoring / alerting (for conditions such as application-errors, high memory usage) and log aggregation approaches (to quickly access logs for troubleshooting, or generate reports for trend analysis) to proactively notify business stakeholders of issues and communicate metrics, working closely with these stakeholders, using tools including AWS CloudWatch, Datadog, ClearData etc.
Write code and scripts to automate provisioning of AWS services and to configure services, using tools and languages including AWS CLI / API, Terraform, Ansible, Chef, Python, Bash, and Git.
Troubleshoot issues in production and other environments, applying debugging and problem-solving techniques (e.g., log analysis, non-invasive tests) , working closely with development and product teams.
Suggest deployment patterns & practices improvements based on learnings from past deployments and production issues; collaborate with DevOps team to implement these.
Promote a DevOps culture, including building relationships with other technical and business teams.
Work closely with InterOps to deploy and configure the platform to on-board clinics.