Site Reliability Engineer

Job description

SRE/Engineering Manager (135k + 15% bonus)

VANRATH is pleased to be working with a global software company that continues to grow after a recent acquisition and is looking for an SRE/Engineering Manager (135k + 15% bonus). The company is a market leader in its space (Gartner) and is expanding across multiple areas within the AI tech space. The company has a remote-first approach, so the person can be based anywhere in the UK. It has a great atmosphere and is building a really strong team.

The Role
They are looking to build a new team for their archive platform, which is the company's largest product, with over 7,000 customers. They want to expand it further across Europe and need a new team to provide coverage in that time zone. They have around 30 engineers in the US and are looking to hire around 35 people this year. They are building a 24/7 operation across the different time zones. They want someone who is passionate about people, processes, and building a culture. They want someone from an engineering background who could jump in if really needed. They have the largest SQL database in production outside Microsoft, which has been featured in use cases.

Manage day-to-day operations of our SaaS on-prem platform using a data-driven approach to ensure health and performance remain within our SLAs and SLOs while scaling our data center, public, and private cloud infrastructure.
Manage our incident response and escalation processes, including stakeholder communication, demonstrating continuous improvement as we incorporate feedback into the processes. Demonstrate full accountability and ownership for platform disruptions and manage incidents through to complete resolution.
Adopt SRE/DevOps best practices and apply them to critical initiatives and transformation activities across teams; act as a strategic thought leader in this space for the broader engineering organization and coach the team to develop their skillsets through knowledge sharing, documenting, and acting as a role model for behaviors and attitude.
Lead a large, globally distributed, diverse team to provide around-the-clock coverage for the platform.
Present to both internal and external audiences on a myriad of topics including roadmap, platform health (key metrics), and all aspects of incidents.
Lead the forecasting, planning, and execution around creating greater scalability, availability, and reliability of the platform, taking into consideration security and observability, with a willingness to get hands-on at times and lead by doing.
Show an unrelenting drive to increase observability and alerting throughout the platform while guiding the engineering teams toward a more automated, autoscaling, and self-healing architecture.
Lead and manage a growing team of Site Reliability and DevOps engineers, fostering a strong team culture, continuously monitoring morale, and consistently delivering quality outcomes.
Champion the move towards Agile/Scrum for the SRE and DevOps organization, creating greater visibility into the day-to-day activities of the team.

The Person:

- Experience leading engineering teams
- Experience working in cloud native platforms and PaaS
- Experience working with containers and container orchestration platforms
- Experience with declarative IaC frameworks: BOSH, Terraform, Puppet/Chef/Ansible/Salt
- Experience working inside modern observability platforms like:
  - Centralized logging (ELK, Splunk)
  - APM (AppDynamics, Dynatrace, New Relic)
  - Platform telemetry (Datadog, Nagios)
- Experience delivering with CI/CD pipelines (Concourse, Bamboo, Jenkins)
- Strong Linux experience
- Experience with full stack engineering from Java and/or .NET front end services to backend storage systems in both SQL and NoSQL contexts
- Strong experience with high scale data streaming (Kafka) and PB scale data stores (Mongo, Ceph)
- Strong experience in public and private cloud contexts (API driven infra)
- Experience with search platforms like Elastic and Solr at high scale
- Strong coding experience in any of the following: Java, Python, Ruby, Go
- Cloud Foundry experience a plus
