Principal SRE (£110-£115k)
VANRATH are pleased to be working with a very exciting company in the financial services space who are looking for a Principal SRE (£110-115k). This is one of the most interesting companies to come to NI over the past number of years because the product and engineering team will be run out of NI. This is not a company where the US gets all the glamorous work and you are the poor cousins picking up the scraps; you will be at the forefront of decision making and cutting-edge work moving forward. In addition, the company is run as a remote-first working environment, giving you money to set up a home office and the chance to really enjoy a great work/life balance.
The Engineering Team is looking for a highly motivated individual to assist us in growing our Site Reliability function. This individual will play a critical role in the design and development of our tooling, monitoring, control, self-service reporting, and analysis approach. Also, they will establish policies and procedures governing our incident, change, and problem management protocols.
While continuing to evolve the function, primary responsibilities will be monitoring and remediating systems, security, and network issues using various application and network management tools. The Engineer will also be responsible for interfacing with internal/external customers on operational issues by dispatching on-call engineers, facilitating communication, and driving resolution of events via standard operating procedures. Additional responsibilities include tracking escalations and other key performance indicators and providing application administration in a 24x7 environment based on root cause analysis of logs, alerts, and various other diagnostic tools. Teamwork will play a significant role in working effectively with peers to create a positive environment within the team. The ideal candidate will have excellent written and verbal communication skills along with the ability to multitask and facilitate the resolution of multiple incidents at any given time, including the vision to automate so as to better scale and streamline the process.
Duties and Responsibilities
- Architecting and developing solutions and roadmaps for monitoring of various systems that constitute the Payroc operating environment and leveraging such telemetry in an IT setting for alert response and troubleshooting.
- Work with Architecture, Security, Development, Systems Engineering, and Operations teams to develop innovative solutions to attain high availability, scalability, and reliability.
- Provide technical leadership and hands-on scripting, tooling, and automation for continuous operations.
- Detect incidents based on monitoring tools, notifications, and log files.
- Develop new and modify existing monitors as needed.
- Triage incidents and perform documented steps to resolve when a known error is identified.
- Log incidents within the Incident Tracking system, clearly documenting the symptoms others need to investigate the incident.
- Act as "incident owner," escalating to other support groups and following the status of the incident until it has been confirmed to be resolved.
- Work closely with technical support, security, engineers, customers, and other groups as needed to narrow investigative efforts and resolve incidents.
- Monitor running jobs for operational impact. Identify scheduled job failures.
- Maintain critical documentation assets, such as customer contact lists, escalation procedures, scheduled job inventories, and operational "run-books."
- Provide support via phone or pager on a scheduled basis as part of an on-call rotation.
Skills and Experience
- 10+ years in a senior technical role (DevOps, Software Engineering, Systems Engineering, or Support Engineering).
- Demonstrated experience designing, installing and configuring monitoring solutions - ideally for mission-critical, 24x7 environments.
- Solid understanding of monitoring fundamentals associated with SNMP, WMI, and Synthetic Transaction Engines, and experience with various commercial, open source, and homegrown monitoring packages and methods (e.g., Splunk, Nagios, Zabbix, OneSite, Gomez, CA, HP OpenView, etc.).
- Strong scripting skills with languages such as PowerShell or Python.
- Understanding of object-oriented languages such as C# or Java.
- Solid understanding of application-level monitoring tools and techniques, including OpenTracing, OpenTelemetry, and APM tools (e.g., Elastic, Datadog, New Relic, etc.).
- Solid understanding of networking, including network devices, subnets, and routing protocols; ability to take and interpret packet captures (Ethereal, etc.).
- Solid understanding of systems, including server hardware, Windows and Linux operating systems, iSCSI/FC SAN/NAS/DAS storage, Hypervisor/Virtualization (VMware, Hyper-V).
- Proficiency in AD/DNS/DHCP.
- Ability to independently implement, build, and test tools and significant features and capabilities, as well as work jointly with other team members on complex site issues.
- Outstanding written and verbal communication and interpersonal attributes.
- Strong technology aptitude and leadership.
- Clear and concise communication, both written and oral.
- Excellent analytical and troubleshooting skills.
- Ability to operate in a fast-paced, mission-critical environment.
The salary for this role is negotiable depending on the level of experience, but is around the £110-115k mark. The company also offers a great benefits package, along with expenses to help you set up your office environment at home.
To find out more information on the role advertised, please submit your CV or contact Matthew Evers at VANRATH in the strictest confidence.