We are seeking a Business Operations Site Reliability Engineer (SRE).
As a Business Operations SRE, you are responsible for ensuring that our platform is stable and healthy. We break down barriers to running our products by fostering developer run ownership and empowering developers to build resilient products. We support our developers during the application build phase in software run principles that include operational design, automation, capacity planning, and monitoring, leading to fault-tolerant, scalable products. We see the big picture and help create and enforce operations standards while facilitating an agile and learning culture.
We support daily operations with a hyper focus on triage and root cause analysis, understanding the business impact of our products and subsequently performing blameless post-mortems. The goal of every Business Operations team is to engage early in the development lifecycle, and to proactively manage production and change activities to maximize customer experience and increase the overall value of supported applications. Business Operations teams also focus on risk management by tying all our activities together with an overarching responsibility for compliance and risk mitigation across all our environments.
Ultimately, the role of Business Operations is to align Product and Customer Focused priorities with Operational needs by providing continuous feedback throughout the lifecycle.
Team Specific Skills:
It is not expected that any single candidate will have expertise across all these areas, but a Biz Ops engineer will spend some time throughout their career on each of these aspects of the role:
Site Reliability Engineering:
o Serve as the primary contact responsible for ensuring application scalability, performance, and resilience.
o Practice sustainable incident response and blameless post-mortems while taking a holistic approach to problem solving and optimizing time to recover.
o Automate data-driven alerts to proactively escalate issues. Work with development teams to establish SLOs and improve reliability.
DevOps/Automation:
o Tackle complex development, automation, and business process problems. Engage in and improve the whole lifecycle of services from inception and design, through deployment, operation, and refinement.
o Support the application CI/CD pipeline for promoting software into higher environments through validation and operational gating, and lead in DevOps automation and best practices.
o Increase automation and tooling to reduce toil and manual intervention.
ITSM Practices:
o Analyse ITSM activities of the platform and provide a feedback loop to development teams on operational gaps or resiliency concerns.
Role Qualifications:
The ideal candidate will have experience in many of these areas:
BS degree in Computer Science or related technical field involving coding (e.g., physics or mathematics), or equivalent practical experience.
Coding or scripting exposure.
Appetite for change and pushing the boundaries of what can be done with automation. Be curious about new technology, infrastructure, and practices to scale our architecture and prepare for future growth.
Experience with algorithms, data structures, scripting, pipeline management, and software design.
Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.
Interest in designing, analysing, and troubleshooting large-scale distributed systems.
Willingness and ability to learn, to take on challenging opportunities, and to work as a member of a matrix-based, diverse, and geographically distributed project team.
Ability to balance doing things right with fixing things quickly. Flexible and pragmatic, while working towards improving the long-term health of the system.
Comfortable collaborating with cross-functional teams to ensure that expected system behaviour is understood and monitoring exists to detect anomalies.
Preferred Qualifications:
Coding experience in one or more of the following: C++, Java, Python, Go.
Experience in working across development, operations, and product teams to prioritize needs and to build relationships is a must.
Experience in an SRE role or related field.
Background in cloud-native tooling and orchestration technologies (Kubernetes preferred).
Experience with monitoring tools such as Splunk and Dynatrace.
Experience with Java, J2EE, web services (SOAP/REST), and Spring/Spring Boot is a plus.
Experience in production support environments and ITIL processes.
Experience with industry-standard CI/CD tools such as Git/Bitbucket, Jenkins, Maven, Artifactory, Groovy, and Chef. Experience designing and implementing an effective and efficient CI/CD flow that gets code from dev to prod with high quality and minimal manual effort is required.
Experience developing and maintaining cloud solutions on Azure, GCP, or AWS in accordance with best practices.
Understanding of:
o Client-server relationships
o Network concepts (Layer 1 to Layer 3)
o Stack trace analysis (TCP dumps, heap dumps, CPU/memory analysis, thread dumps).
o Load balancers and application firewalls.
o Operating System navigation.
o Logging and monitoring methods, standards, and tools.
o High availability and business continuity planning
o Caching concepts
o Configuration management
