Contract - 6 months
Sydney
Hybrid
We're looking for an experienced Senior Site Reliability Engineer to join a high-impact digital engineering team supporting one of Australia's most widely used customer-facing eCommerce applications.
This role is all about driving platform stability, performance, and scalability across a complex Azure and Kubernetes environment. You'll take ownership of monitoring, performance optimisation, and automation initiatives that ensure the digital platforms run smoothly.
We're looking for a hands-on engineer who can hit the ground running, work autonomously, and help shape the platform strategy for the future.
Key Responsibilities
Maintain and improve the reliability, performance, and scalability of large-scale customer-facing applications.
Manage and optimise Azure Kubernetes Service (AKS) clusters, ensuring cost efficiency and right-sizing at scale.
Implement and refine monitoring, alerting, and observability using tools such as Dynatrace and Azure-native monitoring solutions.
Identify and reduce unnecessary logs and alerts to improve signal-to-noise ratio and platform insight.
Work closely with software engineering teams (primarily .NET and GraphQL stacks) to diagnose performance issues and improve application behaviour within the clusters.
Collaborate on platform automation, driving efficiency and consistency through Infrastructure as Code and CI/CD pipelines.
Contribute to defining and executing the platform strategy to ensure reliability, maintainability, and scalability across digital services.
Take ownership of incident response, post-mortem analysis, and ongoing performance tuning.
Support and optimise Microsoft SQL environments that underpin core application services.
Skills and Experience
7+ years' experience in Site Reliability Engineering, DevOps, or Platform Engineering roles.
Proven experience running and optimising Azure Kubernetes Service (AKS) clusters in production at scale.
Strong background in application performance tuning and monitoring/alerting frameworks (preferably Dynatrace).
Familiarity with .NET and GraphQL application architectures, and an ability to collaborate effectively with development teams to diagnose issues.
Strong Microsoft SQL Server (MSSQL) experience in performance monitoring and troubleshooting.
Deep understanding of observability, logging, metrics, and tracing best practices.
Hands-on experience with automation, scripting, and Infrastructure as Code (PowerShell, Terraform, ARM templates, etc.).
A proactive mindset focused on platform stability, cost optimisation, and continuous improvement.
Excellent communication skills and the ability to work independently with minimal guidance.
