Salary Range
SGD 62,400 - SGD 78,000/year
SGD 5,200 - SGD 6,500/month
Job Description
· We are looking for a highly skilled Site Reliability Engineer (SRE) to own and evolve our enterprise observability and reliability platforms.
· This role is responsible for ensuring availability, performance, scalability, and reliability of large-scale, cloud-native applications running on Kubernetes and OpenShift.
· The SRE will partner closely with application and platform teams to embed reliability engineering, SLO-driven operations, and automation-first practices.
Key Responsibilities
Reliability Engineering & SRE Practices
· Define, implement, and continuously improve SLIs, SLOs, and error budgets for enterprise applications.
· Drive reliability-focused decision-making using error budgets, MTTD, MTTR, and service health metrics.
· Proactively identify reliability risks and performance bottlenecks and drive remediation.
· Lead incident response, post-incident reviews (blameless postmortems), and reliability improvements.
Observability Platform Ownership
· Own and operate open-source-based observability platforms covering metrics, logging, and distributed tracing.
· Enhance, optimize, and migrate observability solutions to improve scalability, resilience, and cost efficiency.
· Maintain and tune Prometheus and other TSDBs, including cardinality management and resource optimization.
· Operate distributed tracing platforms such as OpenTelemetry, Jaeger, and Zipkin, including tuning sampling strategies and troubleshooting microservices traces.
Kubernetes & OpenShift Reliability
· Support and enable application teams to migrate workloads to newer OpenShift/Kubernetes versions.
· Deploy, manage, and troubleshoot stateful and stateless workloads on Kubernetes platforms.
· Improve platform reliability through automation, self-healing, and standardized deployment patterns.
· Partner with developers to implement application instrumentation and reliability best practices.
Logging, Alerting & Incident Response
· Operate enterprise logging platforms such as ELK Stack and Grafana Loki, including Elasticsearch cluster management and index lifecycle management.
· Design and maintain actionable alerting aligned to SLOs and business impact.
· Integrate alerting platforms with PagerDuty, Microsoft Teams, and other incident management tools.
· Reduce alert fatigue by implementing alert hygiene and signal-to-noise optimization.
Dashboards & Service Visibility
· Deploy and administer visualization tools such as Grafana and Kibana.
· Create standardized, reusable dashboards for service health, reliability, and capacity planning.
· Implement and manage RBAC across observability platforms.
Infrastructure, Security & Automation
· Troubleshoot observability infrastructure issues across Linux VMs and Kubernetes pods.
· Secure observability and platform endpoints using TLS, reverse proxies, and authentication mechanisms (MFA, LDAPS, OAuth).
· Build and maintain CI/CD pipelines for observability and reliability tooling.
· Extend pipelines to support multiple environments and regions with consistency and repeatability.
Reliability Culture & Enablement
· Champion an SRE and observability-first culture across engineering teams.
· Coach teams on golden signals, service health modeling, and reliability trade-offs.
· Enable teams to move from reactive monitoring to proactive reliability engineering.
Required Skills & Experience
Core Technical Skills
· Strong hands-on experience with:
1. Prometheus, Grafana
2. Elasticsearch, Kibana (cluster operations, ILM, tuning)
3. OpenTelemetry, Jaeger, Zipkin
4. Kubernetes & OpenShift
5. Linux OS troubleshooting
6. CI/CD pipelines and automation
· Solid understanding of SRE principles, including SLIs, SLOs, error budgets, and incident management.
· Experience supporting production, highly available, distributed systems.
Working Hours
· Monday to Friday: 9:00 AM – 6:00 PM
· Occasional weekend support may be required for critical deployments or incidents; compensatory time off will be provided.
About ELLIOTT MOSS CONSULTING PTE. LTD.