Site Reliability Engineer

ELLIOTT MOSS CONSULTING PTE. LTD.
Singapore | 4+ years | Posted Jan 21, 2026

Salary Range

SGD 62,400 - SGD 78,000/year

SGD 5,200 - SGD 6,500/month

Skills Required

Sampling, Troubleshooting, Remediation, Scalability, Kubernetes, Pipelines, Reliability, Tuning, Logging, Reliability Engineering, Kibana, Authentication, Linux, Incident Management

Job Description

·      We are looking for a highly skilled Site Reliability Engineer (SRE) to own and evolve our enterprise observability and reliability platforms. 

·      This role is responsible for ensuring availability, performance, scalability, and reliability of large-scale, cloud-native applications running on Kubernetes and OpenShift.

·       The SRE will partner closely with application and platform teams to embed reliability engineering, SLO-driven operations, and automation-first practices. 

Key Responsibilities

Reliability Engineering & SRE Practices

·      Define, implement, and continuously improve SLIs, SLOs, and error budgets for enterprise applications.

·      Drive reliability-focused decision-making using error budgets, MTTD, MTTR, and service health metrics.

·      Proactively identify reliability risks and performance bottlenecks and drive remediation. 

·      Lead incident response, post-incident reviews (blameless postmortems), and reliability improvements. 

Observability Platform Ownership

·      Own and operate open-source-based observability platforms covering metrics, logging, and distributed tracing.

·      Enhance, optimize, and migrate observability solutions to improve scalability, resilience, and cost efficiency.

·       Maintain and tune Prometheus and other TSDBs, including cardinality management and resource optimization. 

·      Operate distributed tracing platforms such as OpenTelemetry, Jaeger, and Zipkin, including tuning sampling strategies and troubleshooting microservice traces.

Kubernetes & OpenShift Reliability

·      Support and enable application teams to migrate workloads to newer OpenShift/Kubernetes versions.

·       Deploy, manage, and troubleshoot stateful and stateless workloads on Kubernetes platforms. 

·      Improve platform reliability through automation, self-healing, and standardized deployment patterns.

·       Partner with developers to implement application instrumentation and reliability best practices.

Logging, Alerting & Incident Response

·      Operate enterprise logging platforms such as ELK Stack and Grafana Loki, including Elasticsearch cluster management and index lifecycle management.

·       Design and maintain actionable alerting aligned to SLOs and business impact. 

·      Integrate alerting platforms with PagerDuty, Microsoft Teams, and other incident management tools.

·       Reduce alert fatigue by implementing alert hygiene and signal-to-noise optimization.

Dashboards & Service Visibility

·      Deploy and administer visualization tools such as Grafana and Kibana.

·       Create standardized, reusable dashboards for service health, reliability, and capacity planning.

·       Implement and manage RBAC across observability platforms.

Infrastructure, Security & Automation

·      Troubleshoot observability infrastructure issues across Linux VMs and Kubernetes pods.

·       Secure observability and platform endpoints using TLS, reverse proxies, and authentication mechanisms (MFA, LDAPS, OAuth).

·       Build and maintain CI/CD pipelines for observability and reliability tooling. 

·      Extend pipelines to support multiple environments and regions with consistency and repeatability.

Reliability Culture & Enablement

·      Champion an SRE and observability-first culture across engineering teams.

·      Coach teams on golden signals, service health modeling, and reliability trade-offs.

·       Enable teams to move from reactive monitoring to proactive reliability engineering. 

Required Skills & Experience

Core Technical Skills

·      Strong hands-on experience with:

1. Prometheus, Grafana
2. Elasticsearch, Kibana (cluster operations, ILM, tuning)
3. OpenTelemetry, Jaeger, Zipkin
4. Kubernetes & OpenShift
5. Linux OS troubleshooting
6. CI/CD pipelines and automation

·      Solid understanding of SRE principles, including SLIs, SLOs, error budgets, and incident management. 

·      Experience supporting production, highly available, distributed systems.

Working Hours

·      Monday to Friday: 9:00 AM – 6:00 PM

·      Occasional weekend support may be required for critical deployments or incidents; compensatory time off will be provided.