Job Description

· We are looking for a highly skilled Site Reliability Engineer (SRE) to own and evolve our enterprise observability and reliability platforms.

· This role is responsible for ensuring availability, performance, scalability, and reliability of large-scale, cloud-native applications running on Kubernetes and OpenShift.

· The SRE will partner closely with application and platform teams to embed reliability engineering, SLO-driven operations, and automation-first practices.

Key Responsibilities

Reliability Engineering & SRE Practices

· Define, implement, and continuously improve SLIs, SLOs, and error budgets for enterprise applications.

· Drive reliability-focused decision-making using error budgets, MTTD, MTTR, and service health metrics.

· Proactively identify reliability risks and performance bottlenecks and drive remediation.

· Lead incident response, post-incident reviews (blameless postmortems), and reliability improvements.

Observability Platform Ownership

· Own and operate open-source–based observability platforms covering metrics, logging, and distributed tracing.

· Enhance, optimize, and migrate observability solutions to improve scalability, resilience, and cost efficiency.

· Maintain and tune Prometheus and other TSDBs, including cardinality management and resource optimization.

· Operate distributed tracing platforms such as OpenTelemetry, Jaeger, and Zipkin, including tuning sampling strategies and troubleshooting microservices traces.

Kubernetes & OpenShift Reliability

· Support and enable application teams to migrate workloads to newer OpenShift/Kubernetes versions.

· Deploy, manage, and troubleshoot stateful and stateless workloads on Kubernetes platforms.

· Improve platform reliability through automation, self-healing, and standardized deployment patterns.

· Partner with developers to implement application instrumentation and reliability best practices.

Logging, Alerting & Incident Response

· Operate enterprise logging platforms such as ELK Stack and Grafana Loki, including Elasticsearch cluster management and index lifecycle management.

· Design and maintain actionable alerting aligned to SLOs and business impact.

· Integrate alerting platforms with PagerDuty, Microsoft Teams, and other incident management tools.

· Reduce alert fatigue by implementing alert hygiene and signal-to-noise optimization.

Dashboards & Service Visibility

· Deploy and administer visualization tools such as Grafana and Kibana.

· Create standardized, reusable dashboards for service health, reliability, and capacity planning.

· Implement and manage RBAC across observability platforms.

Infrastructure, Security & Automation

· Troubleshoot observability infrastructure issues across Linux VMs and Kubernetes pods.

· Secure observability and platform endpoints using TLS, reverse proxies, and authentication mechanisms (MFA, LDAPS, OAuth).

· Build and maintain CI/CD pipelines for observability and reliability tooling.

· Extend pipelines to support multiple environments and regions with consistency and repeatability.

Reliability Culture & Enablement

· Champion an SRE and observability-first culture across engineering teams.

· Coach teams on golden signals, service health modeling, and reliability trade-offs.

· Enable teams to move from reactive monitoring to proactive reliability engineering.

Required Skills & Experience

Core Technical Skills

· Strong hands-on experience with:

1. Prometheus, Grafana
2. Elasticsearch, Kibana (cluster operations, ILM, tuning)
3. OpenTelemetry, Jaeger, Zipkin
4. Kubernetes & OpenShift
5. Linux OS troubleshooting
6. CI/CD pipelines and automation

· Solid understanding of SRE principles, including SLIs, SLOs, error budgets, and incident management.

· Experience supporting production, highly available, distributed systems.

Working Hours

· Monday to Friday, 9:00 AM – 6:00 PM.

· Occasional weekend support may be required for critical deployments or incidents; compensatory time off will be provided.

Site Reliability Engineer

About ELLIOTT MOSS CONSULTING PTE. LTD.
