Job Description

OVERVIEW

We’re hiring a Site Reliability Engineer to support a key global technology client. You’ll join a modern, cloud‑native engineering environment and partner closely with development teams to improve the reliability, scalability, and automation of distributed platforms. The role blends software engineering with reliability ownership: you’ll design and build internal services and tooling, streamline CI/CD, implement Infrastructure‑as‑Code at scale, and strengthen observability so issues are found and fixed before they impact users.

This position offers high autonomy and visibility. You’ll work across well‑documented systems and established tooling, prepare proofs of concept to influence change, and drive pragmatic automation (in Go or Python) that reduces manual effort and makes releases safer and faster. If you enjoy hands‑on engineering, diagnosing complex problems, and landing improvements in real production environments, this is an opportunity to make a clear and measurable impact.

DESCRIPTION

As a Site Reliability Engineer, you will:

Build internal platforms, services, and APIs that enable self‑service provisioning, safe deployments, and efficient day‑to‑day operations.
Enhance CI/CD workflows (e.g., Jenkins or similar) to increase deployment reliability, add guardrails, and improve developer experience and velocity.
Implement and evolve Infrastructure‑as‑Code using Terraform (and related patterns) to standardize environments, reduce configuration drift, and improve repeatability.
Define and operationalize SLIs/SLOs and error budgets, build actionable dashboards, and tune alerts to reflect user experience and business risk.
Operate Kubernetes workloads at scale; improve resilience, performance, and cost‑efficiency through sound engineering and automation.
Strengthen observability (metrics, logs, traces) using Prometheus and complementary platforms; drive root‑cause analysis and preventative fixes.
Automate routine work and periodic upgrade cycles (preferably in Go/Python) to eliminate toil and reduce change risk.
Troubleshoot complex incidents across compute, networking, containers, and deployments; participate in a shared on‑call rotation and contribute to post‑incident reviews.
Collaborate with engineers, architects, and product stakeholders to translate requirements into secure, observable, and scalable infrastructure solutions.
Document patterns and best practices; mentor teams on reliability‑first ways of working and platform standards.

QUALIFICATIONS

Strong hands‑on experience with AWS (production environments) and cloud‑native architectures; familiarity with hybrid or multi‑cloud concepts is a plus.
Practical expertise operating Kubernetes (deployments, day‑2 operations, and troubleshooting).
Solid CI/CD skills with Jenkins or similar tools (pipeline design, release safety, rollbacks).
Proficiency in Infrastructure‑as‑Code (Terraform) and Git‑based workflows for environment management.
Programming/automation in Go and/or Python (production‑quality code; tooling and services, not just scripts).
Observability experience with Prometheus and dashboards/alerting tuned to SLIs/SLOs; familiarity with platforms such as Grafana, Datadog, or CloudWatch is welcome.
Networking fundamentals for distributed systems: DNS, load balancing, VPC design, security groups, and layer‑7 routing/proxies.
Sound understanding of secure system design (least privilege, secrets management, change control) and performance/reliability trade‑offs.
Excellent communication skills and the ability to operate independently in distributed, asynchronous teams while influencing stakeholders through clear proposals and POCs.
7+ years in SRE/DevOps/Infrastructure/Software Engineering with a track record of operating production‑grade systems at scale.

PROFESSIONAL ATTRIBUTES

Ownership: You’re accountable across both build and run; you close the loop with measurable outcomes.
Automation first: You remove toil with durable solutions, not quick fixes.
Engineering rigor: You apply design patterns, testing, and code reviews to platform work.
Influence without authority: You use documentation, POCs, and calm communication to align teams.
Proactive and visible: You work independently across time zones and keep stakeholders informed.

We regret to inform you that only shortlisted candidates will be contacted.

EA Registration No: R21103843, Andrew Jonas Matthew

Allegis Group Singapore Pte Ltd, Company Reg No. 200909448N, EA License No. 10C4544
