Site Reliability Engineer (AI & Automation)

TOSS-EX PR PTE. LTD.
Singapore | 7+ years | Posted Mar 23, 2026

Salary Range

SGD 96,000 - SGD 132,000/year

SGD 8,000 - SGD 11,000/month

Skills Required

UNIX System Administration, Scalability, High Availability, Go, Monitoring SLAs, Technological Proficiency, Performing, Reliability Engineering, Python, Management Control, Automation, Java, Performance Management

Job Description

Job Summary

We are seeking a highly skilled Site Reliability Engineer (SRE) with strong expertise in AI-driven operations, automation, and cloud platforms. The ideal candidate will be responsible for ensuring high availability, performance, scalability, and reliability of mission-critical systems while leveraging AI/ML and automation tools to enhance operational efficiency and incident management.

Key Responsibilities

Reliability & Operations

  • Ensure high availability, scalability, and performance of production systems.
  • Define and manage SLIs, SLOs, and SLAs.
  • Perform root cause analysis (RCA) and implement preventive measures.
  • Manage incident response, problem management, and postmortems.

Automation & AI Integration

  • Design and implement AI-driven monitoring, alerting, and anomaly detection solutions.
  • Automate repetitive operational tasks using scripts, workflows, and orchestration tools.
  • Leverage AIOps platforms to predict and prevent incidents.
  • Build self-healing systems using automation frameworks.

Cloud & Infrastructure

  • Manage and optimize infrastructure on cloud platforms (GCP/AWS/Azure).
  • Implement Infrastructure as Code (IaC) using tools like Terraform or CloudFormation.
  • Ensure resilience, failover strategies, and disaster recovery readiness.

Required Skills & Qualifications

Technical Skills

  • Strong experience in Linux/Unix systems administration
  • Proficiency in Python, Java, or Go for automation
  • Hands-on experience with containerization (Docker, Kubernetes)
  • Experience with CI/CD tools (Jenkins, GitHub Actions, GitLab CI)
  • Expertise in cloud platforms (GCP preferred, AWS/Azure acceptable)
  • Knowledge of Infrastructure as Code (Terraform, Ansible, Puppet)

AI & Automation

  • Experience with AIOps tools (e.g., Dynatrace, Moogsoft, Datadog AI features)
  • Understanding of machine learning basics for anomaly detection
  • Experience in building or integrating automation frameworks and bots
  • Familiarity with chatbots, auto-remediation scripts, and predictive analytics

SRE Practices

  • Strong understanding of SRE principles
  • Experience with incident management and reliability engineering
  • Knowledge of capacity planning and performance tuning