Job Description

Job Summary

We are seeking a highly skilled Site Reliability Engineer (SRE) with strong expertise in AI-driven operations, automation, and cloud platforms. The ideal candidate will be responsible for ensuring high availability, performance, scalability, and reliability of mission-critical systems while leveraging AI/ML and automation tools to enhance operational efficiency and incident management.

Key Responsibilities

Reliability & Operations

Ensure high availability, scalability, and performance of production systems.
Define and manage SLIs, SLOs, and SLAs.
Perform root cause analysis (RCA) and implement preventive measures.
Lead incident response, problem management, and postmortems.

Automation & AI Integration

Design and implement AI-driven monitoring, alerting, and anomaly detection solutions.
Automate repetitive operational tasks using scripts, workflows, and orchestration tools.
Leverage AIOps platforms to predict and prevent incidents.
Build self-healing systems using automation frameworks.

Cloud & Infrastructure

Manage and optimize infrastructure on cloud platforms (GCP/AWS/Azure).
Implement Infrastructure as Code (IaC) using tools like Terraform or CloudFormation.
Ensure resilience through failover strategies and disaster recovery readiness.

Required Skills & Qualifications

Technical Skills

Strong experience in Linux/Unix systems administration
Proficiency in Python, Java, or Go for automation
Hands-on experience with containerization (Docker, Kubernetes)
Experience with CI/CD tools (Jenkins, GitHub Actions, GitLab CI)
Expertise in cloud platforms (GCP preferred, AWS/Azure acceptable)
Knowledge of Infrastructure as Code (Terraform, Ansible, Puppet)

AI & Automation

Experience with AIOps tools (e.g., Dynatrace, Moogsoft, Datadog AI features)
Understanding of machine learning basics for anomaly detection
Experience in building or integrating automation frameworks and bots
Familiarity with chatbots, auto-remediation scripts, and predictive analytics

SRE Practices

Strong understanding of SRE principles
Experience with incident management and reliability engineering
Knowledge of capacity planning and performance tuning

Site Reliability Engineer (AI & Automation)

About TOSS-EX PR PTE. LTD.
