SalaryPeak

Senior Site Reliability Engineer (SRE)

TOKENINSIGHT SG PTE. LTD.
Singapore | 5+ years | Posted Yesterday

Salary Range

SGD 120,000 - SGD 240,000/year

SGD 10,000 - SGD 20,000/month

Skills Required

  • Development & Engineering
  • Operational Excellence
  • Kubernetes
  • Hands-on Skills
  • Pipeline Development
  • Reliability Engineering Management
  • Computer Science
  • WebSockets
  • Programming
  • Design for Reliability

Job Description

Responsibilities

We are looking for an experienced Site Reliability Engineer who is passionate about building reliable, scalable, and automated infrastructure to support mission-critical platform services.

What You'll Do

  • Ensure the reliability, availability, and operational excellence of critical platform services and infrastructure.
  • Design, deploy, maintain, and optimize cloud-native infrastructure based on Kubernetes and Docker.
  • Build and improve observability systems including monitoring, alerting, logging, and distributed tracing.
  • Participate in architecture reviews and provide reliability-focused recommendations for high-concurrency, low-latency distributed systems.
  • Develop and maintain CI/CD pipelines to improve engineering productivity and deployment quality.
  • Lead capacity planning, performance tuning, disaster recovery planning, and resilience engineering initiatives.
  • Drive Infrastructure as Code (IaC) adoption and automation to reduce operational overhead and human error.
  • Define and continuously improve SLI/SLO/SLA frameworks across critical services.
  • Participate in incident response, root cause analysis (RCA), and postmortem reviews for production issues.
  • Collaborate closely with engineering, QA, product, and security teams to continuously improve platform reliability, scalability, and efficiency.
  • Leverage AI-powered tools (e.g., Cursor, Claude Code, ChatGPT) to enhance operational automation, troubleshooting, and productivity.

Requirements

Must-Have Skills

  • Bachelor's degree or above in Computer Science or a related field.
  • 5+ years of experience in SRE, DevOps, Infrastructure Engineering, or related roles.
  • Strong knowledge of Linux systems and performance optimization.
  • Proficiency in at least one programming language such as Go, Python, Java, or Rust.
  • Hands-on experience with Kubernetes, Docker, and cloud-native ecosystems.
  • Experience with CI/CD tools such as GitHub Actions, GitLab CI, or Jenkins.
  • Solid understanding of networking fundamentals including TCP/IP, HTTP, and WebSocket.
  • Strong troubleshooting, performance analysis, and capacity planning skills.
  • Experience building automation tools and operational platforms.
  • Demonstrated proficiency in AI-assisted development and operations tools such as Cursor and Claude Code.

Technical Stack

Container Platforms

  • Kubernetes
  • Docker

Observability

  • Prometheus
  • Grafana
  • Loki
  • ELK
  • OpenTelemetry

Messaging Systems

  • Kafka
  • RocketMQ
  • Redis

Databases

  • MySQL
  • PostgreSQL
  • ClickHouse
  • Time-Series Databases

Infrastructure Automation

  • Terraform
  • Ansible
  • Helm

Cloud Platforms

  • AWS
  • GCP
  • Alibaba Cloud
  • Tencent Cloud

CI/CD

  • GitHub Actions
  • GitLab CI
  • Jenkins

Preferred Experience

  • Experience in large-scale internet, SaaS, fintech, e-commerce, or mission-critical platform environments.
  • Experience supporting high-concurrency distributed systems.
  • Strong understanding of distributed system architecture, scalability, and reliability engineering principles.
  • Experience operating multi-region or multi-datacenter infrastructure.

Nice to Have

  • Experience managing large-scale Kubernetes clusters (1,000+ nodes).
  • Hands-on experience with Service Mesh technologies (e.g., Istio) and OpenTelemetry.
  • Expertise in Kafka, ClickHouse, and large-scale distributed system optimization.
  • Experience implementing Chaos Engineering practices.
  • Strong background in incident management and large-scale production recovery.
  • Experience with AIOps, intelligent alerting, and automated fault diagnosis systems.