Site Reliability Engineer, Machine Learning Operations
MANPOWER STAFFING SERVICES (SINGAPORE) PTE LTD
Singapore
3+ years
Posted Mar 5, 2026
Salary Range
SGD 90,000 - SGD 102,000/year
SGD 7,500 - SGD 8,500/month
Skills Required
Machine Learning, DevOps, Large Scale Systems, Go, Customer Centric Solutions, Bash, Computer Science, Hardware Engineering, Troubleshoot system failures, Python, Software Testing, Site Reliability Engineering, Bash/Shell/PowerShell, system requirements specification, Linux
Job Description
Purpose of Role:
- Frontline On-Call Ownership: Serve as the primary responder for the Applied Machine Learning Engine, taking ownership of system availability, health monitoring, and immediate incident response to ensure high reliability.
- Incident Lifecycle Management: Manage the end-to-end feedback loop for incidents, including rapid triage, effective resolution, and the facilitation of post-incident reviews to ensure closure and prevent recurrence.
- SOP Execution & Optimization: Execute upgrades and deployments strictly adhering to Standard Operating Procedures (SOPs), while actively leveraging Machine Learning and Infrastructure expertise to refine, automate, and improve these processes for greater efficiency.
Responsibilities:
- Analyse user needs related to the machine learning systems provided by the AML department, gathered through on-call shifts or other channels, and propose customer-oriented solutions.
- Work with other software engineers to implement and deploy customer-oriented machine learning framework solutions, whether proposed by yourself or by others.
- Update software, enhance existing software capabilities, and develop or deploy software testing, deployment, capacity management, and validation procedures.
- Work with computer hardware engineers to integrate hardware and software systems, troubleshoot issues, and define specifications and performance requirements.
Minimum requirements:
- Bachelor’s degree in Computer Science or equivalent, with 3+ years of relevant experience.
- Proven experience in analyzing and troubleshooting distributed systems.
- Prior experience designing or maintaining large-scale systems.
- Scripting skills in at least one major language (Python, Go, or Shell/Bash) to automate repetitive operational tasks.
Nice to have:
- Experience defining and managing Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets, and practicing Chaos Engineering.
- Experience operating MLOps platforms and toolkits such as Kubeflow, MLflow, Feast, or Ray.
- Deep understanding of Linux operating system internals or container technologies (Docker/containerd) and orchestration platforms (Kubernetes) in a production environment.
- Basic understanding of Machine Learning concepts and familiarity with model-serving frameworks like TensorFlow Serving, TorchServe, or Triton Inference Server.