Purpose of Role:

Frontline On-Call Ownership: Serve as the primary responder for the Applied Machine Learning Engine, taking ownership of system availability, health monitoring, and immediate incident response to ensure high reliability.
Incident Lifecycle Management: Manage the end-to-end feedback loop for incidents, including rapid triage, effective resolution, and the facilitation of post-incident reviews to ensure closure and prevent recurrence.
SOP Execution & Optimization: Execute upgrades and deployments strictly adhering to Standard Operating Procedures (SOPs), while actively leveraging Machine Learning and Infrastructure expertise to refine, automate, and improve these processes for greater efficiency.

Responsibilities:

Analyze user needs related to the machine learning systems provided by the AML department, gathered through on-call shifts or other channels, and propose customer-oriented solutions.
Work with other software engineers to implement and deploy customer-oriented machine learning framework solutions, whether proposed by yourself or by others.
Update software, enhance existing software capabilities, and develop or deploy procedures for software testing, deployment, capacity management, and validation.
Work with computer hardware engineers to integrate hardware and software systems and to troubleshoot them against specifications and performance requirements.

Minimum requirements:

Bachelor’s degree in Computer Science or equivalent, with 3+ years of relevant experience.
Proven experience in analyzing and troubleshooting distributed systems.
Prior experience designing or maintaining large-scale systems.
Scripting skills in at least one major language (Python, Go, or Shell/Bash) to automate repetitive operational tasks.

Nice to have:

Experience defining and managing Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets, and practicing Chaos Engineering.
Experience operating MLOps platforms and toolkits such as Kubeflow, MLflow, Feast, or Ray.
Deep understanding of Linux operating system internals or container technologies (Docker/containerd) and orchestration platforms (Kubernetes) in a production environment.
Basic understanding of Machine Learning concepts and familiarity with model-serving frameworks such as TensorFlow Serving, TorchServe, or Triton Inference Server.

Site Reliability Engineer, Machine Learning Operations