Job Description

We are representing a client—an innovative, fast-growing company in the robotics and AI space. The company is building an open-source humanoid robotics platform and is committed to democratizing access to advanced robotics by openly sharing hardware designs, firmware, and machine learning models with the global developer community.

The Role

As an Infrastructure Engineer, you will own and evolve the platform that everything at the firm runs on, from inference serving to training rigs to the agentic coding infrastructure that powers day-to-day engineering. You will work deep in the stack, across OpenStack, Kubernetes, and bare metal, and set the technical direction for how the company's Cloud scales.

What You'll Do

Own and operate the company's Cloud: OpenStack Nova compute, Neutron networking, and Trove database services across Prod, Dev, and Sysadmin clusters
Manage Kubernetes clusters via Cluster API and kubeadm, including control plane operations, node lifecycle, and cluster upgrades
Manage and improve our inference platform: vLLM for serving and AIBrix for multi-model orchestration and autoscaling across a fleet of NVIDIA GPUs
Build and maintain autoscaling at every layer: Cluster Autoscaler, HPA, and KEDA for event-driven workload scaling
Operate platform services: Kafka, Redis, PostgreSQL, OpenSearch, Prometheus
Own the observability stack: Grafana, Mimir, Tempo, Loki, Pyroscope, and OnCall, providing one pane of glass across all clusters
Manage GitOps deployments via ArgoCD and identity via Keycloak integrated with Google Workspace
Harden network security across private load balancers, firewalls, and VPC segmentation
Support training infrastructure: self-service VM provisioning, RunPod burst capacity, and Weights & Biases integration
Drive infrastructure reliability, cost efficiency, and capacity planning as the platform scales

What We're Looking For

5+ years of hands-on infrastructure engineering experience in production environments
Extensive experience with OpenStack in production: Nova, Neutron, Cinder, Trove, Horizon, and CLI administration
Strong Kubernetes experience without managed control planes: Cluster API, kubeadm, self-managed clusters
Deep Linux proficiency: RHEL, Ubuntu, or equivalent, including kernel-level debugging and performance tuning
Experience with infrastructure-as-code and automation: Ansible, Terraform, or equivalent
Familiarity with GPU infrastructure: inference serving, vLLM, model orchestration, and cluster management
Solid understanding of GitOps workflows and tools like ArgoCD
Experience with observability: Prometheus, Grafana, distributed tracing, log aggregation
Strong networking fundamentals: VPCs, firewalls, load balancers, private cluster architecture
Comfort operating in a high-ownership environment where you make architecture decisions and move fast

Bonus points for:

Experience with KVM virtualization and storage backends like Ceph
Familiarity with vLLM internals: PagedAttention, continuous batching, tensor parallelism
Experience with KEDA or event-driven autoscaling patterns
Background in AI/ML infrastructure or GPU cluster operations at scale
Prior open-source contributions to OpenStack, Kubernetes, or adjacent projects

About SECOND TALENT SG PTE. LTD.
