About this role
Role Overview
As a Site Reliability Engineer, you will play a crucial role in training and optimizing AI models within advanced containerized infrastructure. The position focuses on real-time troubleshooting and dynamic process recovery. It is a high-intensity project, with potential for future extensions based on performance.
Key Responsibilities
- Lead the deployment, monitoring, and recovery of complex, containerized AI training environments using advanced terminal techniques.
- Proactively identify, diagnose, and resolve infrastructure bottlenecks and failures in long-running processes.
- Orchestrate resilient system builds and manage infrastructure to ensure stability and optimal resource utilization.
- Collaborate closely with engineering teams to refine CI/CD pipelines and automate routine operational tasks.
- Manage and optimize filesystem structures, networked storage, and process scheduling in Dockerized sandboxes.
- Conduct rapid mid-execution replanning during error states and unforeseen runtime issues.
- Document best practices and emergent solutions, and contribute to knowledge transfer across the team.
Required Qualifications
- Demonstrated expert proficiency with terminal-based problem solving and complex system administration.
- Mastery of dynamic infrastructure recovery and long-running operational process management.
- Deep expertise in containerized environments (e.g., Docker, Kubernetes) and sandbox orchestration.
- Strong Python skills, with the ability to script, automate, and debug real-world production systems.
- Proficiency in Bash and familiarity with JavaScript/TypeScript, Go, Rust, and C/C++.
- Experience with build systems, package managers, databases, version control, and cryptography tools.
- Adept at troubleshooting, documenting, and replanning in high-velocity technical environments.
Preferred Qualifications
- Background in machine learning operations or AI infrastructure.
- Familiarity with ML frameworks and distributed computing.
- Experience supporting multi-phase, high-intensity engineering projects.
Employment Type: Contract
Compensation
The hourly rate ranges from $40 to $70.
Eligibility
This position is fully remote.