Stability AI | HPC Engineer (Remote) @ Stability AI

United States · Remote

Entry Level +1 · Full time

Posted 7 months ago

About Stability:

Stability AI is a community- and mission-driven, open-source artificial intelligence company that cares deeply about real-world implications and applications. Our most significant advances grow from our diversity in working across multiple teams and disciplines. We are unafraid to go against established norms and explore creativity. We are motivated to generate breakthrough ideas and convert them into tangible solutions. Our vibrant communities consist of experts, leaders, and partners across the globe who are developing cutting-edge open AI models for Image, Language, Audio, Video, 3D, and Biology.

About the role:

We are looking for a talented Engineer with a focus on High-Performance Computing who will work with a growing multidisciplinary team of research scientists and machine learning engineers to improve and scale the efficiency of our computing capacity. Stability AI operates a very large HPC cluster for training foundational AI models across several modalities. Operating, automating, monitoring, and troubleshooting the cluster are strategically important to the long-term success of the business. This HPC Engineer role is critically important to our company, and the ideal candidate will possess a passion for making incremental, measurable improvements, as well as for solving unique problems that have yet to be solved in our industry.

Responsibilities:

Maintain HPC Cluster Operations: Ensure the smooth operation of HPC clusters, including routine maintenance, software updates, and hardware optimizations
Monitor and Recover Dead Nodes: Continuously monitor cluster nodes, identify dead nodes, and implement recovery procedures to minimize downtime
Documentation: Maintain detailed documentation of dead node incidents, their root causes, and resolutions for future reference and improvement
Shared Volumes Management: Monitor the health and usage of shared volumes, and collaborate with users to enforce cleanup procedures
POSIX Permissions Enforcement: Monitor shared storage for violations of POSIX permissions standards and contact the users responsible to enhance security
HPC Help Center Support: Monitor and respond to user queries and issues submitted to the HPC Help Center, providing timely solutions and assistance
Job Launch Support: Assist users in launching jobs efficiently, reducing the need for constant supervision and ensuring optimal job execution
Optimizing Low-Priority Jobs: Guide users on maximizing the utilization of low-priority jobs through strategies such as preemption robustness and auto-requeueing
S3 Access Permissions: Maintain and troubleshoot S3 access permissions, resolving access issues and ensuring data integrity
Interactive Job Monitoring: Monitor all CPU clusters for users who forget to end interactive jobs and take appropriate actions to maintain cluster availability
Authentication and Authorization: Develop and maintain processes related to authentication, authorization, and accounting for cluster usage, ensuring secure access management
Security Measures: Implement and enhance security protocols for HPC clusters, including tools for rapid access removal in case of security risks
Slurm Scheduling Deployment: Convert and deploy Slurm scheduling for various cloud resources, including Kubernetes (K8s), TPUs, and Trainium
Slurm Support: Open Slurm support tickets with external Slurm support providers and see them through to resolution to address scheduling and cluster management issues
AWS Resource Management: Maintain and manage AWS resources associated with HPC clusters, including login nodes, S3 buckets, FSx volumes, VPCs, subnets, NAT Gateways, S3 VPC Endpoints, and routing tables

Requirements:

Bachelor's degree in computer science, information technology, or a related field. Master's degree preferred
Proven experience in high-performance computing (HPC) administration and maintenance
Proficiency in HPC cluster management tools and technologies, with a strong focus on Slurm scheduling
Knowledge of cloud computing platforms, particularly AWS, and experience with managing associated resources
Strong scripting and programming skills (e.g., Bash, Python) for automation and system optimization
Familiarity with authentication, authorization, and accounting (AAA) processes for cluster usage
Understanding of security best practices and the ability to quickly respond to security threats
Excellent communication skills to effectively collaborate with users, solve issues, and provide guidance
Attention to detail and the ability to document processes and solutions effectively

Equal Employment Opportunity:

We are an equal opportunity employer and do not discriminate on the basis of race, religion, national origin, gender, sexual orientation, age, veteran status, disability or other legally protected statuses.

Stability AI

We are building the foundation to activate humanity’s potential.

Size: 101-250 employees
