Posted on Jun 6, 2024

Manager, Cloud Operations Engineering

Roseland
Manager
Operations, Engineering
CoreWeave
Private
101-250
Software, Security & Developer Tools

CoreWeave is a specialized cloud provider focused on GPU-accelerated use cases including VFX, AI/ML, Batch Processing and Real Time Experiences. We support countless AI/ML services in the text-to-image, NLP and broader AI/ML space, reducing clients’ infrastructure management requirements with our Kubernetes-based serverless GPU cloud offerings.

Job Description

About the role:

The Cloud Operations Team is the heart of CoreWeave’s operational practice. This team responds to performance and availability issues across the CoreWeave cloud, bridging the gap between Customer Support and internal service-owning teams. Working in shifts to ensure 24x7 coverage, the team develops proactive health monitoring, triages alerts and incidents, serves in the commander role during Priority Incident events, and participates in ongoing analysis and reliability improvement practices.

Collaborating across development and engineering, this team operates horizontally and vertically within the CoreWeave ecosystem to root out problems, initiate and coordinate responses, and drive lower MTTR and MTTD scores.

The newly formed team is staffed with engineers who have broad technology and troubleshooting skills and are actively expanding their knowledge in critical areas such as networking, storage, Kubernetes, automation, and observability. You will bootstrap the team’s processes and procedures and be their direct manager.

As the people leader for this team of 8 Operations Engineers, you will facilitate and empower their success. Drawing on your experience in Cloud Operations, you deeply understand the importance of process, documentation and automation. You strive for continual improvement. You will maintain a close working relationship with each of your team members through regular 1:1s focusing on the ‘whole engineer’, guiding them in their skills and career development at CoreWeave. Engineers on your team are likely to mature into strong individual contributors on peer engineering teams across the organization, and you will help them prepare while simultaneously providing exceptional support to those same teams.

As Manager of the Cloud Operations Team you will:

  • Grow, change, invest in your teammates, be invested in, share your ideas, listen to others, be curious, have fun, and above all, be yourself.
  • Learn and navigate the tools, systems and processes that enable the AI cloud.
  • Bootstrap the team’s operational processes, and roadmap the key project work and tooling requirements for the team’s success.
  • Own staffing, scheduling and HR responsibilities.
  • Develop and lead team cadence and planning sessions in conjunction with our Technical Project Manager.
  • Develop internal processes, procedures, and documentation to ensure efficient management of the team’s workload.
  • Track and report on key metrics that represent the team’s improvement and impact.
  • Act as the Sr. Incident Commander, and develop the team’s ability to run Major Incidents efficiently.
  • Participate as a key member of the enterprise ITSM cadence, reporting on incident trends, durations (MTTR, MTTD, etc.), problems, and Incident Reviews.
  • Own the Post Incident Review process.
  • Continually improve our incident response process with the goal of iteratively reducing MTTR through all reasonable methods (tooling, process, automation, etc.).
  • Partner with service owners, SRE, and Customer Support to ensure process alignment, knowledge sharing and shared responsibility regarding Incident Management, Post Incident Reviews, Production Readiness Assessments, etc.

Wondering if you’re a good fit? We believe in investing in our people, and value candidates who can bring their own diversified experiences to our teams – even if you aren't a 100% skill or experience match. Here are some qualities we’ve found compatible with our team. If a portion of this resonates with you, we’d love to talk.

  • You come with your own philosophies and strategies, are adaptable to new information, freely provide feedback and coaching, and actively participate in improving how the team functions.
  • You have experience with business process development and can see where communication breakdowns are likely to occur.
  • You are committed to understanding the needs of others, and how you can effectively leverage your own talents to ensure collective success.
  • You are comfortable using observability data to visualize service health and triangulate the proximate cause of performance and availability issues.
  • You are comfortable making sense of complex environments and leading others through troubleshooting without actively fixing things yourself.
  • You can lead when there’s ambiguity, and follow when engineers lead.
  • You have experience in a support capacity and/or a broad understanding of modern applications and infrastructure.
  • You are comfortable managing communication and coordinating multiple engineers during an incident.
  • You have a desire to learn or have experience with process automation.
  • You have a customer-first mindset and bring empathy for the customer as well as the engineering team that is tasked with solving complex problems.
  • You’re excited to join a team with diverse perspectives and backgrounds that believes in tackling challenges, growing hand in hand, and winning together.

Our compensation reflects the cost of labor across several US geographic markets. The base pay for this position ranges from $180,000 to $200,000/year. Pay is based on a number of factors including market location and may vary depending on job-related knowledge, skills, and experience.

Hybrid Workplace

If you reside within a 30-mile radius of our New Jersey, New York, or Philadelphia offices, we're excited for you to join us at the office at least three times a week, recognizing the significance we place on fostering connections, collaboration, and creativity within our office culture. Our commitment to operating as a hybrid workplace underscores our dedication to enabling our employees to tailor their work-life balance to their individual preferences.

CoreWeave is a specialized cloud provider, delivering a massive scale of GPU compute resources on top of the industry’s fastest and most flexible infrastructure. CoreWeave builds cloud solutions for compute intensive use cases — VFX and rendering, machine learning and AI, batch processing, and Pixel Streaming — that are up to 35 times faster and 80% less expensive than the large, generalized public clouds. Learn more at www.coreweave.com.
