Posted on Nov 7, 2023

Senior Observability SRE - Engineering Platforms

Mid-Senior ICs
Engineering, IT
CoreWeave
Private
101-250
Software, Security & Developer Tools

CoreWeave is a specialized cloud provider focused on GPU-accelerated use cases including VFX, AI/ML, Batch Processing and Real Time Experiences. We support countless AI/ML services in the text-to-image, NLP and broader AI/ML space, reducing clients’ infrastructure management requirements with our Kubernetes-based serverless GPU cloud offerings.

Job Description

About the role:

The Engineering Platforms Team functions as the lubricant that keeps CoreWeave’s gears of innovation turning fast and friction-free. This team is responsible for the development, integration, and operation of platforms central to the engineering experience with the ultimate objective of enabling engineers across CoreWeave to do more, better. Central to the Engineering Platforms mission is the operation of our observability, CI/CD, and service catalog systems which leverage CoreWeave’s deep investment in the Kubernetes ecosystem. Engineers on this team will endeavor to discover and remove engineer friction across CoreWeave’s engineering teams through the development of boilerplate, integrations, automation and the operation of shared platforms.

We are seeking a senior engineer specializing in the observability stack who can help us execute on the mission of providing a comprehensive logging and metrics ecosystem that integrates logging, metrics, tracing, and monitoring tools for proactive insights into system performance. This individual will work with a team of 8-10 mixed-specialization engineers and have the opportunity to work on the full gamut of rewarding challenges that come with the business of building a cloud in a communicative, supportive, and high-performing environment. As a member of the Engineering Platforms Team, you would have the opportunity to:

  • Design and implement services and tools to reduce friction and toil in the lives of our engineering and operations teams.
  • Improve the performance, security, reliability, and scalability of our observability and related services, and participate in the Engineering Platforms on-call rotation.
  • Develop dashboards, alerts, and insights into the customer experience using Grafana-ecosystem tools such as Mimir and Loki.
  • Enable and evangelize the best practices around alerting. Collaborate with teams to establish observability standards.
  • Grow, change, invest in your teammates, be invested-in, share your ideas, listen to others, be curious, have fun, and, above all, be yourself.
  • Participate in other engineering duties that this team is responsible for, such as operating, scaling, and maintaining CI/CD and other related services, and participate in the Engineering Platforms on-call rotation.

Wondering if you’re a good fit? We believe in investing in our people, and value candidates who can bring their own diversified experiences to our teams – even if you aren't a 100% skill or experience match. Here are some qualities we’ve found compatible with our team. If a portion of this resonates with you, we’d love to talk.

  • You have four or more years of experience in software or infrastructure engineering.
  • You enjoy helping your colleagues achieve more with less effort.
  • You have experience operating services in production and at scale and are versed in reliability engineering concepts such as the different types of testing, progressive deployments, error budgets, the role of observability, and fault-tolerant design.
  • You’re familiar with various logging and metrics systems like ELK, VictoriaMetrics, Thanos, or Grafana. You have experience designing and operating these systems at scale.
  • You are familiar with LogQL and PromQL and enjoy understanding the data model for observability systems.
  • You’re familiar with Kubernetes and have interest in or experience with using it for event-driven and/or stateful orchestration.
  • You’re comfortable with the idea of using Go as your primary programming language.
  • You know your way around a Linux distro, shell scripting, and/or the Linux storage and networking stacks.
  • You can transform problems into elastic solutions, decompose them into achievable tasks, and socialize both with your teammates.
  • You’re excited about being part of a team of diverse perspectives and backgrounds that believe in tackling challenges, growing hand in hand, and winning together.

Our compensation reflects the cost of labor across several US geographic markets. The base pay for this position ranges from $165,000/year in our lowest geographic market up to $220,000/year in our highest geographic market. Pay is based on a number of factors including market location and may vary depending on job-related knowledge, skills, and experience.  

 

In-Person