Failures on a few newly scaled-up replicas on A100s, H100s, and L4s
Incident Report for Baseten
Resolved
All bad nodes finished cycling about 20 minutes ago (around 14:02 PDT). All replicas, old and new, on A100s, H100s, and L4s are working normally.
Posted Jun 25, 2024 - 14:22 PDT
Monitoring
We're monitoring the fix.
Posted Jun 25, 2024 - 14:10 PDT
Identified
We’ve identified a problem with the GPU drivers on some of our nodes. We're cycling the affected nodes now.
Posted Jun 25, 2024 - 13:42 PDT
Investigating
Newly scaled-up replicas on some A100s, H100s, and L4s are failing with an NVIDIA driver error. We're investigating.
Posted Jun 25, 2024 - 13:03 PDT
This incident affected: Model Inference.