Failures on a few newly scaled-up replicas on A100s, H100s, and L4s

Incident Report for Baseten

Resolved

All bad nodes were cycled as of 20 minutes ago. All replicas, old and new, on A100s, H100s, and L4s are working well.
Posted Jun 25, 2024 - 14:22 PDT

Monitoring

We're monitoring the fix.
Posted Jun 25, 2024 - 14:10 PDT

Identified

We've identified the problem: the GPU drivers on some of our nodes. We're cycling the bad nodes now.
Posted Jun 25, 2024 - 13:42 PDT

Investigating

Newly scaled-up replicas on some A100s, H100s, and L4s are failing with an NVIDIA driver error. We're investigating.
Posted Jun 25, 2024 - 13:03 PDT
This incident affected: Model Inference.