Webhook Events Stuck
Incident Report for Monta
Postmortem

What happened

Starting Friday, 08.03.24, at around 8am, we had issues delivering webhook events on the Partner API. We had also seen minor issues during the week, which we were able to address by improving our configuration.

Over the weekend we learned that the number of pending webhooks was growing. On Monday morning we were able to deploy a fix. However, the number of pending webhooks waiting for delivery was too large for our system and configuration to handle.

Why did it happen

  • A huge number of charge point location updates via OCPI resulted in hundreds of thousands of changes (webhook events). This led to a chain reaction:

    • Bad configuration didn’t allow our system to process this volume of messages
    • A missing (or incorrect) index on our table surfaced due to the volume of messages
  • Our alerting on webhooks monitors failed webhooks – not pending webhooks
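
To illustrate the alerting gap described above, here is a minimal sketch of a check that looks at pending events rather than failed ones. It is not Monta's actual monitoring: the WebhookEvent type, the backlog_alerts function, the state names other than "Pending", and both thresholds are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class WebhookEvent:
    id: int
    state: str            # "Pending" is from the report; other states are assumed
    created_at: datetime


def backlog_alerts(events, now, max_pending=1000, max_age=timedelta(minutes=15)):
    """Return alert messages about the pending backlog (empty list = healthy).

    Thresholds are illustrative assumptions, not values from the report.
    """
    pending = [e for e in events if e.state == "Pending"]
    alerts = []
    # Alert on the size of the backlog.
    if len(pending) > max_pending:
        alerts.append(f"pending count {len(pending)} exceeds {max_pending}")
    # Alert on the age of the oldest pending event.
    if pending:
        oldest = min(e.created_at for e in pending)
        if now - oldest > max_age:
            alerts.append(f"oldest pending event is {now - oldest} old")
    return alerts
```

Checking the age of the oldest pending event in addition to the count catches a slowly growing backlog even when no delivery ever fails outright.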

Impact

  • Webhooks were delivered rarely or not at all between 08.03.24 8am and 11.03.24 2pm
  • Undelivered webhook events from between 08.03.24 and 10.03.24 11am were cleared (deleted) by our automation (we only keep webhook events for 24 hours)

Mitigation

  • Synchronize relevant entities using the Partner API to ensure your view of our system is up to date, especially any pending charges or wallet transactions.
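
As a rough sketch of what such a synchronization could look like on the partner side, the function below compares locally stored records with records freshly fetched from the Partner API. How the remote records are fetched is deliberately left out, since the endpoints are not described in this report; the reconcile function and the record shapes are hypothetical.

```python
def reconcile(local, remote):
    """Compare local copies against freshly fetched remote records.

    Both arguments map entity id -> record (a dict). Returns three sorted
    lists of ids: to create locally, to update locally, to delete locally.
    """
    # Present remotely but unknown locally (e.g. a missed "created" event).
    to_create = sorted(remote.keys() - local.keys())
    # Known locally but gone remotely (e.g. a missed "deleted" event).
    to_delete = sorted(local.keys() - remote.keys())
    # Present on both sides but with differing content (a missed update).
    to_update = sorted(
        k for k in remote.keys() & local.keys() if remote[k] != local[k]
    )
    return to_create, to_update, to_delete


# Example with made-up charge records:
local = {1: {"state": "pending"}, 2: {"state": "completed"}}
remote = {1: {"state": "completed"}, 2: {"state": "completed"}}
print(reconcile(local, remote))  # charge 1 needs an update
```

The delete list should only be acted on when the remote side was fetched completely; a partial fetch would make existing records look deleted.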

Action Items (Monta)

  • We have used this “load test” to significantly improve the processing of webhooks in our system
  • We have set up monitoring on pending events as well
  • Going forward, our clean-up job does not delete any items that are in state “Pending”
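
The clean-up rule in the last item can be sketched as follows. This is an illustration under assumptions, not the actual job: the events_to_delete function and the dict-based event shape are hypothetical, while the 24-hour retention and the “Pending” state come from this report.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(hours=24)  # retention period stated in the report


def events_to_delete(events, now: datetime):
    """Select events older than the retention period, skipping pending ones.

    Each event is a dict with "state" and "created_at" keys (assumed shape).
    """
    cutoff = now - RETENTION
    return [
        e for e in events
        if e["created_at"] < cutoff and e["state"] != "Pending"
    ]
```

With this filter, an undelivered event stays in the table past 24 hours until it has been sent, so a delivery backlog can no longer lead to data loss through clean-up.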
Posted Mar 11, 2024 - 15:35 UTC

Resolved
This issue has been resolved. Pending webhooks have been processed and webhooks are operational again.
Posted Mar 11, 2024 - 15:24 UTC
Update
We have deployed a fix which picks up pending events and starts sending them (oldest first). It is running quite slowly; we are optimizing the system in parallel.
Posted Mar 11, 2024 - 12:33 UTC
Update
A fix will go live today. This should send all pending events and avoid this issue in the future.
Posted Mar 11, 2024 - 08:39 UTC
Identified
The team has identified the issue and is working on a fix to improve the situation.
Posted Mar 08, 2024 - 17:45 UTC
Investigating
We have seen a lot of events not being delivered via webhooks. The team is on it.
Posted Mar 08, 2024 - 08:37 UTC
This incident affected: Partner API.