What happened
Starting Friday, 08.03.24, at around 8am, we had issues delivering webhook events on the Partner API. We had minor issues during the week, which we were able to address by improving our configuration.
Over the weekend we learned that the number of pending webhooks was growing. On Monday morning we were able to deploy a fix. However, the number of pending webhooks waiting for delivery was too large for our system/configuration to handle.
Why did it happen
A huge number of charge point location updates via OCPI resulted in hundreds of thousands of changes (webhook events). This led to a chain reaction:
- Bad configuration didn’t allow our system to process this number of messages
- A missing (or: incorrect) index in our table surfaced due to the volume of messages
Our alerting on webhooks monitored failed webhooks only, not pending webhooks
Impact
- Webhooks were delivered rarely or not at all between 08.03.24 8am and 11.03.24 2pm
- Webhook events from between 08.03.24 and 10.03.24 11am that were not delivered were cleared (deleted) by our automation (we only keep webhook events for 24 hours)
Mitigation
- Synchronize relevant entities using the Partner API to ensure your view of our system is up to date, especially any pending charges or wallet transactions.
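As a rough illustration of such a synchronization, the sketch below compares locally stored entities with a fresh snapshot and reports what needs attention. It is a minimal sketch under assumptions: how the snapshot is fetched from the Partner API is deliberately left out, and the function name `reconcile`, the record shape, and the `state` field are hypothetical, not part of the Partner API.

```python
from typing import Dict, List, Tuple

# Hypothetical record shape: a plain dict of fields, keyed by entity id.
Record = Dict[str, object]


def reconcile(
    local: Dict[str, Record], remote: Dict[str, Record]
) -> Tuple[List[str], List[str], List[str]]:
    """Compare local copies with a fresh remote snapshot.

    Returns three sorted id lists:
    - ids present remotely but missing locally (to create),
    - ids present on both sides with differing fields (to update),
    - ids present locally but absent from the snapshot (to review).
    """
    to_create = sorted(k for k in remote if k not in local)
    to_update = sorted(k for k in remote if k in local and local[k] != remote[k])
    to_review = sorted(k for k in local if k not in remote)
    return to_create, to_update, to_review


# Example with made-up charge ids and states.
local_charges = {"c1": {"state": "pending"}, "c2": {"state": "completed"}}
remote_charges = {"c1": {"state": "completed"}, "c3": {"state": "pending"}}
created, updated, review = reconcile(local_charges, remote_charges)
```

Comparing full snapshots rather than replaying events is the safer choice here, since the undelivered events for part of the window no longer exist.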
Action Items (Monta)
- We have used this “load-test” to significantly improve the processing of webhooks in our system
- We have also set up monitoring on pending events
- Going forward, our clean-up job no longer deletes any items that are in state “Pending”
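The last two action items can be illustrated with a short sketch. This is not our actual implementation: the `WebhookEvent` type, the state names other than “Pending”, and the alert threshold are assumptions made for the example. Only the 24-hour retention window comes from the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List

RETENTION = timedelta(hours=24)    # retention window stated above
PENDING_ALERT_THRESHOLD = 10_000   # hypothetical value, chosen for illustration


@dataclass
class WebhookEvent:
    id: int
    state: str             # "Pending", "Delivered" or "Failed" (assumed names)
    created_at: datetime


def select_for_cleanup(events: List[WebhookEvent], now: datetime) -> List[WebhookEvent]:
    """Return events older than the retention window, excluding Pending ones."""
    cutoff = now - RETENTION
    return [e for e in events if e.created_at < cutoff and e.state != "Pending"]


def pending_backlog_alert(
    events: List[WebhookEvent], threshold: int = PENDING_ALERT_THRESHOLD
) -> bool:
    """Return True when the number of Pending events exceeds the threshold."""
    pending = sum(1 for e in events if e.state == "Pending")
    return pending > threshold


# Example: an old Pending event survives clean-up, an old Delivered one does not.
now = datetime(2024, 3, 11, 14, 0, tzinfo=timezone.utc)
events = [
    WebhookEvent(1, "Pending", now - timedelta(hours=48)),
    WebhookEvent(2, "Delivered", now - timedelta(hours=48)),
    WebhookEvent(3, "Delivered", now - timedelta(hours=1)),
]
expired = select_for_cleanup(events, now)
```

Alerting on the size of the pending backlog, in addition to failures, catches the situation where events are neither delivered nor failing but simply waiting.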