Resolved -
We apologize for the delay in updating this issue, but we have good news!
After resolving the RabbitMQ outage yesterday, we started looking at all recovery options for any Tasks that did not run during the outage.
The first group was Tasks stuck in pending (about 11k Tasks were stuck in this state), and the second was webhooks (from instant Zaps) that were completely missing (about 139k of these).
After verifying and shipping a few performance improvements to our recovery mechanisms, we're happy to say that we've completed recovery of the Tasks impacted during yesterday's 15-minute outage (between 2017-06-02 19:07:00 UTC and 2017-06-02 19:22:00 UTC). If you are still missing Tasks from that window, please contact support via
[email protected].
We'll be making some changes to how our RabbitMQ cluster behaves, as well as some further speed improvements to our recovery mechanisms, both to reduce the impact of future outages and to speed up our recovery of any affected Tasks.
Jun 3, 11:53 PDT
Update -
Everything looks resolved. We're doing some final investigation into possible lost Tasks and recovery options.
The hard outage lasted about 15 minutes, and the ramp-up to recovery lasted about an hour.
Jun 2, 13:58 PDT
Monitoring -
Things look stable again. We're ramping Zap speed back up to normal, and everything should be fine shortly. We'll have more information on possibly lost Tasks and recovery efforts once everything is running smoothly again.
Jun 2, 12:58 PDT
Identified -
The issue has been identified and a fix is being implemented.
Jun 2, 12:32 PDT
Update -
We have identified the outage. It is isolated to a single RabbitMQ node responsible for queueing tasks. We've temporarily paused Tasks while we resolve the outage, and are working to bring back all Tasks. More info soon.
Jun 2, 12:32 PDT
Investigating -
Tasks are not running while we look into a possible queueing outage. More info soon.
Jun 2, 12:17 PDT