On 14 May 2024, an unexpected outage occurred in our cluster infrastructure, resulting in approximately 1 hour and 30 minutes of downtime. The outage began when the secondary file server attempted to promote itself to primary status during a cluster reconfiguration. This unexpected event led to synchronisation issues and a loop of failed promotion attempts, eventually requiring manual intervention to restore normal operation.
Timeline:
At approximately 06:25 UTC, the secondary file server was intentionally shut down as part of a planned maintenance activity by our IaaS provider.
At around 06:38 UTC, the secondary file server came back online following the planned shutdown.
As it rejoined the cluster, the secondary file server automatically attempted to promote itself to primary as part of our internal recovery strategy.
During the resulting reconfiguration, part of the file synchronisation failed unexpectedly: it timed out because of the size of the volume, leaving the cluster in a loop of failed promotion attempts (a sketch of a size-aware timeout follows this timeline).
At approximately 08:18 UTC, our development team completed a manual recovery to bring all file servers back online, and services returned to normal operation at this time.
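The timeout that failed during synchronisation was a fixed limit that a sufficiently large volume can always exceed. The following is a minimal sketch of the size-aware alternative, assuming a hypothetical rsync-based sync; the paths, transfer-rate figure, and helper names are illustrative assumptions, not our actual tooling:

```python
import subprocess

# Hedged sketch, not our actual tooling: an rsync-style sync whose timeout
# scales with the volume size instead of using a fixed limit.

MIN_TIMEOUT_S = 300                           # floor for small volumes
ASSUMED_RATE_BYTES_PER_S = 50 * 1024 * 1024   # conservative transfer-rate assumption


def sync_timeout_for(volume_bytes: int) -> int:
    """Derive the timeout from the volume size, with 2x headroom."""
    return max(MIN_TIMEOUT_S, 2 * volume_bytes // ASSUMED_RATE_BYTES_PER_S)


def sync_volume(src: str, dst: str, volume_bytes: int) -> None:
    # subprocess.run raises TimeoutExpired if the sync exceeds the limit
    # and CalledProcessError if rsync exits non-zero.
    subprocess.run(
        ["rsync", "-a", "--delete", src, dst],
        timeout=sync_timeout_for(volume_bytes),
        check=True,
    )
```

The key design point is that the timeout is derived from the data being moved rather than chosen once and forgotten, so growth in volume size does not silently turn routine failovers into failures.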
Root Cause Analysis:
The outage stemmed from the automatic promotion of the secondary file server to primary status following its restart after the planned shutdown. This unexpected configuration change led to synchronisation issues and a loop of failed promotion attempts, prolonging the downtime until manual intervention.
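The loop persisted because nothing bounded the retries: each failed promotion was simply attempted again. A minimal sketch of the kind of guard that would break such a loop, assuming hypothetical `try_promote` and `alert_operators` hooks (our real cluster hooks differ):

```python
import time

# Sketch of a guard around automatic promotion: cap the attempts and back
# off between them instead of retrying in a tight loop.

MAX_ATTEMPTS = 3
BASE_BACKOFF_S = 30


def promote_with_guard(try_promote, alert_operators) -> bool:
    for attempt in range(1, MAX_ATTEMPTS + 1):
        if try_promote():
            return True
        if attempt < MAX_ATTEMPTS:
            # Exponential backoff between attempts rather than an immediate retry.
            time.sleep(BASE_BACKOFF_S * 2 ** (attempt - 1))
    # After repeated failures, stop looping and page a human; manual
    # intervention is effectively what ended the 14 May incident.
    alert_operators(f"promotion failed after {MAX_ATTEMPTS} attempts")
    return False
```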
Mitigation Steps:
To prevent similar incidents in the future, we will implement the following measures:
Add a guard to the automatic promotion logic so that repeated failed promotions back off and alert operators rather than looping, as sketched above.
Scale file-synchronisation timeouts with volume size so that large volumes no longer time out during failover.
Enhance monitoring and alerting so that cluster role changes and reconfiguration events are detected promptly (a sketch follows these measures).
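As one concrete shape for the monitoring measure, the sketch below flags a likely promotion loop by counting primary/secondary role changes inside a sliding window. The class, window, and threshold values are illustrative assumptions, not a description of our monitoring stack:

```python
import time
from collections import deque

WINDOW_S = 600        # look back over the last ten minutes
MAX_ROLE_CHANGES = 3  # more changes than this in the window suggests a loop


class RoleFlapDetector:
    def __init__(self) -> None:
        self._changes: deque[float] = deque()

    def record_role_change(self, now: float | None = None) -> bool:
        """Record a role change; return True if the rate suggests a loop."""
        now = time.time() if now is None else now
        self._changes.append(now)
        # Discard changes that have aged out of the window.
        while self._changes and now - self._changes[0] > WINDOW_S:
            self._changes.popleft()
        return len(self._changes) > MAX_ROLE_CHANGES
```

An alert wired to this detector would have paged operators within minutes of the promotion loop starting, rather than after nearly two hours.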
Conclusion:
The outage was an unfortunate consequence of an unexpected cluster reconfiguration event. Through the proactive measures and enhanced monitoring outlined above, we aim to minimise the risk of similar incidents in the future and ensure the stability and reliability of our infrastructure.