Maropost Commerce Disruption
Incident Report for Neto by Maropost
Postmortem

On 14 May 2024, an unexpected outage occurred in our cluster infrastructure, resulting in approximately 1 hour and 30 minutes of downtime. The outage began when the secondary file server attempted to promote itself to primary status during a cluster reconfiguration process. This unexpected event led to synchronisation issues and a loop of attempted promotions, and manual intervention was eventually needed to restore normal operation.

Timeline:

At approximately 06:25 UTC, a secondary file server was intentionally shut down as part of a planned maintenance activity by our IaaS provider.

Around 06:38 UTC, the secondary file server came back online following the planned shutdown.

As it came back online, the secondary file server was automatically activated as part of our internal recovery strategy.
During this process, part of the file synchronisation failed unexpectedly, and a timeout occurred because of the volume size.

At approximately 08:18 UTC, our development team completed a manual recovery that brought all the file servers back online, and services returned to normal operation.
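
The timeline above is given in UTC, while the status updates at the end of this report are stamped in AEST. As a reading aid only, the following minimal sketch converts the three timeline entries to AEST, assuming the fixed UTC+10 offset. The event labels and the helper name utc_to_aest are illustrative and not part of our tooling.

```python
from datetime import datetime, timedelta, timezone

# AEST is a fixed UTC+10 offset (no daylight saving adjustment).
AEST = timezone(timedelta(hours=10), name="AEST")

# Timeline entries from this report, in UTC on 14 May 2024.
# The labels are shorthand for illustration.
TIMELINE_UTC = {
    "secondary file server shut down": "06:25",
    "secondary file server back online": "06:38",
    "manual recovery completed": "08:18",
}


def utc_to_aest(hhmm: str) -> str:
    """Convert an HH:MM UTC time on 14 May 2024 to HH:MM AEST."""
    utc_time = datetime.strptime(f"2024-05-14 {hhmm}", "%Y-%m-%d %H:%M")
    utc_time = utc_time.replace(tzinfo=timezone.utc)
    return utc_time.astimezone(AEST).strftime("%H:%M")


for event, hhmm in TIMELINE_UTC.items():
    print(f"{hhmm} UTC = {utc_to_aest(hhmm)} AEST ({event})")
```

Under that offset, the three entries correspond to 16:25, 16:38 and 18:18 AEST.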

Root Cause Analysis:

The outage stemmed from the automatic change of the secondary and primary file server statuses when the secondary file server restarted after the planned shutdown. This unexpected change in configuration led to synchronisation issues and a loop of failed promotions, prolonging the downtime.

Mitigation Steps:

To prevent similar incidents in the future, we will implement the following measures:

  • Implement stricter control mechanisms to prevent automatic promotions without manual verification.
  • Enhance monitoring systems to promptly detect and alert when cluster configurations deviate from the expected state.
  • Conduct thorough testing and simulation exercises to validate the failover and synchronisation processes under various scenarios.

Conclusion:

The outage was an unfortunate consequence of an unexpected cluster reconfiguration event. Through proactive measures and enhanced monitoring, we aim to minimise the risk of similar incidents in the future, ensuring the stability and reliability of our infrastructure.

Posted May 15, 2024 - 18:22 AEST

Resolved
The issue has been resolved and services are operating as normal.
Posted May 14, 2024 - 20:07 AEST
Update
Services have returned to normal. We will continue to monitor the situation.
Posted May 14, 2024 - 19:01 AEST
Monitoring
We are receiving reports of services returning to normal. We are continuing to monitor the situation.
Posted May 14, 2024 - 18:25 AEST
Update
Our Development Team is actively working to resolve the issue as soon as possible.
Posted May 14, 2024 - 18:15 AEST
Identified
We have identified the issue. Our Development Team is working to resolve it as soon as possible.
Posted May 14, 2024 - 17:36 AEST
Investigating
We are receiving reports of sites showing "We will be back shortly". Our Development Team is investigating now.
Posted May 14, 2024 - 17:09 AEST
This incident affected: Merchant Store Fronts, Control Panel, Point Of Sale Registers, and Batch Processing.