Maropost Commerce Disruption

Incident Report for Neto by Maropost

Postmortem

On 14 May 2024, an unexpected outage occurred in our cluster infrastructure, resulting in approximately 1 hour and 30 minutes of downtime. The outage began when the secondary file server attempted to promote itself to primary status during a cluster reconfiguration process. This unexpected event led to synchronisation issues and a loop of attempted promotions, eventually requiring manual intervention to restore normal operation.

Timeline:

At approximately 06:25 UTC, the secondary file server was intentionally shut down as part of a planned maintenance activity by our IaaS provider.

At around 06:38 UTC, the secondary file server came back online following the planned shutdown.

During this process, the secondary file server was automatically activated as part of our internal recovery strategy.
However, part of the file synchronisation failed unexpectedly, timing out due to the size of the volume.

At approximately 08:18 UTC, our development team completed a manual recovery to bring all file servers back online, and services returned to normal operation.

Root Cause Analysis:

The outage stemmed from the secondary file server automatically attempting to promote itself to primary status after restarting from the planned shutdown. This unexpected change in cluster configuration led to synchronisation issues and a loop of failed promotions, prolonging the downtime.
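For context, the failing behaviour was a secondary node repeatedly attempting to promote itself without any external verification. The sketch below is purely illustrative (the ClusterNode class, approval file path, and retry limit are hypothetical and not part of our actual tooling); it shows one way a promotion attempt can be gated behind an explicit operator approval step and capped so it cannot loop indefinitely.

    import os
    import time


    class ClusterNode:
        """Illustrative cluster node with a role that can be promoted."""

        def __init__(self, name, role):
            self.name = name
            self.role = role  # "primary" or "secondary"

        def promotion_approved(self, approval_file):
            # Promotion proceeds only if an operator has created the approval
            # marker, e.g. after confirming file synchronisation has completed.
            return os.path.exists(approval_file)

        def try_promote(self, approval_file, max_attempts=3, wait_seconds=5):
            for attempt in range(1, max_attempts + 1):
                if self.promotion_approved(approval_file):
                    self.role = "primary"
                    print(f"{self.name}: promoted to primary on attempt {attempt}")
                    return True
                print(f"{self.name}: promotion blocked, awaiting manual verification")
                time.sleep(wait_seconds)
            # Stop after a fixed number of attempts instead of looping indefinitely.
            print(f"{self.name}: promotion abandoned after {max_attempts} attempts")
            return False


    if __name__ == "__main__":
        node = ClusterNode("fileserver-02", role="secondary")
        node.try_promote("/var/run/cluster/promotion-approved")

This kind of verification gate is what the first mitigation step below refers to.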

Mitigation Steps:

To prevent similar incidents in the future, we will implement the following measures:

  • Implement stricter control mechanisms to prevent automatic promotions without manual verification.
  • Enhance monitoring systems to promptly detect and alert when cluster configurations deviate from the expected state (a rough sketch of such a check follows this list).
  • Conduct thorough testing and simulation exercises to validate the failover and synchronisation processes under various scenarios.
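
As a rough illustration of the monitoring item above, the sketch below compares the cluster's observed role assignments against an expected configuration and raises an alert when they differ. The expected_roles mapping, get_observed_roles() and send_alert() functions are placeholders for illustration only, not our real cluster or alerting interfaces.

    from typing import Dict


    def get_observed_roles() -> Dict[str, str]:
        # Placeholder: a real check would query the cluster manager for the
        # current role of each file server.
        return {"fileserver-01": "secondary", "fileserver-02": "primary"}


    def send_alert(message: str) -> None:
        # Placeholder for an actual paging or alerting integration.
        print(f"ALERT: {message}")


    def check_cluster_state(expected_roles: Dict[str, str]) -> bool:
        """Return True if every node holds its expected role; otherwise alert."""
        observed = get_observed_roles()
        deviations = [
            f"{node}: expected {role}, observed {observed.get(node, 'missing')}"
            for node, role in expected_roles.items()
            if observed.get(node) != role
        ]
        if deviations:
            send_alert("Cluster configuration deviates from expected state: "
                       + "; ".join(deviations))
            return False
        return True


    if __name__ == "__main__":
        check_cluster_state({"fileserver-01": "primary", "fileserver-02": "secondary"})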

Conclusion:

The outage was an unfortunate consequence of an unexpected cluster reconfiguration event. Through proactive measures and enhanced monitoring, we aim to minimise the risk of similar incidents in the future, ensuring the stability and reliability of our infrastructure.

Posted May 15, 2024 - 18:22 AEST

Resolved

The issue has been resolved and services are operating as normal.
Posted May 14, 2024 - 20:07 AEST

Update

Services have returned to normal. We will continue to monitor the situation.
Posted May 14, 2024 - 19:01 AEST

Monitoring

We are receiving reports of services returning to normal. We are continuing to monitor the situation.
Posted May 14, 2024 - 18:25 AEST

Update

Our Development Team is actively working to resolve the issue as soon as possible.
Posted May 14, 2024 - 18:15 AEST

Identified

We have identified the issue, and our Development Team is working to resolve it as soon as possible.
Posted May 14, 2024 - 17:36 AEST

Investigating

We are receiving reports of sites showing "We will be back shortly". Our Development Team is investigating now.
Posted May 14, 2024 - 17:09 AEST
This incident affected: Merchant Store Fronts, Control Panel, Point Of Sale Registers, and Batch Processing.