Good afternoon, Conquer Voice users. Conquer Engineering has identified a provider issue that appears to be causing loading issues and dropped calls for some Conquer Voice users. All available resources are working with the provider to address it. More details will be posted as they become available.
UPDATE 2:15pm PT: Conquer Engineering has engaged the provider and they are working to provide a solution as quickly as possible.
UPDATE 2:25pm PT: Conquer Engineering has confirmed that Email and Cadence services are also impacted. The title of this post has been updated accordingly.
UPDATE 3:00pm PT: Conquer Engineering has confirmed that services are coming back online. An RCA will be added to this post once it is available.
ROOT CAUSE ANALYSIS:
On April 11, 2024, at 1:32 PM PT, Conquer’s MongoDB experienced a total failure, leading to a widespread service disruption across the Conquer Voice and Email platforms. Service was restored by 2:32 PM PT after the database recovery, and by 2:50 PM PT, all functionality was confirmed operational.
Conquer’s MongoDB is hosted across two different regions by a managed provider to maintain high availability. Normally, traffic would automatically reroute to the secondary database during an outage of the primary database. However, an erroneous manual configuration change made by our cloud provider disrupted routing to both databases at the same time, so neither the primary nor the secondary was reachable.
Although Conquer maintains daily backups of the database externally, deploying a new instance would have taken approximately three hours. Instead, we collaborated with the cloud provider to repair both the primary and secondary databases to minimize downtime.
Acknowledging the severity of a one-hour downtime, Conquer is taking steps to enhance system resilience. We are establishing a tertiary instance with a different cloud provider to receive shadow updates from the primary database. This setup will ensure an immediate failover to the tertiary database should both primary and secondary databases fail, reducing potential downtime to less than 30 seconds. Additionally, we are shortening our live backup processes so that, in the event of a complete failure of all three instances, we can spin up new instances in less than 15 minutes.
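
For readers who want a concrete picture of the failover order described above, here is a minimal sketch in Python. It is an illustration only, not Conquer's actual implementation: the instance names, the `pick_instance` function, and the health-check callback are all hypothetical, and a real deployment would rely on the database driver and provider tooling rather than hand-written selection logic.

```python
from typing import Callable, Optional, Sequence


def pick_instance(
    instances: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first healthy instance in priority order, or None if all are down."""
    for name in instances:
        if is_healthy(name):
            return name
    return None


# Hypothetical instance names, listed in failover priority order.
PRIORITY = ["primary", "secondary", "tertiary"]

# Simulate the scenario from this incident: primary and secondary both unreachable.
unreachable = {"primary", "secondary"}
target = pick_instance(PRIORITY, lambda name: name not in unreachable)
print(target)
```

With both the primary and secondary marked unreachable, the sketch selects the tertiary instance. If all three were down, it would return `None`, which corresponds to the case where new instances have to be restored from backup.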