Good morning, Conquer Voice users. Our team has identified an issue where some inbound calls to Conquer Voice DIDs are failing to complete. Audio on some calls may also be impacted. Users may also not be able to connect or stay connected. Our telecom team is actively investigating this now.
UPDATE 6:40am Pacific Time: Based on current information, dropped calls have also been identified as a symptom. Resolving this telecom issue is our top priority, and all available engineering resources are working to resolve it as quickly as possible.
UPDATE 7:05am Pacific Time: Conquer Engineering continues to dedicate all resources to troubleshooting and resolution. As soon as more information is available, it will be posted here.
UPDATE 7:20am Pacific Time: Conquer Engineering has identified a potential solution and is working to deploy it now.
UPDATE 7:45am Pacific Time: Conquer Engineering has completed deploying the solution. They are continuing to monitor the system closely. Some users may be in a stuck state from the disruption and need to be cleared manually by Conquer Engineering or Support.
UPDATE 7:55am Pacific Time: Conquer Engineering observes that many users have been able to connect and place calls successfully, but they are continuing to work to ensure full system stability. Conquer Support is manually clearing users who are in a stuck state as quickly as possible.
ROOT CAUSE ANALYSIS
At 6:00am PDT Conquer Support was alerted by tickets reporting issues accepting inbound calls. Engineering was engaged immediately, and at 6:31am PDT the primary NFS file server went completely down. This left the other nodes in the redundant pairs in conflict, which prevented the primary file server from rejoining the cluster properly.
This led to mount failures on our servers, causing Inbound Greetings, Music on Hold, and VM Drop-ins to fail. Additionally, when we tried to play music on hold, the A leg of the agent's connection would be dropped due to a timeout because the music on hold file could not be played. WebRTC connections remained active.
Conquer Engineering’s disaster recovery plan started at 6:10am PDT, with one team spinning up a new cluster on a cloned drive and a parallel team working to restore the drive.
At 7:28am PDT the drive was restored, servers remounted, and normal functionality began to return.
The root cause of this issue was discovered to be a minor error, which occurred at 12:06am PDT. The initial error caused the syncing process between our NFS file servers to fail. Eventually, this error spread to the three redundant pairs of servers. The error was not flagged for our engineering team because the alerting/monitoring system was configured to throw an error only when the file system failed or was no longer serving files.
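To illustrate the monitoring gap described above, here is a minimal sketch of a health check that reports a stalled sync even while files are still being served. It is not Conquer's actual monitoring configuration: the function name, the status labels, and the 300-second threshold are all assumptions chosen for the example.

```python
# Illustrative sketch only; names and thresholds are hypothetical.

def evaluate_nfs_health(
    serving: bool,
    seconds_since_last_sync: float,
    max_sync_lag_seconds: float = 300.0,  # assumed threshold, not a real setting
) -> str:
    """Classify file server health from two signals instead of one."""
    if not serving:
        # The only condition the original alerting covered.
        return "critical"
    if seconds_since_last_sync > max_sync_lag_seconds:
        # Files are still served, but replication has fallen behind.
        return "warning"
    return "ok"
```

With a check shaped like this, a sync that stopped shortly after midnight would raise a warning well before the file server stopped serving files.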
Moving forward, Conquer Engineering is going to revamp our disaster recovery plan for the NFS redundant cluster to make sure we can recover from this failure more swiftly, should it occur again. Quarterly drills will also be run to make sure that the plan stays up to date and works as intended. Additionally, development will begin immediately to ensure this type of error does not interfere with A leg connections and is limited only to Voicemail Recordings and Voicemail drop-ins.
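The isolation goal above can be sketched as a bounded audio load: if the hold music cannot be read in time, the call stays up without music instead of being dropped. This is an assumption-laden illustration, not the planned implementation. The names `load_hold_audio` and `hold_call`, the state labels, and the 2-second default timeout are hypothetical.

```python
import concurrent.futures
from typing import Callable, Optional

# Illustrative sketch only; all names and values are hypothetical.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def load_hold_audio(
    loader: Callable[[], bytes], timeout_seconds: float
) -> Optional[bytes]:
    """Return audio bytes, or None if the loader fails or takes too long."""
    future = _pool.submit(loader)
    try:
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        # The file share may be hung; give up on the audio, not on the call.
        future.cancel()
        return None
    except OSError:
        # For example, a stale or missing mount.
        return None


def hold_call(loader: Callable[[], bytes], timeout_seconds: float = 2.0) -> str:
    """Keep the call on hold whether or not the music file is available."""
    audio = load_hold_audio(loader, timeout_seconds)
    return "playing_music" if audio is not None else "holding_silent"
```

The design choice here is to treat the audio file as optional: a read that hangs on the file share costs the caller the music, while the A leg itself is never tied to the outcome of the file read.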