Good afternoon, Conquer Voice users. Our developers have identified an issue that is blocking users from connecting to the system. Two separate, unrelated services experienced issues at the same time we were pushing our latest update: Salesforce reported a service incident, and we identified an issue with our telephony provider as well. All hands are on deck investigating.
All connect attempts made during the incident were queued. Users may receive several calls as the queued attempts are processed.
Updates and an RCA will be posted as soon as they are available.
UPDATE 1:35pm Pacific Time: Our developers have applied a fix and are seeing performance improve. Slowness may persist for some users on their first connection attempt, but most should be able to connect normally after refreshing the page; users who are currently connected should also refresh and reconnect. We are continuing to test and monitor the situation closely.
UPDATE 2:20pm Pacific Time: Our developers are seeing a new batch of errors and are working to address them now.
UPDATE 2:25pm Pacific Time: The increase in errors noted above has cleared. Development remains all hands on deck and is seeing the service stabilize.
RCA: On July 20, 2021 at 12:01p PT, the Conquer team ran a deployment across multiple services. While this deployment fully passed pre-deployment testing, a third-party code library was updated between the completion of testing and the full deployment to production servers. This change to the library caused TCP database connections in the Campaign worker services to become stale and messages to be delayed, ultimately slowing the Campaign service.
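For context on the mechanism: stale database connections of this kind are commonly guarded against at the connection-pool level. The sketch below is a generic Python/SQLAlchemy illustration, with placeholder names and settings that are assumptions rather than Conquer's actual configuration; it shows how a pool can test and recycle connections so that a dead TCP session is replaced before a worker tries to use it.

```python
# Illustrative only: a worker's database pool configured to detect and
# replace stale TCP connections. DSN and tuning values are placeholders.
from sqlalchemy import create_engine, text

engine = create_engine(
    "postgresql://worker:secret@db-host/campaigns",  # placeholder DSN
    pool_pre_ping=True,   # issue a lightweight ping before handing out a connection
    pool_recycle=1800,    # retire connections older than 30 minutes
    pool_size=10,
    max_overflow=20,
)

with engine.connect() as conn:
    conn.execute(text("SELECT 1"))  # a stale connection is transparently replaced
```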
In addition, as many of you are likely aware from the notification emails from Salesforce, SFDC had intermittent service disruptions beginning on 7/20 at 11:00a PT and lasting until 7/21 at 5:15a PT. These service events affected all users during the same period, including the Conquer team internally, making it more challenging to identify the specific source of some of the errors and reports we were receiving.
The Conquer engineering team immediately caught the slowdown through automated monitoring and began a full rollback, which was completed by 12:07p PT. However, the stale database connections were already in the stack, and the resulting slowness, combined with the SFDC incidents, led reps to flood the system with connection requests, call requests, and the like. This caused a snowball effect that slowed other system components as well. We quickly scaled servers and services, but slowness persisted for some reps.
Once the root cause of the slowness was identified (stale database connections) and separated from any SFDC-specific disruptions, we immediately flushed the system, purging messaging services and clearing the stale connections. This resolved the general slowness, and normal operation was restored. We then turned our attention to the cause of the stale connections and found the change to the library. Since that discovery, the Conquer team has isolated that dependency to prevent it from impacting future releases.
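As one hedged illustration of what isolating a dependency can look like (the package name, version, and mechanism below are hypothetical, not Conquer's actual tooling), a deployment can refuse to proceed if the installed version of a library differs from the version that passed pre-deployment testing:

```python
# Hypothetical deploy-time guard: abort if a dependency's installed version
# differs from the version that passed pre-deployment testing.
# The package name and version are placeholders, not the real library involved.
from importlib.metadata import version, PackageNotFoundError

TESTED_VERSIONS = {"example-db-driver": "2.4.1"}  # recorded when testing completed

def verify_pinned_dependencies() -> None:
    for package, expected in TESTED_VERSIONS.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            raise RuntimeError(f"{package} is not installed; aborting deploy")
        if installed != expected:
            raise RuntimeError(
                f"{package} is {installed}, but {expected} was tested; aborting deploy"
            )

if __name__ == "__main__":
    verify_pinned_dependencies()
    print("All dependencies match the tested versions.")
```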
Timeline:
7/20 11:00a PT to 7/21 5:15a PT: ongoing SFDC service issues.
12:01p PT: Conquer deployment.
12:07p PT: full rollback completed.
12:08p PT to approximately 1:00p PT: message services continued to back up.
Approximately 1:00p PT to 1:30p PT: critical slowness.
1:30p PT: root cause identified; services were manually purged on an ongoing basis through 2:45p PT as we worked to understand how the stale connections were introduced.
1:33p PT: error rates returned to acceptable levels.
2:03p PT: a full system purge led to a slightly elevated error rate.
2:17p PT: all systems returned to normal working levels.