How Uber Scaled Live Chat Support Share by 3500%
An in-depth look at Uber's struggles with live chat customer support and how the company overcame them with new infrastructure
I am an avid reader, but not one who reads books.
I mostly like to read raw accounts of the problems that tech companies are facing today and how they are innovating to overcome those challenges.
And I've learned that the most successful companies are the ones that can look at a problem from every angle and find the solution that others might have overlooked.
That's exactly what Uber did when they realized that live chat, a support channel that had been largely neglected, held the key to transforming their customer support strategy.
It's a story that resonates deeply with me. I know firsthand the pressure to deliver exceptional support while keeping costs in check and the constant search for innovative solutions that can tip the scales in your favor.
So in this edition, I'll take you behind the scenes of Uber's incredible journey as they scaled their live chat support share (of the total contact volume) by a staggering 3500%.
I’ll share the key insights and decisions that drove this transformation, the technical challenges they faced along the way, and the lasting impact it had on their organization.
But first, some background.
Background
Like any other big tech company today, Uber serves millions of customers worldwide through various support channels, including live chat, phone, and in-app messaging.
Among these channels, chat has proven to be the sweet spot, offering a balance between high customer satisfaction (CSAT) scores and reduced cost-per-contact (CPC).
This channel allows for higher automation rates, improved staffing efficiency, and high first contact resolution (FCR), benefiting both Uber and its customers.
But from 2019 to early 2023, only 1% of all contacts were served via the live chat channel, while 58% were handled through the in-app messaging channel (a non-live channel).
To achieve higher CSAT and FCR, Uber's engineering team needed to scale the chat infrastructure and optimize the support process.
Challenges of scaling live chat infrastructure
The three main challenges were legacy architecture, lack of observability in the systems, and stateful services.
1. Legacy architecture limitations
The legacy architecture, built on the WAMP protocol for message passing and PubSub over WebSockets, ran into reliability issues as traffic grew beyond its initial capacity: 46% of the events from customers trying to reach an agent were not delivered on time.
Horizontal scalability was not supported due to limitations with the older versions of the WAMP library being used, and upgrading required substantial effort.
2. Observability and debuggability issues
There were no observability mechanisms to track the health of chat contacts, making it difficult to identify and resolve issues promptly.
Chat contacts were not onboarded to the queue-based architecture; as a result, over 8% of the chat volume fell through the attribute-matching flow and was never routed to any agent.
End-to-end chat contact lifecycle debugging was not implemented, and the team was unable to accurately detect chat SLA misses on the platform overall.
3. Stateful Services
The services were stateful, which complicated maintenance and restarts and caused spikes in message delivery time as well as message loss.
A WebSocket proxy added for authorization increased latency, and the resulting double socket hop caused issues whenever either side disconnected.
To overcome these challenges, the team set out to create a new architecture with the following goals:
It should facilitate end-to-end observability and troubleshooting in the entire chat flow.
It should prioritize stateless services, which are easier to scale horizontally and are less prone to data inconsistencies.
More than 95.5% of the messages sent through the system should be successfully delivered to their intended recipients.
The new architecture
The new chat architecture consists of a front-end UI used by agents and several back-end microservices that work together to provide a smooth chat experience for both customers and agents.
(Architecture diagram. Source: Uber Engineering blog)
Frontend UI and backend microservices
The front-end UI is where agents interact with customers, using various widgets and actions to investigate and respond to inquiries.
The back-end microservices include:
Contact Reservation: The Router service matches agents with the most suitable contacts and reserves the contact for the agent.
Push Pipeline: When an agent successfully reserves a contact, the matched information is sent to Apache Kafka. The frontend application receives this information in real-time through GraphQL subscriptions via a socket connection. It then loads the contact details along with the necessary widgets and actions, enabling the agent to respond to the user promptly.
Agent State: Keeps track of each agent's availability, which can be toggled using a switch on the frontend UI.
Edge Proxy: Acts as a firewall and proxy layer, routing connections between the client browser and back-end services.
Chat traffic is now onboarded to Queues, enabling features such as SLA-based routing, sticky routing (reconnecting customers with the same agent), and priority routing based on predefined rules.
Dashboards have been enhanced to provide real-time insights into Chat Queues SLA, agent availability, contact lifecycle states, queue inflow/outflow, and agent session counts.
At the core of the architecture lies the GQL Subscription Service
The service leverages GraphQL subscriptions to handle reconnections efficiently in case of disconnections. Here’s how the team achieved this:
They implemented a ping-pong heartbeat on the GraphQL subscription socket: if the connection isn't stable, the socket disconnects automatically.
When this happens, the agent won't receive new contacts. The socket attempts to reconnect on its own. Once successfully reconnected, all reserved or assigned contacts are fetched so the agent can accept them.
Moreover, the service prioritizes the reliability of the Push Pipeline. If an acknowledgment is not received from the frontend for a given agent, the service proactively attempts to reserve the contact for another available agent.
To verify that both the WebSocket and HTTP protocols are working for the agent's browser, the GQL Subscription Service runs heartbeat checks over GraphQL subscriptions and HTTP API calls, effectively confirming the agent's online status.
Now, they had to test whether the new architecture would meet their initial goals.
Issues encountered and fixed
During the testing of the new architecture’s performance, the team identified and resolved the following issues:
1. Deletion of browser cookies
Caused authentication failures and prevented the frontend from acting on pushed events
Agents remained online without working on contacts
2. Bugs in auto-logout flows
Agents not being logged out due to out-of-order or missing events
Agents remaining online after closing tabs, resulting in increased customer wait times
They addressed these issues by automatically logging agents out after recent acknowledgment misses and by tracing each logout to its root cause, improving confidence in the system.
How things changed for the better
As of February this year, 36% of the overall Uber contact volume routed to agents is handled through the chat channel. That's a 3500% increase over the previous share, which was just 1%.
There are also massive improvements in reliability. In the old system, the error rate for delivering contacts was a troubling 46%. Now, it’s just 0.45%.
The new architecture is simpler, with fewer services and protocols to manage. Plus, better observability means the team has greater visibility into key metrics like contact delivery, system delays, and end-to-end latency.
Isn't this whole case study a brilliant example of how a company can turn a challenge into an opportunity?! The next time you find yourself facing a challenge, take a page from Uber's playbook.
Embrace the problem, explore every angle, and don't be afraid to take a chance on a solution that others might have overlooked. Because who knows? You might just find yourself at the forefront of the next big transformation in your industry.