How do I investigate for the root cause of database (db) connection errors (max_user_connections/drizzle state read:lost connection to server/unable to configure connection) logged to a site/interface error log?
Oracle Service Cloud (OSvC)
The presence of database (db) connection errors (max_user_connections/drizzle state read:lost connection to server/unable to configure connection) logged to a site/interface error log is an indication that the configured limit of concurrent db connections used by an OSvC site/instance has been reached. With no remaining connections from the available pool to fulfill requests, this condition can cause site-wide errors due to all aspects of the site being unable to connect to the database. Investigating for what is the root cause of all of the trouble can be a complex task, and understanding the common causes and resolution of such problems, as well as how to work with the Technical Support team while investigating, can be very helpful.
Database connection errors will occur when product-level safeguards protecting against overuse of the db resource are insufficient to appropriately limit the number of concurrent db connections in use.
What safeguards are designed into the product to avoid performance degradation and db connection problems?
These safeguards include the following (including the max_user_connections configuration itself):
- The OSvC service is a shared resource, and sites are initially provisioned to handle expected capacity. Capacity constraints can be adjusted by our GNC team according to changing capacity needs.
- Sites are protected by the hidden configuration (max_user_connections) used to indicate the maximum number of concurrent db connections allowable by a single site/instance. This number is 150 for all sites without exception.
- Maximum join configurations limit queries made on the site/instance db (limits queries to scan a configured number of db rows under specific and identified circumstances)
- ROQL blacklisting functionality is used in conjunction with maximum join configurations to limit queries made on the site/instance db (blocks/blacklists or diverts ROQL queries to replication)
- Product feature-level timeouts limit the amount of time a db connection can be held by a particular service request (SOAP API for example)
When are these safeguards insufficient?
The safeguards put in place are sometimes not enough to protect a site from hitting the max_user_connections configured limit. It can be helpful to consider that Service Cloud instances/sites can process a massive amount of data and transactions in a given time. Further, it happens to be that often the cause of db connection problems are less a problem with overall site/instance capacity utilization but rather the result of a single aspect of the site configuration/customization that is able to hold all 150 of the available connections alone under certain circumstances.
How should I go about investigating db connection errors?
The first thing to do when investigating db connection issues is to confirm the error using the error log of the site/interface, or otherwise identifying such an error in a server-side trace file (by Oracle resources with access). Be aware that if one interface is showing the error, then all interfaces receiving activity at that same time are likely also receiving the same type of error showing up across all uses of the product db (for the site/instance). For this reason it is helpful to be aware that most, if not all, of the errors logged (indicating db connection problems) will be specific to the symptom of the problem rather than the cause. For this reason it is helpful to understand that some product areas that more likely to be associated with such a problem.
To investigate max_user_connections errors we want to evaluate the following:
- Is there anything specific configured/developed on the site that will wait on another (external) system while holding a db connection – where concurrent db connections can stack up when there is increased latency experienced in system calls? Site administrators/developers on the customer side are often the best resources to be able to pinpoint this possibility.
- Is customization/configuration of the site following prescribed best practices for public API usage? For further details see
- Are there analytics reports being run by too many users at one time – such as with complex reports (complex enough to require relatively significant run time but not scanning too many rows to encounter max join limitations) used frequently enough in specific periods of time? This can involve reports run in CP that become a problem when CP is under relatively high load. This can also happen with reports run in the agent console, including agent dashboard reports if all agents login about the same time, for example. Another scenario where reports can cause max join errors is when reports are set to refresh frequently (such as every 5 seconds).
- Is the database server experiencing performance degradation? This is a concern because slower response times mean that db connections are held for a longer period of time to do the same amount of work – and thus can tip the scales in terms of concurrently used db connections. If performance degradation is observed and there are no known site changes and/or volume increase on the site itself there is a possibility that another customer on the same db server is causing the degradation. In such cases it is advised to create a Service Request with the Technical Support team with a request that the cause of the performance degradation be determined. For further details see
What other considerations should be made when investigating max_user_connections errors?
Appropriate optimization of the db will generally cause queries to perform better, thus reducing the time that processes hold onto a db connection for a specific action involving the running of said queries. For further details on db optimization see
Finally, there are rare cases where the investigation of db connection errors are found to involve a potential product defect or enhancement. Investigations of this nature can lead to improvements to the performance and/or scalability of the product., and can include such things as enhanced caching to fulfill requests of a certain type etc.. This possibility is an additional reason to engage Technical Support through a Service Request if database connection issues are causing problems for a site.