How to Debug CPU Overuse in Ruby Applications


I will start by providing some context about our system, which I call Service X, then describe the steps I took to discover the offending process. My hope is that you can reuse this debugging process in the future.


We run a dockerized Ruby service on a Puma web server. Many of our services use Docker, Ruby, an Alpine base image and MariaDB.

Before migrating to AWS, we used Prometheus for standard and custom metrics. One such metric, cpu_user_seconds_total, collected by Prometheus via cAdvisor, reports the cumulative user CPU time consumed, in seconds.
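Because the metric is cumulative, a single reading says little on its own; what matters is how fast it grows. Here is a minimal sketch of that arithmetic, assuming two readings of the counter taken a known number of seconds apart. The `Sample` struct, the method name and the numbers are made up for illustration and are not part of Prometheus or cAdvisor.

```ruby
# Hypothetical sample type: when the reading was taken (in seconds) and the
# cumulative CPU seconds reported at that moment.
Sample = Struct.new(:timestamp, :cpu_seconds_total)

# Average CPU utilization between two samples, as a percentage of one core.
def cpu_utilization_percent(earlier, later)
  elapsed = later.timestamp - earlier.timestamp
  raise ArgumentError, "samples must be in chronological order" unless elapsed.positive?

  consumed = later.cpu_seconds_total - earlier.cpu_seconds_total
  (consumed / elapsed) * 100.0
end

# Illustrative numbers: 30 CPU seconds consumed over a 60 second window.
earlier = Sample.new(0.0, 120.0)
later   = Sample.new(60.0, 150.0)
utilization = cpu_utilization_percent(earlier, later)

puts format("%.1f%% of one core", utilization) # prints "50.0% of one core"
```

A counter that resets (for example after a container restart) would produce a negative result here, so a real calculation would need to handle that case.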

As explained here, system CPU seconds refers to the amount of CPU time used by the kernel for tasks like interacting with hardware, allocating memory, communicating with OS processes, and managing the file system. User CPU seconds refers to time used by user space processes, like those initiated by an application or a database server, or by anything other than the kernel. For simplicity, I will refer to user CPU seconds as CPU utilization, as that is what Linux intends this metric to measure.
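To make the user/system split concrete, here is a small sketch using Ruby's built-in `Process.times`, which reports the user and system CPU seconds consumed by the current process. The `cpu_seconds` helper and the two workloads are my own illustration, not something taken from our service.

```ruby
# Returns the user and system CPU seconds consumed while the block runs.
def cpu_seconds
  before = Process.times
  yield
  after = Process.times
  { user: after.utime - before.utime, system: after.stime - before.stime }
end

# Pure Ruby arithmetic: runs almost entirely in user space.
user_heavy = cpu_seconds { 2_000_000.times { |i| Math.sqrt(i) } }

# Repeated file metadata lookups: each one is a system call into the kernel.
system_heavy = cpu_seconds { 50_000.times { File.stat(".") } }

puts format("arithmetic: user=%.3fs system=%.3fs", user_heavy[:user], user_heavy[:system])
puts format("file stats: user=%.3fs system=%.3fs", system_heavy[:user], system_heavy[:system])
```

The exact numbers depend on the machine, but the second workload should normally show a noticeably larger share of system time than the first.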

None of our services came close to using all of its allotted CPU, except one…


In our production environment, the CPU utilization of the containers hosting Service X increased until it reached 100 percent. This would happen after 3 days, at which point we’d have to restart the service. The CPU utilization increased with the number of Puma threads and HTTP requests.

Other services, some of which handle many more requests than Service X without issue, have the same dependencies with the same versions as Service X. That, coupled with the fact that user CPU was increasing, led me to believe that our application code had a problem.

Investigation process

Phase 1: Pinpoint the problematic request

Depending on the pattern of the spike, the offending process will not always be a simple, fully synchronous HTTP request. It could be a scheduled job or the processing of a message from a queue. Our pattern suggested that sustained requests to two problematic endpoints were causing the spike. The majority of requests were hitting a GET endpoint which retrieves a resource. The culprit was likely a line of code hidden in that endpoint.

Phase 2: Replicate the issue on a testing environment

To verify that these endpoints were causing the spike, I ran a series of load tests on each of them, using Locust. Firing the same number of requests at the suspect GET endpoint on our testing environment yielded the same spike! And, most importantly, the graph of CPU usage time on testing mirrored the one on production: the curve looked the same and the container also took 3 days to become unusable. To speed up the debugging process, I increased the number of Locust requests, so that CPU usage time would spike faster:

[Figure: CPU usage time on the testing environment under the increased Locust load]

Phase 3: Locate the offending lines of code

Now came the meat of the investigation: finding that line.

For illustration purposes, let’s say we are a note-taking app, and that the resource we are GETting is a “notebook”. A customer can only GET their own notebooks (not others’) and has to upgrade to Premium to be able to perform other actions, like deleting notebooks. Each notebook has an identifier, attributes and some related resources (collaborators, notes, etc.).

Our application uses Grape and employs a service-repository pattern.

The request to GET /notebook/:id gets routed to a Grape class that defines a get method. That method instantiates a service class (a class responsible for the application’s business logic), passing any relevant data from the request headers. That data has already been wrapped in a context object by Rack middleware. The method then asks the service instance to find a notebook with the id from the request parameters.
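The flow described above can be sketched in plain Ruby. This is an illustration of the service-repository pattern under stated assumptions, not our actual code: every class, method and field name below is hypothetical, the repository is an in-memory array rather than MariaDB, and `get_notebook` stands in for the body of the Grape get method so the example runs without the Grape gem.

```ruby
# Hypothetical context and resource types.
Context  = Struct.new(:customer_id, :allowed_actions, :request_id, keyword_init: true)
Notebook = Struct.new(:id, :customer_id, :title, keyword_init: true)

# Repository: the only layer that knows how notebooks are stored.
class NotebookRepository
  def initialize(rows)
    @rows = rows
  end

  # Returns the notebook only if it belongs to the given customer, else nil.
  def find(id:, customer_id:)
    @rows.find { |n| n.id == id && n.customer_id == customer_id }
  end
end

# Service: business logic, driven by the request context.
class NotebookService
  def initialize(context, repository:)
    @context = context
    @repository = repository
  end

  def find(id)
    raise ArgumentError, "get is not permitted" unless @context.allowed_actions.include?(:get)

    @repository.find(id: id, customer_id: @context.customer_id)
  end
end

# Stand-in for the endpoint: build the service from the context, then look up
# the notebook using the id from the request parameters.
def get_notebook(params, context, repository)
  NotebookService.new(context, repository: repository).find(params[:id])
end

repository = NotebookRepository.new([
  Notebook.new(id: 1, customer_id: 42, title: "Groceries"),
  Notebook.new(id: 2, customer_id: 7, title: "Private notes")
])
context = Context.new(customer_id: 42, allowed_actions: [:get], request_id: "req-123")

own   = get_notebook({ id: 1 }, context, repository)
other = get_notebook({ id: 2 }, context, repository)

puts own.title     # prints "Groceries"
puts other.inspect # prints "nil": notebook 2 belongs to another customer
```

Keeping the ownership check inside the repository query means the service never holds a notebook that belongs to a different customer.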

The context comes from request headers that are inserted by a different application before the request hits our notebook application. It contains the ID of the customer (so we only return a notebook belonging to that customer), the actions that customer is allowed to perform (can they get and delete notebooks, or just get them?) and the request ID for logging purposes:

