Understanding Rate Limiting: Strategies and Solutions



    With the advancement of technology and the interconnection of various online systems, the amount of data transferred through web servers is rapidly increasing. Web services provide mass access to data via an Application Programming Interface (API). However, the increased volume of queries also requires a certain level of protection against various intentional or unintentional abuses, including excessive request frequency. To protect those services, we can use rate limiting, which is implemented using multiple algorithms. Clients sending a large number of requests to the web servers must, therefore, manage the limitations themselves.

    Databox supports data retrieval from a large number of web services, more than half of which implement one or more request-limiting methods. Millions of requests are sent daily to servers with such limitations. Scheduling these requests within those limits has been a challenging task that required multiple trips from the drawing board to production. While the system is ever-evolving, it can now effectively handle client-side rate limiting at our scale.

    Keep reading to learn how we approach rate limiting at Databox.

    What is Rate Limiting?

    Rate limiting is a technique used to control the rate of incoming and outgoing requests to and from a network, server, or application over a specified period of time. By limiting the number of requests or actions in a given timeframe, rate limiting ensures that the system remains stable under heavy load. Rate limiting is commonly implemented in APIs, web servers, and other network services to prevent abuse, manage resource usage, and provide fair access to all users. As described by Cloudflare, it is also an effective solution for managing bot activity.

    Reactive vs. Proactive Rate Limiting

    From the client’s perspective, request limiting can be implemented reactively or proactively.

    What is Reactive Rate Limiting?

    Reactive rate limiting means responding to traffic spikes or abuse after they occur, blocking or throttling requests after the defined limits are exceeded.

    An example of reactive request limiting on the client side is responding to a “rate limit exceeded” error from the web service, which often requires waiting for a specified time before retrying. In many cases, this waiting time is not provided in the error feedback, so exponential backoff is implemented to mitigate rate limits.
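    To make this concrete, here is a minimal sketch of the reactive approach in Python. It assumes a generic HTTP endpoint and uses the third-party requests library; the retry count and base delay are arbitrary illustrative values.

    ```python
    import random
    import time

    import requests

    def get_with_backoff(url: str, max_retries: int = 5, base_delay: float = 1.0):
        """Reactively handle rate limits: retry with exponential backoff on 429."""
        for attempt in range(max_retries):
            response = requests.get(url)
            if response.status_code != 429:
                return response
            retry_after = response.headers.get("Retry-After")
            if retry_after is not None:
                # Use the server-provided wait time when the API supplies one
                # (assumes Retry-After carries seconds, not an HTTP date).
                delay = float(retry_after)
            else:
                # Otherwise fall back to exponential backoff with jitter.
                delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)
        raise RuntimeError("rate limit still exceeded after retries")
    ```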

    What is Proactive Rate Limiting?

    Proactive rate limiting means consistently enforcing predefined limits to anticipate and prevent traffic spikes or abuse. An example of proactive rate limiting is delaying requests before the entire available quota is used up.

    At Databox, we focused on the proactive approach. By carefully estimating when requests can be made, we stay mindful of the limitations and can ensure higher data delivery rates. This way, we avoid excessively burdening servers and prevent errors from occurring in the first place, which would otherwise degrade the user experience.

    In order to act proactively, the developer must thoroughly analyze, research, and describe the integration’s rate-limiting methods before actual implementation. At Databox, we have developed our own global format for recording these limitations for each integration.

    Rate-limiting Algorithms

    The three most common rate-limiting algorithms used in our integrations are Fixed Window, Sliding Window, and Concurrent Requests. If you are already familiar with these methods, skip ahead to the implementation section below. Understanding these algorithms shaped the design phase of our solution.

    Fixed Window Rate Limiting

    Fixed window rate limiting is a simple approach where you restrict the maximum number of requests (M) a client can make within a specific time window (w). For example, you might allow a client to make only 1000 requests per hour. Once the limit is reached (i >= M) within that window, further requests are denied until the window resets.

    Fixed Window Rate Limiting

    The request is usually denied with a 429 Too Many Requests status code. As previously explained in our article about API integration, API providers we integrate with often don’t follow the guidelines, so we have to detect rate-limit errors among a variety of status codes and response payloads.

    This method is straightforward but can lead to bursts of traffic if requests are made in quick succession near the window boundary. As we’ll see in the next section, other limiting methods are usually in place to handle such short bursts.
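    As a rough illustration, a fixed window limiter can be sketched in a few lines of Python. This is a simplified, single-threaded sketch, not any particular server's implementation:

    ```python
    import time

    class FixedWindowLimiter:
        """Allow at most M requests per fixed window of w seconds."""

        def __init__(self, max_requests: int, window_seconds: float):
            self.max_requests = max_requests      # M
            self.window_seconds = window_seconds  # w
            self.window_start = 0.0
            self.count = 0                        # i

        def allow(self) -> bool:
            now = time.monotonic()
            window = now - (now % self.window_seconds)  # round down to window start
            if window != self.window_start:
                self.window_start = window  # the window reset: start counting anew
                self.count = 0
            if self.count >= self.max_requests:  # i >= M: deny until the next reset
                return False
            self.count += 1
            return True

    # 1000 requests per hour, as in the example above
    limiter = FixedWindowLimiter(max_requests=1000, window_seconds=3600)
    ```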

    Sliding Window Rate Limiting

    Sliding window rate limiting is an approach that smooths out the number of requests to the server over time. Like the fixed window method, the sliding window uses a time window in which we define the maximum number of requests. However, unlike the fixed window, the sliding window does not “reset” at the end of the time window; instead, it moves or slides over time with a specified “step.” Since the method’s purpose is preventing bursts, the window size (w) is usually given in shorter units (1s, 10s, 60s, 100s, etc.). The end of the time window is always the current moment (the request timestamp). At the time of each request, the counter (i) represents the number of requests executed in the window up until that moment.

    There are several possible server-side implementations of this method. Those include:

    • Storing all request timestamps. To get the current state, sum all requests falling within the window (w) when a new request is received. A minimal sketch of this variant follows below.
    • Aggregating requests into shorter sub-windows to reduce the required storage, at the cost of some accuracy. To get the current state, sum the smaller number of related counters when a request is received.
    • Maintaining a single counter (i) representing the current state. Deduct requests falling out of the window from the counter, and add new requests to it. The current state is always available, but maintaining it requires continuous deductions.
    Sliding Window Rate Limiting
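    Here is a minimal sketch of the first variant (storing all request timestamps), again simplified and single-threaded:

    ```python
    import time
    from collections import deque

    class SlidingWindowLimiter:
        """Allow at most M requests within any trailing window of w seconds."""

        def __init__(self, max_requests: int, window_seconds: float):
            self.max_requests = max_requests      # M
            self.window_seconds = window_seconds  # w
            self.timestamps = deque()             # one timestamp per accepted request

        def allow(self) -> bool:
            now = time.monotonic()
            # Drop timestamps that have slid out of the trailing window.
            while self.timestamps and now - self.timestamps[0] >= self.window_seconds:
                self.timestamps.popleft()
            if len(self.timestamps) >= self.max_requests:  # i >= M
                return False
            self.timestamps.append(now)
            return True
    ```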

    Concurrent Requests

    Limiting concurrent requests is the third most common method among the APIs that Databox integrates. This method limits us to a maximum number (M) of requests that can be running simultaneously. In this case, the server keeps track of currently active requests, reports being generated, and analytics being prepared. The method is briefly described in the image below.

    Limiting Concurrent Requests
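    On the client side, a concurrency limit maps naturally onto a semaphore. The sketch below is an illustrative thread-based example:

    ```python
    import threading

    class ConcurrencyLimiter:
        """Allow at most M requests to be in flight at the same time."""

        def __init__(self, max_concurrent: int):
            self._slots = threading.BoundedSemaphore(max_concurrent)  # M slots

        def try_acquire(self) -> bool:
            # Non-blocking: deny immediately instead of queueing when full.
            return self._slots.acquire(blocking=False)

        def release(self) -> None:
            self._slots.release()

    limiter = ConcurrencyLimiter(max_concurrent=10)
    if limiter.try_acquire():
        try:
            ...  # execute the request
        finally:
            limiter.release()
    ```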

    Common denominator

    While designing our client-side rate-limiting system, our engineers quickly stumbled upon an issue. Concurrent requests aside, the APIs we integrate seldom disclose the exact rate-limiting algorithm. Documentation usually provides only a maximum number of requests (M) and a limit window (w), for example, 1200 requests/minute. Since the exact method is often unknown, we needed to find a common denominator that would support both of them.

    In the sliding window, we always calculate the current state (i) based on the current timestamp and the window size. In the fixed window example below (w = 1 minute), we use the same window size, but the current timestamp is rounded to a full minute as a reference.

    We identified a slide size (s) for both methods. In the fixed window method, the window “slides” every minute, equal to the window size. On the sliding window, the window slides every millisecond (or whichever is your shortest time unit of choice). We visualized a 200ms slide size in the image below for easier visibility.

    This brings us to the following conclusions: 

    • Both methods can be described using three parameters (M, w, s). 
    • Both methods have identical throughput over the same duration.
    • A fixed window can be safely described as a sliding window whose slide size equals its window size (s = w).
    • Only the sliding window has to be implemented to support both methods.
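    Putting these conclusions together, both methods can be captured by a single (M, w, s) parameterization. The dataclass below is a hypothetical illustration, not our production schema:

    ```python
    from dataclasses import dataclass

    @dataclass
    class RateLimit:
        max_units: int         # M: maximum requests (or units) per window
        window_seconds: float  # w: window size
        step_seconds: float    # s: how often the window slides

    # A fixed window is a sliding window whose step equals the window (s = w):
    fixed = RateLimit(max_units=1200, window_seconds=60, step_seconds=60)

    # The same limit as a sliding window with a 200 ms step, as in the figure:
    sliding = RateLimit(max_units=1200, window_seconds=60, step_seconds=0.2)
    ```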

    How to Implement a Client-Side Rate Limiting System

    Some integrations don’t rate-limit the number of requests. Instead, they limit “tokens” or “units,” where each request can consume a different amount based on its complexity (e.g., date range length, number of filters, number of attributes). Therefore, we shifted our terminology from rate-limiting executed requests to rate-limiting consumed units. In cases where an integration does limit the number of requests, the consumption is simply set to one unit per request.

    To accurately stay within the limits imposed by integrations, we introduced two concepts: Consumptions and Reservations.

    Consumptions track how many rate-limit units were consumed on a timeline. We use those later on to:

    • Calculate the current rate limit state and the number of units that are available to be spent at any given moment
    • Approximate the timestamp at which a request consuming a specific amount of units can be executed

    Reservations represent rate-limit units that are currently reserved. Those belong to requests that are currently being executed (waiting for a response or about to be executed shortly).

    How to track reservations

    We track active requests separately from consumptions, as reservations, because of concurrent-request rate limits. Those requests cannot be accurately described on a timeline, since each request can have a different response time and there is no defined window size (w). Below is a simple visualization of a dictionary containing reservations for all currently active requests. Consumed units are estimated in advance, since some integrations consume different numbers of units depending on the response complexity.

    Tracking Reservations

    The integration engineers carefully set the estimated consumptions by studying the documentation. They also track and monitor consumed units and rate limiting errors over time to fine-tune these estimations.

    Once the response is received, the reservation is converted to a consumption with actual values. Reservations and actual consumptions are taken into account together when checking the rate limit boundaries.
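    To make the lifecycle concrete, here is a simplified sketch of that flow. The reserve/settle names and the flat in-memory dictionaries are illustrative, not our production code:

    ```python
    import time
    import uuid

    reservations = {}  # request_id -> estimated units, for in-flight requests
    consumptions = {}  # bucket start timestamp -> units actually consumed

    def reserve(estimated_units: int) -> str:
        """Reserve units before sending the request, using the engineers' estimate."""
        request_id = str(uuid.uuid4())
        reservations[request_id] = estimated_units
        return request_id

    def settle(request_id: str, actual_units: int, bucket_seconds: float = 1.0) -> None:
        """On response, convert the reservation into a consumption with actual values."""
        reservations.pop(request_id, None)
        now = time.time()
        bucket = now - (now % bucket_seconds)  # bucket keyed by its start timestamp
        consumptions[bucket] = consumptions.get(bucket, 0) + actual_units
    ```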

    How to track used units

    After each request is executed, we track the number of units the request spent. As mentioned, some integrations only reveal how many units were consumed in the response; that's why tracking actual consumptions happens after the response is received. We execute about 24 million API requests daily.

    We decided to break each rate-limiting window into buckets to save storage and computing resources. These buckets are effectively the same size as the step parameter (s) described in the previous sections. Each bucket has its own counter indicating how many units were consumed in that bucket's window. Bucket size (and step size) is dynamic, based on the rate-limit window size: we track shorter limits more accurately (e.g., a 1-second rate limit is bucketed by 10ms) and longer limits with coarser bucket sizes (e.g., a 1-hour rate limit is bucketed by 1s). The figure below shows an example with longer buckets for visualization purposes.

    Tracking units in rate limiting

    Computing from fewer buckets is faster while still maintaining decent accuracy. When the response is received, and the actual consumption is calculated, we store the consumption in the correct bucket based on the request’s timestamp.
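    A simple way to derive the bucket (and step) size from the window size might look like the sketch below; the middle tier is our illustrative assumption, while the outer tiers match the two examples above:

    ```python
    def bucket_size(window_seconds: float) -> float:
        """Pick a bucket (and step) size from the window size: fine-grained
        for short windows, coarser for long ones."""
        if window_seconds <= 1:
            return 0.01  # a 1-second rate limit is bucketed by 10 ms
        if window_seconds <= 60:
            return 0.1   # illustrative middle tier
        return 1.0       # a 1-hour rate limit is bucketed by 1 s
    ```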

    Calculating the current state

    With both reservations and consumptions in place, we can fairly accurately determine whether a new request can be executed at a given moment. We do that by summing the consumptions within the rate-limiting window and the currently active requests in the reservations. The example in the figure below shows a rate limit of 10 units / 1 minute (M / w).

    Calculating current state in Rate limiting

    To calculate the current state, we sum relevant consumption buckets (c = 1 + 6 + 1 = 8) and the currently active reservations (r = 1). Our system indicates that the current state (i) is, therefore, 9 (i = c + r). A request with a consumption of 1 can therefore be made (i + 1 <= M → true), while a request consuming, e.g., 5 units, cannot be made (i + 5 <= M → false). After the request is executed, our state would look like this:

    Rate limit state after the request is executed
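    In code, the check might look like the following sketch. The numbers mirror the figure (c = 1 + 6 + 1 = 8, r = 1, M = 10), and the bucket keys are illustrative timestamps:

    ```python
    def can_execute(cost: int, now: float, max_units: int, window_seconds: float,
                    consumptions: dict, reservations: dict) -> bool:
        """Check whether a request consuming `cost` units fits the current window."""
        window_start = now - window_seconds
        # c: units consumed inside the trailing window, summed from the buckets
        c = sum(units for bucket, units in consumptions.items() if bucket >= window_start)
        # r: units reserved by requests that are currently in flight
        r = sum(reservations.values())
        i = c + r  # current state
        return i + cost <= max_units

    # Mirroring the figure: c = 1 + 6 + 1 = 8 consumed, r = 1 reserved, M = 10
    consumptions = {10.0: 1, 30.0: 6, 50.0: 1}
    reservations = {"req-1": 1}
    print(can_execute(1, 60.0, 10, 60.0, consumptions, reservations))  # True:  9 + 1 <= 10
    print(can_execute(5, 60.0, 10, 60.0, consumptions, reservations))  # False: 9 + 5 > 10
    ```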

    Calculating the next available execution window

    Monitoring the current state allows us to stay within limits at any moment. We aim to execute requests quickly to provide the best possible user experience, but that could mean recalculating the current state every couple of milliseconds. However, some rate limits are hourly or daily, and once those are consumed, all further state checks needlessly burn system resources. Therefore, we introduced a feature that calculates the next available window in which a given request can be executed.

    As the figure below shows, we’ll simulate the window sliding into the future using the step parameter (s). That will ultimately put older buckets out of the picture and free up some available units. That’s the next estimated window for executing the request. The example shows the same 10 units / 1 minute rate limit. Sliding the window into the future twice (now + 2s) gives us a consumption sum of 9, which is under the limitation. Therefore, we’ll be able to consume that available unit at that time.

    Calculating the next available execution window
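    A sketch of that simulation follows. The search-horizon guard is our own safety addition for this example, not part of the described design:

    ```python
    def next_available_time(cost: int, now: float, max_units: int,
                            window_seconds: float, step_seconds: float,
                            consumptions: dict, reservations: dict,
                            max_steps: int = 100_000) -> float:
        """Slide the window into the future, step by step, until `cost` units fit."""
        r = sum(reservations.values())  # in-flight requests are assumed to stay reserved
        t = now
        for _ in range(max_steps):
            window_start = t - window_seconds
            c = sum(u for b, u in consumptions.items() if b >= window_start)
            if c + r + cost <= max_units:
                return t  # the earliest estimated moment the request can execute
            t += step_seconds  # older buckets fall out as the window slides forward
        raise RuntimeError("no available window within the search horizon")
    ```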

    Integrating All Components for a Simplified Solution

    By wiring together the components, we’ve built a robust, client-side rate-limiting system capable of handling millions of API requests daily. The system works by tracking the real-time consumption of rate-limit units, making proactive reservations for upcoming requests, and calculating when the next available request window opens up. With this approach, we can ensure that requests are sent efficiently, staying well within rate limits while also optimizing resource usage. This keeps our operations smooth and helps deliver a better overall user experience.


    Did you know that the concept of rate limiting is not new? It dates back to the early days of telecommunication when operators had to manage call volumes to prevent network congestion. 

    Today, maintaining the stability and efficiency of web services is no longer possible without effective rate limiting. At Databox, we’ve devised and implemented a unified rate-limiting solution that efficiently handles millions of API requests daily, regardless of the underlying algorithm—whether it’s Fixed Window, Sliding Window, or Concurrent Requests—used by the APIs. It has not only safeguarded our infrastructure but also enhanced user experience by minimizing errors and maximizing data delivery rates. 


    Article by Aljaz Mislovic, Tech Lead Data Platform at Databox