Designing and Implementing an API Rate Limiter in Java

Kishan Kumar
Jul 13, 2024



In this article, we’ll go over the basics of what an API rate limiter is. We’ll answer fundamental questions such as why we need a rate limiter and how we can design and implement it. This will be a hands-on guide where we put our knowledge into practice and build something useful.

Let’s start by answering the very first question:

What is a Rate Limiter?

A rate limiter is a service that stops a client from performing further operations once it has exceeded its quota within a certain time window. At a high level, it restricts requests by returning an HTTP 429 (Too Many Requests) status code.

It is usually considered a service that protects downstream services from being overwhelmed by a single client. For example, when you are entering your OTP during an online payment and you enter three wrong OTPs consecutively, you are no longer allowed to request an OTP or make any transactions for a certain time frame. But why is this done? What good does it do to prevent a user from entering the OTP?

From a hacker’s perspective, it is possible someone is trying random OTPs to get through the transaction. What if it’s not even a person but a script? Rate limiting is done to safeguard both the user and the OTP service from such scripts.

In short, a rate limiter limits how many requests a sender can issue in a specific timeframe. If the number of requests exceeds that limit, it blocks or throttles them.

Why Do We Need It?

As explained above, protection from bad clients is one use case. Hackers can run scripts that try to brute force passwords until they gain access to your account. They can also perform a Denial-of-Service (DoS) attack on a server. Without a rate limiter, services wouldn’t be available to legitimate users. A rate limiter prevents DoS attacks, intentional or unintentional, by blocking the excess calls. For example, Twitter limits the number of tweets a user can send in a 24-hour period. Recently, Twitter also announced limits on how many tweets a user can view in a 24-hour window.

Twitter Rate Limiter

In some cases, you may want to sell your services via an API and charge for it. For example, OpenAI allows free users only a handful of GPT-4 requests per day. That is a form of rate limiting. If you pay for their subscription, that limit is raised.

Some businesses use rate limiting to prevent bots from accessing their resources. Some also use machine learning on top of rate limiting to determine if a request is from a genuine user or a bot.

Another use case is cost reduction. Rate limiting is extremely important for small start-ups that are billed per call by providers such as Amazon Web Services; limiting the number of such API calls is essential to keep costs in check.

Let’s now understand the requirements and goals of the system:

Requirements and Goals of the System

Functional Requirements (FR):

  • The system should limit the number of requests a client can send within a certain time window, e.g., no more than 10 requests per second.
  • The system should work in a distributed environment.
  • The system should inform the users who are throttled.

Non-Functional Requirements (NFR):

  • It should be highly available since all the calls to the downstream services must first pass through the rate limiter. If the rate limiter itself is unavailable, we won’t be able to forward traffic to the downstream services.
  • It should be highly fault tolerant: if there is a problem with the rate limiter, it shouldn’t affect the entire system.
  • The rate limiter should not add substantial latency as it will impact the user experience.

Types of Rate Limiting

Rate limiting can be divided into three types:

  1. Hard Limit: The number of requests cannot exceed the defined limit (throttle).
  2. Soft Limit: A buffer is provided, usually given as a percentage. For example, if our threshold is set at 10 requests per second with a 20% buffer, we can allow a maximum of 12 requests per second.
  3. Elastic or Dynamic Limit: The limit can be hard or soft depending on the resources. If we have resources to handle more requests, we can switch to a soft limit; otherwise, we use a hard limit.

Let’s talk about some algorithms with which we can implement the rate limiter.

Algorithms to Implement Rate Limiting

Rate limiting can be implemented using different algorithms, each with distinct pros and cons. In the following section we will try to understand each one of them using examples.

Fixed Window Algorithm

In the Fixed Window Algorithm, time is divided into fixed intervals or windows, such as seconds, minutes, or hours. Each window has a maximum number of requests that can be processed. Once the limit is reached within a window, any further requests are blocked or throttled until the next window starts.

Example: If the rate limit is set to 3 requests per minute, and the time window starts at 12:00:00, then between 12:00:00 and 12:00:59, only 3 requests will be accepted. At 12:01:00, the counter resets, and another 3 requests can be accepted.

Fixed Window Algorithm

Advantages

  • Simple to implement.
  • Easy to understand and manage.

Disadvantages

  • It can cause spikes at the boundaries of the windows. For instance, if 2 requests are made after 12:00:30 and another 3 requests before 12:01:10, effectively 5 requests are processed within that span of under a minute, exceeding the 3-per-minute limit and potentially overwhelming the system.

Sliding Window Algorithm

The Sliding Window Algorithm, also known as the Rolling Window Algorithm, addresses the drawbacks of the Fixed Window Algorithm by allowing a continuous, moving window of time to limit requests. Instead of resetting at fixed intervals, the sliding window checks the rate limit based on the last request timestamp and a rolling count of the requests within the window duration.

Example: If the rate limit is set to 3 requests per minute, the algorithm maintains a rolling count of requests within the last 60 seconds at any given time. For each new request, it checks if the number of requests in the past 60 seconds exceeds the limit. If yes, the request is blocked; if no, the request is processed and the count is updated.

Advantages

  • Provides a more even distribution of requests over time, preventing spikes at the boundaries.
  • More efficient and effective in handling bursty traffic.

Disadvantages

  • Slightly more complex to implement compared to the Fixed Window Algorithm.

Sliding Window Algorithm

Token Bucket Algorithm

Another popular algorithm is the Token Bucket Algorithm, which allows for a certain burstiness in request rates while enforcing a rate limit over a longer period. Tokens are added to the bucket at a constant rate, and each request consumes a token. If there are no tokens left, the request is blocked or throttled.

The token bucket algorithm takes two parameters:

  • Bucket size: the maximum number of tokens allowed in the bucket.
  • Refill rate: the number of tokens put into the bucket every second.

Amazon and Stripe use this algorithm.

Example: If the rate limit is set to 3 requests per minute, tokens are added to the bucket at a rate of 1 token every 20 seconds. The bucket can hold a maximum of 3 tokens. When a request comes in, it consumes a token. If the bucket is empty, the request is blocked.

Token Bucket Algorithm

Advantages

  • Allows for short bursts of traffic while still enforcing an overall rate limit.
  • Can be adjusted to handle varying traffic patterns.

Disadvantages

  • More complex to implement and manage compared to the Fixed Window Algorithm.
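
To make this concrete, here is a minimal, single-threaded in-memory sketch of a token bucket in Java. The class and field names are my own, and the parameters mirror the 3-requests-per-minute example above; a production limiter would also need synchronization and a shared store.

import java.util.HashMap;
import java.util.Map;

public class TokenBucketLimiter {

    // Parameters from the example above: a bucket of 3 tokens,
    // refilled at 1 token every 20 seconds.
    private static final int BUCKET_SIZE = 3;
    private static final double REFILL_TOKENS_PER_MS = 1.0 / 20_000;

    private static class Bucket {
        double tokens = BUCKET_SIZE; // start full, so short bursts are allowed
        long lastRefill = System.currentTimeMillis();
    }

    private final Map<String, Bucket> buckets = new HashMap<>();

    public boolean request(String userId) {
        Bucket bucket = buckets.computeIfAbsent(userId, id -> new Bucket());
        long now = System.currentTimeMillis();
        // Lazily refill tokens based on the elapsed time, capped at the bucket size.
        bucket.tokens = Math.min(BUCKET_SIZE,
                bucket.tokens + (now - bucket.lastRefill) * REFILL_TOKENS_PER_MS);
        bucket.lastRefill = now;
        if (bucket.tokens >= 1) {
            bucket.tokens -= 1; // consume one token for this request
            return true;
        }
        return false; // bucket empty: block or throttle the request
    }
}

Note how the burst allowance falls out naturally: a full bucket lets 3 requests through back to back, after which requests trickle in at the refill rate.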

Leaky Bucket Algorithm

The Leaky Bucket Algorithm is similar to the Token Bucket Algorithm but focuses on smoothing out the traffic. Requests are added to a queue (bucket) and processed at a fixed rate. If the queue is full, incoming requests are dropped.

It is usually implemented with a first-in-first-out (FIFO) queue. The algorithm works as follows:

  • When a request arrives, the system checks if the queue is full. If it is not full, the request is added to the queue; otherwise, the request is dropped.
  • The requests in the queue are processed at a fixed rate.

The leaky bucket algorithm takes the following two parameters:

  • Bucket size: the maximum number of requests allowed in the queue.
  • Outflow rate: the number of requests that can be processed at a fixed rate by the services.

Shopify, an e-commerce company, uses the leaky bucket algorithm for rate limiting.

Example: If the rate limit is set to 3 requests per minute, the bucket processes 1 request every 20 seconds. Incoming requests are added to the bucket, and if it is full, new requests are dropped.

Leaky Bucket Algorithm

Advantages

  • Provides a smooth and consistent request rate.
  • Simple to understand and manage.

Disadvantages

  • Can lead to dropped requests if the incoming rate consistently exceeds the processing rate.
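
Here is a similarly minimal, single-threaded sketch of a leaky bucket with the parameters from the example above (the names are mine). In a real system a background worker would drain the queue at the outflow rate; here we simulate that lazily whenever a request arrives.

import java.util.ArrayDeque;
import java.util.Deque;

public class LeakyBucketLimiter {

    // Parameters from the example above: a queue of 3 requests,
    // drained at 1 request every 20 seconds.
    private static final int BUCKET_SIZE = 3;
    private static final long OUTFLOW_INTERVAL_MS = 20_000;

    private final Deque<Long> queue = new ArrayDeque<>(); // FIFO queue of pending requests
    private long lastLeak = System.currentTimeMillis();

    public boolean request() {
        long now = System.currentTimeMillis();
        // Lazily "leak": remove as many requests as the fixed outflow
        // rate would have processed since the last check.
        long leaked = (now - lastLeak) / OUTFLOW_INTERVAL_MS;
        for (long i = 0; i < leaked && !queue.isEmpty(); i++) {
            queue.removeFirst();
        }
        if (leaked > 0) {
            lastLeak += leaked * OUTFLOW_INTERVAL_MS;
        }
        if (queue.size() >= BUCKET_SIZE) {
            return false; // bucket full: drop the request
        }
        queue.addLast(now); // enqueue; it will be processed at the fixed rate
        return true;
    }
}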

High Level Diagram (HLD)

Let us ask a more general question: how do we want to limit the rate? Are we going to limit based on some user_id or just the IP? What will the HLD (high-level diagram) look like? Here’s an overly simplified HLD of an API rate limiter. It sits between the client and the edge services such as the API gateway. Nowadays, rate limiters often ship within the API gateway itself: an API gateway is a fully managed service that supports rate limiting, SSL termination, authentication, IP whitelisting, serving static content, etc. But for this article we’ll assume the API gateway doesn’t provide such functionality.

The basic idea of a rate limiting algorithm is quite simple. At a high level, all we need is a counter to keep track of how many requests are sent from the same user, IP address, etc. If the counter is larger than the limit, the request is dropped.

HLD of API Rate limiter

Let’s go over the above diagram. When a user wants to access a service, the request first goes to the rate limiter. Based on the chosen algorithm, the rate limiter decides whether to allow the request or drop it. To store the counter’s previous state we need some sort of storage, and key-value datastores are the best design choice for a rate limiter, since lookups and updates are done in near-constant time and therefore add almost no extra latency.

What do we store in the database, anyway?

Counters? Well, depending on which algorithm we use, the data structure will change. For example, in the case of the fixed window we’d keep the following information:

{
  "user059224": {
    "count": 2,
    "timestamp": "12:00:00"
  },
  "user059123": {
    "count": 1,
    "timestamp": "12:01:00"
  }
}

Notice that we are storing both the count and the timestamp for each user. The rate limiter then checks whether we are exceeding the threshold as follows:

  1. It first checks if currentTime - timestamp < 1m. If so, we are in the middle of a window. It then checks the count; since the count here is 2, which is below the threshold, it increments it and forwards the request to the gateway.
  2. If currentTime - timestamp > 1m, we are in a new window, so the first thing we do is reset the timestamp to currentTime and the count to 1.
  3. Lastly, if currentTime - timestamp < 1m but the count is already 3, the request is dropped.

The above can easily be implemented with an in-memory hash map, but if we are expecting millions of users, keeping this state inside a single application process will break the limiter. For this, we could go with Redis, a popular option for implementing rate limiting: it is an in-memory store that offers fast lookups and updates and can be shared across rate limiter instances.

What could go wrong in the above implementation?

  1. Since the above steps implement a Fixed Window, they inherit the issue we discussed earlier: bursty traffic at window boundaries can exceed the threshold.
  2. Atomicity is another problem with this approach, because we first read the count and then update it, so we can face a race condition. Say a user initiates two concurrent requests that are served by two parallel threads; both read the count as 2 and both are allowed, when ideally one of them should have been blocked. A sketch of how to avoid this with Redis follows below.
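
One common fix for the race condition is to let the datastore do the increment atomically. Here is a minimal sketch using Redis’s INCR command via the Jedis client; the class name, key format, and connection details are my own assumptions. INCR increments and returns the new value in a single atomic step, so two concurrent requests can never both observe the same count.

import redis.clients.jedis.Jedis;

public class RedisFixedWindowLimiter {

    private static final int THRESHOLD = 3;
    private static final int PERIOD_IN_SECONDS = 1;

    private final Jedis jedis = new Jedis("localhost", 6379); // assumes a local Redis

    public boolean request(String userId) {
        // Key the counter by user and by the current window number,
        // e.g. "rl:1234:1720862400", so each window gets a fresh counter.
        long window = (System.currentTimeMillis() / 1000) / PERIOD_IN_SECONDS;
        String key = "rl:" + userId + ":" + window;
        long count = jedis.incr(key); // atomic read-and-increment
        if (count == 1) {
            // First request of this window: let the key expire on its own
            // (twice the period, as a safety margin).
            jedis.expire(key, PERIOD_IN_SECONDS * 2);
        }
        return count <= THRESHOLD;
    }
}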

The HLD also doesn’t say anything about the rate-limiting rules. There can be cases where we want to apply different rate limits depending on the type of request.

For example, we may allow only 5 OTP requests per minute but 20 post creations per day. Notice that both the counts and the periods differ.

Lyft open-sourced their rate-limiting components. Let’s have a look at them.

domain: post_creation
descriptors:
  - key: posts
    value: account
    rate_limit:
      unit: day
      requests_per_unit: 20

This rule says that a user cannot create more than 20 posts in a day. These rules are often stored on disk, as they rarely change, but we can cache them in memory for faster lookups, as sketched below.
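
As a rough illustration of what that cache could look like, here is a hypothetical in-memory representation of such rules in Java. The record and its fields are my own mapping of the YAML above, not Lyft’s actual code.

import java.util.Map;

public class RuleStore {

    // One rule per domain, mirroring the YAML configuration above.
    record RateLimitRule(String domain, String unit, int requestsPerUnit) {}

    // Rules are loaded from disk at startup and cached here for fast lookups.
    private final Map<String, RateLimitRule> rulesByDomain = Map.of(
            "post_creation", new RateLimitRule("post_creation", "day", 20),
            "otp_request", new RateLimitRule("otp_request", "minute", 5));

    RateLimitRule ruleFor(String domain) {
        return rulesByDomain.get(domain);
    }
}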

Back of the Envelope Estimation

Fixed Window Algorithm

Let’s answer some back-of-the-envelope questions. Say we were to implement this on a single machine: how much memory would we need to store all of the users’ data?

Let’s say the userId takes 8 bytes (assuming roughly 1 byte per character in Java). The counter is an integer, which requires 4 bytes, and the timestamp is usually a long, which takes 8 bytes in Java.

So for 1 user, we’d need at least 8 + 4 + 8 = 20 bytes.

We’ll also be maintaining some metadata (the internals of the hash table); for that, let’s provision 20 more bytes per record.

Thus, a single user needs about 40 bytes, and scaling this to one million users, the total memory we’d need is 40 * 10⁶ bytes = 40 MB.

40 MB can easily fit on a single server; however, we would still not want to route all our traffic through a single rate limiter. Why? Because if we allow each user 10 requests per second, one million users amount to 10 * 1 million = 10 million RPS (requests per second), which is far too much for a single server.

Thus, for this sort of requirement, a distributed cache is inevitable, and Redis or Memcached are useful here.

Here’s an in-memory implementation of the Fixed Window algorithm:

import java.util.HashMap;
import java.util.Map;

public class Limiter {

    private final Map<String, FixedWindow> hashMap = new HashMap<>();
    // How many requests we are going to allow per window.
    private static final int THRESHOLD = 3;
    // The window length: a value of 1 means we allow THRESHOLD requests per second.
    private static final int PERIOD_IN_SECONDS = 1;

    static class FixedWindow {
        public int count;
        public long timestamp;

        public FixedWindow(int count, long timestamp) {
            this.count = count;
            this.timestamp = timestamp;
        }
    }

    // Returns true if the request was forwarded, or false if it was
    // dropped because the limit for the current window was exceeded.
    public boolean request(String userId) {
        FixedWindow fixedWindow = hashMap.get(userId);
        long currentTime = System.currentTimeMillis();
        if (fixedWindow == null || currentTime - fixedWindow.timestamp > PERIOD_IN_SECONDS * 1000L) {
            // First request ever, or the previous window has expired: start a new window.
            hashMap.put(userId, new FixedWindow(1, currentTime));
            return true;
        }
        // We are inside the current window: check if we have breached the count already.
        if (fixedWindow.count < THRESHOLD) {
            fixedWindow.count += 1;
            return true;
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        Limiter limiter = new Limiter();
        System.out.println("Sent to API Gateway Server? " + (limiter.request("1234") ? "Yes" : "No"));
        System.out.println("Sent to API Gateway Server? " + (limiter.request("1234") ? "Yes" : "No"));
        System.out.println("Sent to API Gateway Server? " + (limiter.request("1234") ? "Yes" : "No"));
        System.out.println("Sent to API Gateway Server? " + (limiter.request("1234") ? "Yes" : "No"));
        Thread.sleep(1000);
        System.out.println("Sent to API Gateway Server? " + (limiter.request("1234") ? "Yes" : "No"));
        System.out.println("Sent to API Gateway Server? " + (limiter.request("1234") ? "Yes" : "No"));
    }
}

Sliding Window Algorithm

In this algorithm we can use a deque associated with each userId: a deque of timestamps.

{
  "user059224": [
    "12:00:00",
    "12:01:10",
    "12:01:28"
  ],
  "user013571": [
    "14:01:02",
    "14:05:11",
    "14:20:00"
  ]
}

When a user hits the API:

  • We first fetch the deque associated with that user ID and remove from the front all elements that are out of range. Say a request comes in at 12:01:30 for user user059224, and we are allowing 10 requests per minute. In that case all timestamps before 12:00:30 (t - 1m) are invalid, so we remove them. This gives us the following state:
{
  "user059224": [
    "12:01:10",
    "12:01:28"
  ],
  "user013571": [
    "14:01:02",
    "14:05:11",
    "14:20:00"
  ]
}
  • At this moment the size of the queue is only 2, but we can accommodate 10 requests, so we add the timestamp to the back of the queue and let the request pass to the API gateway. This gives us the following state:
{
  "user059224": [
    "12:01:10",
    "12:01:28",
    "12:01:30"
  ],
  "user013571": [
    "14:01:02",
    "14:05:11",
    "14:20:00"
  ]
}
  • Let’s compute how much memory we’ll need to store all the user data for the sliding window. Again, the userId takes 8 bytes, but this time we store a list of timestamps, and each timestamp is a long, so each element takes about 8 bytes. Since we support 10 requests per minute, the list contains at most 10 elements in the worst case. Let’s also budget 12 bytes of overhead per entry for the deque implementation and 20 bytes for the hash table.
  • Thus one user takes: 8 (userId) + (8 (timestamp) + 20 (hash table) + 12 (deque)) * 10 = 408 bytes.
  • For 1 million users we’ll need about 408 MB, which is still manageable.
  • Notice what happens if we widen the window to hours. If we allow 500 requests per hour, we must store 500 timestamps, giving 8 + (8 + 20 + 12) * 500 ≈ 20,000 bytes = 20 KB per user, and roughly 20 GB for 1 million users.
  • The sliding window thus takes far more memory than the fixed window.
  • Here’s an in-memory implementation:
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public class Limiter {

    private final Map<String, Deque<Long>> hashMap = new HashMap<>();
    // How many requests we are going to allow per window.
    private static final int THRESHOLD = 3;
    // The window length: a value of 1 means we allow THRESHOLD requests per second.
    private static final int PERIOD_IN_SECONDS = 1;

    public boolean request(String userId) {
        Deque<Long> deque = hashMap.get(userId);
        long currentTime = System.currentTimeMillis();
        // First request from this user: the deque won't be initialized yet.
        if (deque == null) {
            deque = new ArrayDeque<>();
            deque.addLast(currentTime);
            hashMap.put(userId, deque);
            return true;
        }
        // Evict the timestamps that no longer fall inside the (t - 1s, t] window.
        while (!deque.isEmpty() && currentTime - deque.getFirst() > PERIOD_IN_SECONDS * 1000L) {
            deque.removeFirst();
        }
        // If the window already holds THRESHOLD requests, drop this one.
        if (deque.size() >= THRESHOLD) {
            return false;
        }
        // Otherwise record it and let it through.
        deque.addLast(currentTime);
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        Limiter limiter = new Limiter();
        System.out.println("Sent to API Gateway Server? " + (limiter.request("1234") ? "Yes" : "No"));
        System.out.println("Sent to API Gateway Server? " + (limiter.request("1234") ? "Yes" : "No"));
        System.out.println("Sent to API Gateway Server? " + (limiter.request("1234") ? "Yes" : "No"));
        System.out.println("Sent to API Gateway Server? " + (limiter.request("1234") ? "Yes" : "No"));
        Thread.sleep(1000);
        System.out.println("Sent to API Gateway Server? " + (limiter.request("1234") ? "Yes" : "No"));
        System.out.println("Sent to API Gateway Server? " + (limiter.request("1234") ? "Yes" : "No"));
    }
}

Note: production-grade Token Bucket and Leaky Bucket implementations are out of the scope of this article; the sketches above only outline the core idea.

Before we end this article, let’s discuss how we should rate limit requests. Shall we do it by IP or by user ID?

Rate Limiting Criteria

Rate limiting can be implemented either by IP address or by user ID, depending on the specific needs of your application. IP-based rate limiting is simpler and effective for public APIs, providing immediate protection against DoS attacks by limiting requests from a single IP. However, it may unfairly restrict users sharing the same IP or allow attackers to bypass limits using proxies.

User ID-based rate limiting ensures fair usage for each authenticated user, offering a consistent experience even if users change IP addresses. This approach, though, requires authentication and is more complex to implement.

For optimal protection, you can combine both methods: primarily limiting by user ID to ensure fairness and secondarily by IP to guard against abusive IP behavior. This hybrid strategy balances user fairness with robust security.
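
As a sketch of how the two could be combined, here is a minimal composition reusing the fixed-window Limiter defined earlier. The composition and key prefixes are my own; note that in this naive version a request denied by one check may still have consumed quota in the other, which a production limiter would avoid.

public class HybridLimiter {

    // Two independent fixed-window limiters: one keyed by IP, one by user ID.
    private final Limiter ipLimiter = new Limiter();
    private final Limiter userLimiter = new Limiter();

    public boolean request(String userId, String ip) {
        // The per-IP guard runs first to absorb abusive IPs cheaply;
        // the per-user quota then ensures fairness across users.
        return ipLimiter.request("ip:" + ip) && userLimiter.request("user:" + userId);
    }
}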

Please note: this implementation is overly simplified and high-level, but it gives an overall understanding of how things work under the hood. I hope this article helped you understand the basics of rate limiters.

Thanks a lot. Please visit my website to show your support. It would mean a lot to me: https://www.0xkishan.com/

References:

  1. https://www.0xkishan.com
  2. https://www.cloudflare.com/en-gb/learning/bots/what-is-rate-limiting/
  3. https://www.educative.io/courses/grokking-modern-system-design-interview-for-engineers-managers/system-design-the-rate-limiter
