Redis Optimizations
I thought I’d write up some of my experiences at Solcast building a global solar forecasting system, and some of the challenges I had storing and updating millions of forecasts every 5-15 minutes.
Problem
We needed somewhere to store weather and solar forecast data generated every time a new satellite image was pulled from various geostationary satellites like Himawari-8/9, GOES-17, and others. Each time a new satellite image came in, a set of geo-spatial positions within that region was used to generate a forecast for each point. This could be up to ~150k points, each with many parameters for a time series of 8-48 time horizons. So a single satellite region could generate nearly 100M data points that needed to be stored and made accessible quickly through an API.
Why Redis
One of the main reasons I chose Redis to solve this problem is that forecast data was only valuable for a short period of time. As soon as a newer forecast was generated, the old one’s value to API users dropped and it could be replaced, since better data was now available. The shorter your forecasts, the more accurate they can be. Due to this temporal nature of forecasting, we only needed ‘hot’ forecasts available through the API, and the latency of having them available was extremely important.
Redis here was plain AWS ElastiCache, running whatever the latest version was between 2016 and 2020; I think somewhere between versions 4 and 5.
Another big reason was that I already had Redis in the stack to handle sessions centrally, and didn’t want to add more cost to our growing AWS bill, or more complexity for the team to manage, which at the time, in late 2016, was just myself doing everything but the modeling, and wouldn’t grow until later in 2017. We also had Postgres with PostGIS in the stack, but this was too slow, and would have pushed our already strained database and its IOPS beyond what we could afford with the runway we had. Also, AWS RDS IOPS really sucked at the time.
Also, the API was designed to be simple for the main use-case of “Give me the latest forecast for location X”. The data was generated by a large single server running MATLAB, as the founder built his models using what he knew, and we needed to keep it that way since it was primarily just the two of us building the system. The location the user specified was never going to be exactly where we had data, so we needed to provide an inverse distance weighted average of multiple points, requiring a way to quickly find points near a given location. Given all these limitations, accepting some additional latency by writing the data to S3, which then fed the API servers handling ingestion, was the fastest way forward.
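The weighting itself is the easy half of that; the hard part is finding the nearby points quickly. As a rough sketch of the averaging step (the method name and tuple layout here are my own illustration, not the production code):

```csharp
using System;

static class Idw
{
    // Inverse distance weighted average of nearby forecast points.
    // neighbours: (forecast value, distance from the user's location in km)
    public static double Average((double Value, double DistanceKm)[] neighbours, double power = 2)
    {
        double num = 0, den = 0;
        foreach (var (value, distanceKm) in neighbours)
        {
            // If the user is effectively on a grid point, just return its value.
            if (distanceKm < 1e-6) return value;
            var w = 1.0 / Math.Pow(distanceKm, power);
            num += w * value;
            den += w;
        }
        return num / den;
    }
}
```

Closer points dominate the result, and the `power` parameter controls how sharply influence falls off with distance.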
In hindsight, if more time and/or people had been available, I would likely have gone with something like GridDB.
In summary we have:
- Forecasts lose value over time
- We have new forecasts constantly coming in
- We need to distance-average forecasts
- We need to look up points near a user-given location
So from a Redis point of view, we have the following problems:
- Data needs to be written quickly
- Data needs to be replaced efficiently
- Data needs to be atomically available for a given region/set
- Data needs to be indexed geo-spatially
Writing data quickly - Pipelining
In a nutshell, pipelining lets you issue multiple commands without waiting for each response. While Redis responds quickly, as the number of commands grows, the round-trip latency can start to add up significantly. Let’s look at an example.
using System.Diagnostics;
using ServiceStack.Redis;

const int iterations = 100000;
var redisManager = new RedisManagerPool("localhost:6379");

// Non-pipelined approach: one network round trip per SET
var sw = Stopwatch.StartNew();
using (var redis = redisManager.GetClient())
{
    for (int i = 0; i < iterations; i++)
    {
        redis.Set($"key{i}", $"value{i}");
    }
}
sw.Stop();
Console.WriteLine($"Non-pipelined: {sw.ElapsedMilliseconds}ms");
The above is a simple example of using Redis via the ServiceStack.Redis library, just issuing SET commands. Running locally, this takes an average of around 5200ms. We can run the same commands using pipelining:
// Pipelined approach: queue the commands, then send them in one batch
sw.Restart();
using (var redis = redisManager.GetClient())
{
    using (var pipeline = redis.CreatePipeline())
    {
        for (int i = 0; i < iterations; i++)
        {
            pipeline.QueueCommand(r => r.Set($"key{i}", $"value{i}"));
        }
        pipeline.Flush();
    }
}
sw.Stop();
Console.WriteLine($"Pipelined: {sw.ElapsedMilliseconds}ms");
Here we get an average over 5 runs of just 260ms: a roughly 20x improvement from a fairly simple change.
Discarding old data efficiently
In the above example we are setting a new key for each value we want to store. We can add a Time To Live (TTL) of 10 seconds to the command like so:
pipeline.QueueCommand(r => r.Set($"key{i}", $"value{i}", TimeSpan.FromSeconds(10)));
This means that each key will expire after 10 seconds. Redis takes care of this for us, but it doesn’t happen for free: expiring a large number of keys can cause large CPU spikes on the Redis machine. You could just reuse keys without a TTL; however, in this case we can’t mix forecasts from different updates, as that would serve users forecasts we can’t account for or reproduce, since the results are geo-spatially averaged. So how do we avoid this load on our system? Use the right data structure.
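For reference, the same expiry can be expressed at the command level, either inline with SET or as a separate EXPIRE (key and value names here are just placeholders):

```
SET key0 value0 EX 10
EXPIRE key0 10
TTL key0
```

`SET ... EX` sets the value and TTL in one round trip; `TTL` reports the remaining seconds, which is handy when debugging expiry behaviour.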
Mapping data structures to domains
One of the biggest optimizations I’ve found when dealing with large datasets generally comes from knowing more about the domain, and exploiting how the data will be used to improve the speed and cost of storing it. E.g., you can make trade-offs between flexibility, cost, and speed. In our case, the Solcast product centered on users providing a point of interest (a latitude and longitude), and data was updated all at once per satellite region, so we could group the points for each satellite into a Redis hash.
Redis hashes are what they sound like: you can think of them as a mini Redis key-value store living at a standard key. E.g., if we have a hash at key “myhashkey”, we can set a field “myfield” to the value “Hello” with:
HSET myhashkey myfield "Hello"
Or in C#:
pipeline.QueueCommand(r => r.SetEntryInHash("myhashkey", "myfield", "Hello"));
HSET is just as efficient as SET, so we don’t trade off performance by using it. After we set all the data we need in the hash, we can expire the key like any other. This means Redis only has to track and expire one key covering all the fields in the hash.
using (var redis = redisManager.GetClient())
{
    using (var pipeline = redis.CreatePipeline())
    {
        for (int i = 0; i < iterations; i++)
        {
            // One field per point, all under a single hash key
            pipeline.QueueCommand(r => r.SetEntryInHash("myhashkey", $"field{i}", $"value{i}"));
        }
        // Set a single TTL for the whole hash
        pipeline.QueueCommand(r => r.ExpireEntryIn("myhashkey", TimeSpan.FromSeconds(10)));
        pipeline.Flush();
    }
}
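At the command level, the equivalent is a batch of HSETs followed by a single EXPIRE (field names are placeholders; since Redis 4.0, HSET also accepts multiple field/value pairs in one command):

```
HSET myhashkey field0 value0
HSET myhashkey field1 value1 field2 value2
EXPIRE myhashkey 10
```

However many fields the hash holds, Redis only ever has one expiry timer to track, which is what avoids the expiry CPU spikes described above.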
Atomic updates per region
This requirement isn’t so much an optimization as an example of matching Redis data structures to the domain and requirements, making trade-offs along the way. Each satellite’s data lives in its own key, so we can store a pointer to the timestamped ‘live’ key. This costs an additional lookup per request, but gives us an atomic update, since we only update the pointer once the Flush command has finished. E.g., the key himawari-8-2025-01-01T01:25:00Z might contain all our forecast data, and live-himawari-8 would be set to himawari-8-2025-01-01T01:20:00Z before the update has finished, and himawari-8-2025-01-01T01:25:00Z after, making sure our data is consistent.
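Using the key names above, the pattern sketches out as follows (the field name and TTL are illustrative, and the `#` lines are annotations rather than commands):

```
# write the new region's data under a timestamped key
HSET himawari-8-2025-01-01T01:25:00Z point:42 "<serialized forecast>"
EXPIRE himawari-8-2025-01-01T01:25:00Z 3600

# flip the pointer only after every write has flushed
SET live-himawari-8 himawari-8-2025-01-01T01:25:00Z
```

A read then dereferences the pointer first: `GET live-himawari-8`, then `HGET` against whichever timestamped key that returns. Readers either see the complete old region or the complete new one, never a half-written mix.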
Geo-spatial index
Redis does do geo-spatial lookups; it does, however, use a simplistic model of the earth as a perfect sphere, with a worst-case error of around 0.5%. We used it initially for simplicity, but switched to just caching results from PostGIS for consistency. GEOSEARCH can also be quite intensive if you are looking up millions of points quickly, and clients were generally interested in fixed locations of solar assets, like utility-scale solar farms, which made caching the lookups another appropriate optimization.
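For completeness, the built-in geo commands look like this (key and member names are placeholders; note that GEOSEARCH only arrived in Redis 6.2, so on the Redis 4/5 versions mentioned earlier the equivalent command was GEORADIUS):

```
GEOADD forecast-points 151.21 -33.87 point:42
GEOSEARCH forecast-points FROMLONLAT 151.20 -33.86 BYRADIUS 5 km ASC WITHDIST
GEORADIUS forecast-points 151.20 -33.86 5 km WITHDIST ASC
```

Both queries return the members within 5km of the given longitude/latitude, nearest first, with their distances, which maps directly onto the inverse-distance-weighting step described earlier.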
Conclusion
These changes were implemented when Redis latency during ingestion started to impact clients, and CPU utilization went from peaking at ~70% over 1-minute windows to barely breaking 4% under heavy load. While it wasn’t the optimal solution to the problem, given the resource and time constraints at the time, it worked out well. During my 4+ years in this role we had multiple RDS instance-loss events and various other infrastructure issues, but Redis running on ElastiCache was rock solid. Though it has since been forked as Valkey, it is still a great, simple tool that, for the most part, just works.