The Importance of Idempotency in Data Engineering

Written by Aaron Fowles, 24th October 2022


Idempotency is a property of a process: repeated attempts to apply an operation do not result in unintentionally repeated data. Put another way, your process must always produce the same result regardless of how many times it runs.
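
To make that concrete, here is a purely illustrative sketch (the record and table names are my own, not from any particular system) contrasting an append, which duplicates data on every re-run, with a keyed upsert, which leaves the destination in the same state however many times it runs:

```python
# Illustrative only: append vs keyed upsert under repeated runs.
table_append = []   # append-only destination
table_upsert = {}   # destination keyed by record id

record = {"id": 42, "status": "shipped"}

for _ in range(3):                        # simulate three runs of the process
    table_append.append(record)           # non-idempotent: duplicates pile up
    table_upsert[record["id"]] = record   # idempotent: state never changes

assert len(table_append) == 3             # three copies of the same record
assert table_upsert == {42: record}       # identical to a single run
```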

Why?

This is important for the obvious reason that we do not want duplicate data, that is, multiple instances of what should be a single record. However, idempotent processes come with other benefits too. They are inherently easier to test than state-based approaches (being more akin to the Functional Programming paradigm). They are also easy to scale horizontally, as we can deterministically map inputs to outputs across many processes.

We've established that idempotency is a good thing, but how do we achieve it?

How?

To avoid duplicate data you must ensure your process does its own housekeeping - clearing out any data from the destination it is responsible for writing to. Remember Murphy's Law when you design idempotent systems: you have to assume that your process will fail in many unforeseen ways. Make the process do some anticipatory clean-up to protect against the fact that it may have previously failed without you knowing exactly why.
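
As a minimal sketch of that housekeeping, assuming the process owns a single date partition in its destination (the paths and extraction step below are hypothetical stand-ins), the run deletes whatever a previous, possibly failed attempt left behind before writing its own output:

```python
import json
import shutil
from pathlib import Path


def extract_and_transform(run_date: str) -> list[dict]:
    # Stand-in for the real extraction/transformation logic.
    return [{"id": 1, "date": run_date, "amount": 9.99}]


def run_daily_job(run_date: str, output_root: str = "/tmp/warehouse/orders") -> None:
    """Write one calendar day's output, clearing any earlier attempt first."""
    partition = Path(output_root) / f"date={run_date}"

    # Anticipatory clean-up: a previous run may have failed part-way through
    # and left partial files behind, so remove the partition this run owns.
    if partition.exists():
        shutil.rmtree(partition)
    partition.mkdir(parents=True)

    rows = extract_and_transform(run_date)
    (partition / "part-000.jsonl").write_text(
        "\n".join(json.dumps(row) for row in rows)
    )


# Re-running the same day any number of times leaves exactly one complete
# copy of that day's data in the destination.
run_daily_job("2022-10-24")
run_daily_job("2022-10-24")
```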

That's easy if you know the destination, but much more challenging if you are writing to a stream that will be processed by other, as yet unknown, processes later. If you are writing to a stream, make clear within the data itself, ideally via a modelled schema such as Avro or Protocol Buffers, what the "key" or "id" of the record is. Also include some other value, usually a timestamp or an incrementing integer, that downstream consumers can use to select the most recent version of the record.
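
A sketch of what that might look like, assuming Avro as the modelled schema (the record and field names here are hypothetical): the key and an event timestamp are explicit in every message, so any downstream consumer can deduplicate to the latest state without knowing anything about the producer:

```python
# Hypothetical Avro schema for a stream record: the record's key and an
# event-time field are part of the data itself.
ORDER_SCHEMA = {
    "type": "record",
    "name": "OrderUpdate",
    "fields": [
        {"name": "order_id", "type": "string"},   # the record's key / id
        {"name": "event_time",                     # used to pick the latest
         "type": {"type": "long", "logicalType": "timestamp-millis"}},
        {"name": "status", "type": "string"},
    ],
}


def latest_per_key(events: list[dict]) -> dict[str, dict]:
    """Downstream consumer: keep only the most recent state of each key."""
    latest: dict[str, dict] = {}
    for event in events:
        current = latest.get(event["order_id"])
        if current is None or event["event_time"] > current["event_time"]:
            latest[event["order_id"]] = event
    return latest
```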

Once you guarantee idempotency for your process, there are a number of ways that instances of that process can be orchestrated to ensure the system as a whole works once deployed.

Time-based

In the world of daily batch jobs, idempotency is easy. Parameterise each run by calendar day. This allows us to back-fill in parallel and, crucially for Big Data, makes the scalability of our job predictable - i.e. we'll never have to rely on a single process chugging through from day dot up to the present date. This approach can be extended to more granular time-frames, down to an hour, and we can even make the unit of parallelisation configurable. If there is genuine value to the organisation in anything more frequent than an hour, it may be worth considering stream processing rather than a batch process.
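
As a minimal sketch of the parallel back-fill (reusing the hypothetical run_daily_job from the earlier snippet), each day maps deterministically onto its own partition, so the jobs can run in any order and be retried safely:

```python
from concurrent.futures import ProcessPoolExecutor
from datetime import date, timedelta


def backfill(start: date, end: date, max_workers: int = 8) -> None:
    """Run one idempotent daily job per calendar day, in parallel."""
    days = [start + timedelta(days=n) for n in range((end - start).days + 1)]
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        # Days are independent units of parallelisation: any of them can be
        # re-run or run out of order without affecting the others.
        list(pool.map(run_daily_job, (d.isoformat() for d in days)))


if __name__ == "__main__":
    backfill(date(2022, 1, 1), date(2022, 10, 24))
```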

There are multiple scenarios that could still lead to duplicate data in the process as a whole, though. One is that this approach eliminates duplicates within our unit of parallelisation but not between units - this will happen if a duplicate appears in multiple different calendar days, for example. Using event time rather than ingest or processing time as the unit of parallelisation addresses this to some extent. However, it is often the case that the "duplicates" are actually the history of a record's mutation over time and we are only interested in its most recent state. Retrospectively analysing the full history to identify the most recent records is difficult to scale horizontally in traditional Data Lakes; for this to work effectively you need an additional element of indexing or hashing.
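
For illustration, "most recent state per key" is commonly expressed as a window over event time, sketched here with Spark (the column names and toy data are my own); the point above is that running this over a full, unindexed history in a traditional Data Lake is exactly what becomes expensive at scale:

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

# Toy history of records mutating over time, spread across calendar days.
events = spark.createDataFrame(
    [("order-1", "2022-10-23T09:00:00", "created"),
     ("order-1", "2022-10-24T11:30:00", "shipped"),
     ("order-2", "2022-10-24T12:00:00", "created")],
    ["order_id", "event_time", "status"],
)

# Rank each record's versions by event time and keep only the latest, so
# "duplicates" that are really older versions are dropped regardless of
# which day's partition they landed in.
latest = (
    events
    .withColumn("rn", F.row_number().over(
        Window.partitionBy("order_id").orderBy(F.col("event_time").desc())))
    .filter(F.col("rn") == 1)
    .drop("rn")
)
latest.show()
```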

Hash-based

All of the properties of an idempotent process apply to hashing in the same way that they do to time-based approaches. The difference is that the unit of parallelisation may be based on something other than time. Hashing an id, for example, allows you to reduce the problem to a smaller subset of all IDs. Then, as long as the storage metadata contains a lookup of the hashes of the IDs contained within each block, the processing engine can skip many partitions of data and identify our most recent record far more efficiently.
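
A minimal sketch of the idea (the bucket count and partition naming are assumptions, not from the article): a stable hash of the id routes each record to a fixed bucket at write time, and that same bucket is the only place a reader has to look for the record's latest state:

```python
import hashlib

N_BUCKETS = 64  # hypothetical number of hash buckets / partitions


def bucket_for(record_id: str) -> int:
    """Map a record id onto a stable bucket via a deterministic hash."""
    digest = hashlib.md5(record_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % N_BUCKETS


# At write time, each record lands in the partition named by its bucket,
# e.g. bucket=17/. At read time, finding the latest state of "order-1"
# only requires scanning that one bucket rather than the whole table.
print(bucket_for("order-1"))
```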

Google's managed service Bigtable addresses this problem at huge scale, and it's well worth reading up on how they achieve that, although it is intended primarily for real-time applications rather than analytics solutions. A more analytics-focused open-source option that solves many of these challenges is Apache Iceberg, which is a super-exciting project.