Data paging, that is, the retrieval of a large amount of data through a series of smaller data retrievals, is a non-trivial problem. Through Lokad, we have implemented about a dozen of extensive API integrations, and reviewed a few dozens of other APIs as well.
The conclusion is that as soon as paging is involved, nearly all web APIs get it wrong. Obviously, rock-solid APIs like the ones offered by Azure or AWS are getting it right, but those outstanding APIs are exceptions rather than the norm.
The obvious pattern that doesn't work
I have lost count of the APIs that propose the following broken pattern to page through their data, a purchase order history for example:
Where page is the page number and pagesize the number of orders to be retrieved. This pattern is fundamentally unsafe. Any order deleted while the enumeration is in-progress will shift the indices which, in turn, is likely to cause another order to be skipped.
There are many variants of the pattern, and everytime the problem boils down to: the "obvious" paging pattern leads to a flawed implementation that fail whenever concurrent writes are involved.
The "updated_after" filter doesn't work either
Another popular approach for paging is to leverage a filter on the update timestamp of the elements to be retrieved, that is:
Then, in order to page the request, the client is supposed to take the most recent updated_at value from the response and to feed this value back to the API to further enumerate the elements.
However this approach does not (really) work either. Indeed, what if all elements have been updated at once? This can happen because of a system upgrade or because of any kind of bulk operation. Even if the timestamp can be narrowed down to the microsecond, if there are 10,000 elements to be served all having the exact same udpate timestamp, then, the API will keep sending a response where max(updated_at) is equal to the request timestamp.
The client is not enumerating anymore, the pattern has failed.
Sure, it's possible to tweak the timestamps to make sure that all the elements gracefully spread over distinct values, but it's a very non-trivial property to enforce. Indeed, a datetime column isn't really supposed to be defined with unicity constraint in your database. It's feasible, but odd and error prone.
The fallacy of the "power" APIs
Some APIs provides powerful filtering and sorting mechanisms. Thus, through those mechanims, it is possible to correctly implement paging. For example by combining two filters: one the update datetime of the items and one on the item identifier. A correct implementation is far from trivial however.
Merely offering the possibility to do the right thing is not sufficient: doing the right thing should be the only one possibility. This point is something that Lokad learned the hard way early on: web APIs should offer one and only one way to do each intended operation.
If the API offers a page mechanism but that the only way to correctly implement paging is to not use it; then, rest assured that the vast majority of the client implementations will get it wrong. From a design viewpoint, it's like baiting developers into a trap.
The "continuation token" as the only pattern that works
To my knowledge, there is about only one pattern that works for paging, it's the continuation token pattern.
Where every request to a paged resource like the purchase orders has the possibility of returning a continuation token on top of the elements returned when not all elements could be returned in one batch.
On top of being correct, that pattern has two key advantages:
- It's very hard to get it wrong on the client side. There is only one way to do anything with the continuation token: it's to feed it again to the API.
- The API is not commited into returning any specific number of elements (in practice a high upper bound can still be documented). Then, if some elements are particularly heavy or if the server is already under heavy workload, smaller chunks can be returned.
This enumeration should not provide any garantee that the same element won't be enumerated more than once. The only garantee that should be provided by the paging through tokens is that ultimately all elements will be enumerated at least once. Indeed, you don't want to end-up with tokens that embed some kind of state on the API side; and in order to keep it smooth and stateless, it's important to lift this constraint.
Then, continuation tokens should not expire. This property is important in order to offer the possibility to the client perform incremental update on a daily, weekly or even on a monthly schedule depending on what makes sense from a business viewpoint.
No concurrency but data partitions
The continuation token does not support concurrent data retrieval: the next response has to be retrieved before being able to post the next request. Thus, in theory, this pattern somehow limit the amount of data that can be retrieved.
Well, it's somewhat true, and yet mostly irrevelant. First, Big (Business) Data is exceedingly rare in practice, as the transation data of the largest companies tend to fit on a USB key. For all the APIs that we have integrated, putting aside the cloud APIs (aka Azure or AWS), there was not a single integration where the amount of data was even to close to justifying concurrent data accesses. Slow data retrieval is merely a sign a non-incremental data retrieval.
Second, if the data is so large that concurrency is required, then, partitionning the data is typically a much better approach. For example, if you with to retrieve all the data from a large retail network, then the data can be partitionned per store. Partitionning will be making things easier both on the API side and on the client side.