Hectic times. Looking at the last entry, I realize it has been half a year already since my last post.
The Big Data projects I do, and the more I realize how usually scalability aspects for business projects are irrelevant to the point that the quasi-totality of the valuable data crunching processes could actually be run on a smartphone if the proper approaches are taken. Obviously there is no point in actually doing the analysis on a smartphone, this merely illustrating that really it does not take much computational power.
While all vendors are boasting being able to crunch terabytes of data, it turns out that it's very rare to even face dataset bigger than 100MB when properly represented in memory. The catch is that between a fine tuned data representation and a verbose representation - say XML or SQL; there is typically a factor 100x to 1000x involved as far the data footprint is concerned.
The simplest way to deal with Big Data is to turn it in to Small Data. Let's review a couple of handy tricks frequently used at Lokad to compress data.
1. Get rid of everything that is not required
While this might seem obvious, whenever we tackle a Big Data, we typically start by ditching about 90% of the data that is not even required for the task at hand. Frequently, this covers unused fields and segments of the data that can be safely excluded for the analysis.
2. Turn dates in 16-bits integers when the time is not needed
A date-time is represented as an 8-bytes data structure in most languages. Yet, a single unsigned 16-bits integer gives you 65536 combinations, that is, enough to cover 179 years of daily increments, which proves to be usually sufficient. That's a 4x memory saving.
3. Turn 8-bytes floating point values in 4-bytes or even 2-bytes values
Whenever money is involved, businesses rely on 8-bytes or even 16-bytes floating point values. However, from a statistical viewpoint, such a precision typically makes little sense, it's like computing everything in grams, to finally upper round the final result to the next ton. The 2-bytes precision, aka the half-precision floating point format, is sufficient to accurately represent the price of most consumer goods for example. That's a 4x memory saving.
4. Replace strings by keys with lookup tables
The lookup tables are extremely simple and fast data structure. Depending on the situation, you can typically use lookups to replace fields that contain strings but with many repeated occurrences. Your mileage may vary (YMMV) but lookups, when applicable frequently bring a 10x memory saving.
5. Get rid of objects, use value types instead
Objects (as in C# objects or Java objects) are very handy, but unfortunately, they come with a significant memory overhead, typically of 16-bytes per object when working under 64-bits environments, that is, the default situation nowadays. To avoid those situations, you need to use value types (aka struct, unfortunately not available in Java). Value types usually bring a 2x memory saving.
6. Use plain arrays not "smart" collections
Most modern languages emphasize collections such as dynamic arrays; however, those collections are far from being as memory-efficient as plain old arrays. YMMV but arrays over collections frequently bring a 2x memory saving.
7. Use variable length encoding
The variable length encoding represents a simple compression pattern favoring small values over large values. This technique is especially useful when the original dataset is preprocessed to reassign the identifiers based on their usage frequency; i.e. allocating integers by decreasing frequency. YMMV depending on the actual distribution of identifiers in the dataset, but this typically grants a 4x memory saving.
8. Vectorialize listing when possible
Many data represented as listings in their original relational representation can be vectorialized somehow. For example, if I am interested in the analysis of the return frequency of a web visitor over the last 6 months on a given website, a bit array of 184 bits (aka 23 bytes) can already provides boolean flags of visits for any given day over the last 6 months. When application, this typically grants a 10x memory saving.