I am Joannes Vermorel, founder at Lokad. I am also an engineer from the Corps des Mines who initially graduated from the ENS.

I have been passionate about computer science, software matters and data mining for almost two decades. (RSS - ATOM)


Entries in bigdata (3)


8 tips to turn your Big Data into Small Data

Hectic times. Looking at the last entry, I realize it has been half a year already since my last post.

The Big Data projects I do, and the more I realize how usually scalability aspects for business projects are irrelevant to the point that the quasi-totality of the valuable data crunching processes could actually be run on a smartphone if the proper approaches are taken. Obviously there is no point in actually doing the analysis on a smartphone, this merely illustrating that really it does not take much computational power.

While all vendors are boasting being able to crunch terabytes of data, it turns out that it's very rare to even face dataset bigger than 100MB when properly represented in memory. The catch is that between a fine tuned data representation and a verbose representation - say XML or SQL; there is typically a factor 100x to 1000x involved as far the data footprint is concerned.

The simplest way to deal with Big Data is to turn it in to Small Data. Let's review a couple of handy tricks frequently used at Lokad to compress data.

1. Get rid of everything that is not required

While this might seem obvious, whenever we tackle a Big Data, we typically start by ditching about 90% of the data that is not even required for the task at hand. Frequently, this covers unused fields and segments of the data that can be safely excluded for the analysis.

2. Turn dates in 16-bits integers when the time is not needed

A date-time is represented as an 8-bytes data structure in most languages. Yet, a single unsigned 16-bits integer gives you 65536 combinations, that is, enough to cover 179 years of daily increments, which proves to be usually sufficient. That's a 4x memory saving.  

3. Turn 8-bytes floating point values in 4-bytes or even 2-bytes values

Whenever money is involved, businesses rely on 8-bytes or even 16-bytes floating point values. However, from a statistical viewpoint, such a precision typically makes little sense, it's like computing everything in grams, to finally upper round the final result to the next ton. The 2-bytes precision, aka the half-precision floating point format, is sufficient to accurately represent the price of most consumer goods for example. That's a 4x memory saving.

4. Replace strings by keys with lookup tables

The lookup tables are extremely simple and fast data structure. Depending on the situation, you can typically use lookups to replace fields that contain strings but with many repeated occurrences. Your mileage may vary (YMMV) but lookups, when applicable frequently bring a 10x memory saving.

5. Get rid of objects, use value types instead

Objects (as in C# objects or Java objects) are very handy, but unfortunately, they come with a significant memory overhead, typically of 16-bytes per object when working under 64-bits environments, that is, the default situation nowadays. To avoid those situations, you need to use value types (aka struct, unfortunately not available in Java). Value types usually bring a 2x memory saving.

6. Use plain arrays not "smart" collections

Most modern languages emphasize collections such as dynamic arrays; however, those collections are far from being as memory-efficient as plain old arrays. YMMV but arrays over collections frequently bring a 2x memory saving.

7. Use variable length encoding

The variable length encoding represents a simple compression pattern favoring small values over large values. This technique is especially useful when the original dataset is preprocessed to reassign the identifiers based on their usage frequency; i.e. allocating integers by decreasing frequency. YMMV depending on the actual distribution of identifiers in the dataset, but this typically grants a 4x memory saving.

8. Vectorialize listing when possible

Many data represented as listings in their original relational representation can be vectorialized somehow. For example, if I am interested in the analysis of the return frequency of a web visitor over the last 6 months on a given website, a bit array of 184 bits (aka 23 bytes) can already provides boolean flags of visits for any given day over the last 6 months. When application, this typically grants a 10x memory saving.


Big Data: choosing the problem before choosing the solution

My company has started several important big data missions, and I am taking here the opportunity publish some insights are are relevant to all those initiatives.

A major (and frequent) pitfall of the Big Data projects consists of starting with a solution instead of starting with a problem. In particular, software vendors (Lokad's included) are pushing their own Big Data recipe which will randomly involve:

  • Hadoop
  • HBase
  • Amazon EC2
  • Cassandra
  • Windows Azure
  • Storm
  • Node.js
  • ...

However, the notion of "Big" data is very relative: cheap 1TB hard-drives are now available at your nearest supermarket, and very very few problems faced by companies, even very large ones, do require require more than 100 GB of data to process. 

Usually, even the largest data sources of the largest companies do fit on a smartphone when properly represented. 

Impedance mismatch of BIG frameworks

The performance achieved by well-known Big Data frameworks are mind-blowing: Facebook claims to process 100PB of data over Hadoop. That's massive, and massively impressive as well.

However, before jumping on Hadoop (or any similar Big Data frameworks), one has to really estimate the friction costs involved. While Hadoop is certainly simpler than say MPI, it's still a complicated distributed framework which do require a lot of skills to be properly and efficiently operated.

If the very same goal can be achieved on a single machine within a very acceptable timeframe, then, in my experience the dumb solution is going to be about 100x cheaper (*) and easier to run and to maintain compared to the "distributed" variant.

(*) I am not refering to hardware costs, but to wetware costs (aka people) which represents 99% of the cost anyway for virtually every company, minus a few social networks and search engines.

The untold story about Hadoop (and its peers) is that it works only if, and only if, the data is very meticuluously organized to be made suitable for a processing through the framework. If the data is incorrectly partioned, then Hadoop plus thousands of servers are no faster than a single machine.

Enterprise Big Data start at 100MB

Facebook is facing Petabytes of data, that's millions of Gigabytes, but is really your company facing that much data? Do you need to plug that much data in to solve the problem at hand? Unless you work for a short list of about 100 companies on Earth, I seriously doubt it.

I observe that for most entreprises, "Big Data" starts at 100MB when:

  • Excel is no more a solution.
  • SQL is no more a solution (*)

(*) Yes, you can have a lot more than 100MB in a SQL database. However, reading the entire dataset through SQL needs to be done with care to avoid re-scanning the data thousands of times. In practice, in 90% of the data crunching situations, I observe that it's easier to remove the SQL database, as opposed to improve the performance of the queries over the relational database.

Facing the problems

Thus, whenever data is involved, the initiative should start by facing the problems that are the true roadblock to deliver a "solution". Those problems are typically:

  • Collecting and servicing the data: About every single company I visit has problems on collecting and servicing the data. The most obvious symptom is typically the lack of documentation concerning the data itself, and all the nitty-gritty insights to need to make anything of it. No technology is going to solve that problem, only people and process.
  • Choosing the metrics to be optimized: They are so many parts of the business that could be improved through a smart exploitation of the data, that it is extremely tempting to think that some (hype) technology might be THE answer to everything. This is not going to happen. Solving a problem through data is tough, and without metrics, you don't even for sure you're moving in the right direction. Frequently, defining the metric - that is the problem to be solved - is harder than implementing the solution. 

Thus, before jumping to next cool vendor solution, I urge to start by facing the very uncool aspects of the problem. Frequently, the "solution" consists of removing an ingredient of the previous solution.


A few tips for Big Data projects

Floppy disk illustrationAt Lokad, we are routinely working on Big Data projects, primarily for retail, but with occasional missions in energy or biotech companies. Big Data is probably going to remain as one of the big buzzword of 2012, along with a big trail of failed projects. A while ago, I was offering tips for Web API design, today, let's cover some Big Data lessons (learned the hard way, as always).

1. Small Data trump Big Data

There is one area that captures most of the community interest: web data (pages, clicks, images). Yet, the web-scale, where you have to deal with petabytes of data, is completely unlike 99% of the real-world problems faced about every other verticals beside consumer internet

For example, at Lokad, we have found that the largest datasets found in retail could still be processed on a smartphone if the data is correctly represented. In short, for the overwhelming majority of problems, the relevant data, once properly partitioned, take less than 1GB.

With datasets smaller than 1GB, you can keep experimenting on your laptop. Map-reducing stuff on the cloud is cool, but compared to local experiments on your noteboook, cloud productivity is abysmal.

2. Smarter problems trump smarter solutions

Good developers love finding good solutions. Yet,when facing Big Data problem, it just too temping to improve stuff, as opposed to challenge the problem in the first place.

For example at Lokad, as far inventory optimization was concerned, we have been pushing years of efforts at solving the wrong problem.  Worse, our competitors has been spending hundreds of man-years of efforts doing the same mistake ...

Big Data means being capable of processing large quantities of data while keeping computing resource costs negligible. Yet, most problems faced in the real world have been defined more than 3 decades ago, at a time where any calculation (no matter how trivial) was a challenge to automate. Thus, those problems come with a strong bias toward solutions that were conceivable at the time.

Rethinking those problems is long overdue.

3. Being non-intrusive is scalability-critical

The scarcest resource of all is human time. Letting a CPU chew 1 million numbers is nothing. Having people reading 1 milion numbers takes an army of clercs. 

I have already posted that manpower requirements of Big Data solutions were the most frequent scalability bottleneck. Now, I believe that if any human has to read numbers from a Big Data solution, then solution won't scale. Period.

Like AntiSpam filters, Big Data solutions need to tackle problems from an angle that does not require any attention from anyone. In practice, it means that problems have to be engineered in a way so that they can be solved without user attention. 

4. Too big for Excel, treats as Big Data

While the community is frequently distracted by multi-terabyte datasets, anything that does not conveniently fit in Excel is Big Data as far practicalities go:

  • Nobody is going to have a look at that many numbers.
  • Opportunities exist to solve a better problem.
  • Any non-quasi-linear algorithm will fail at processing data in a reasonable amount of time.
  • If data is poorly architectured / formatted, even sequential reading becomes a pain.

Then comes the question: how should handle Big Data? However, the answer is typically very domain-specific, so I will leave that to a later post.

5. SQL is not part of the solution

I won't enter (here) the debate SQL vs NoSQL, instead let's outline that whatever persistence approach is adopted, it won't help: 

  • figuring out if the problem is the proper one to be addressed,
  • assessing the usefulness of the analysis performed on the data,
  • blending Big Data outputs into user experience.

Most of the discussions around Big Data end up distracted by persistence strategies. Persistence is a very solvable problem, so engineers love to think about it. Yet, in Big Data, it's the wicked parts of the problem that need the most attention.