Author

I am Joannes Vermorel, founder at Lokad. I am also an engineer from the Corps des Mines who initially graduated from the ENS.

I have been passionate about computer science, software matters and data mining for almost two decades.

Friday, Oct 14, 2011

Oddities of machine learning software code

Developing machine learning software is special. I have already described a bit how it feels to be in a machine learning company, but let's be more specific about the code itself.

One of the most shocking aspects of machine learning code is that it tends to be full of super-short, cryptic 1-letter or 2-letter variable names. This goes completely against the general naming conventions, which emphasize readability over brevity. Yet, over the years, I have found that those compact names were best for mathematical / statistical / numerical algorithms.

Indeed,

  • Logic is typically overly intricate, with tons of nested loops and seemingly random stopping conditions. Hence, even if the variables were perfectly readable, the logic would remain a show-stopper for any fast-reading attempt.
  • Variables typically hold intermediate computational results, which cannot be associated with 2 or 3 English words without being extremely ambiguous at best. It's not a = OnButtonClick but rather a = InterpolatedDetrendedDeseasonalizedQuantile90PercentsOfPromotionEffects.

As a result, extreme variable name brevity makes the code much more compact, which in turn makes it easier to understand the logic. It forces the coder digging into the code to learn by heart the semantics of the variables (because the names are cryptic), but this effort is only marginal compared to the effort needed to grasp the logic itself anyway.
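To make this concrete, here is a minimal sketch in Python of Holt's linear trend smoothing, a classic forecasting routine picked purely for illustration (it is not taken from any production code). The names y, a, b, l, t, p and v are hypothetical choices; each one is explained once in a comment, and the update equations then fit on a single line each.

```python
def holt(y, a, b):
    """Holt's linear trend smoothing; returns the one-step-ahead forecast.

    y: observations (at least two), a: level factor, b: trend factor,
    both factors in (0, 1].
    """
    l = y[0]          # l: current level
    t = y[1] - y[0]   # t: current trend (slope per step)
    for v in y[1:]:   # v: next observation
        p = l         # p: previous level
        l = a * v + (1 - a) * (p + t)
        t = b * (l - p) + (1 - b) * t
    return l + t
```

Spelling out every name (previous_smoothed_level, current_smoothed_trend, and so on) would stretch each update over several lines without making the recurrence itself any easier to follow.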

Then, magic numbers are all over the place, frequently inlined with the rest of the code. Again, for non-machine-learning code, magic numbers are a big NO-NO, and a cardinal rule of sane software design consists of a clear separation between data and logic. Yet, in statistical algorithms, those seemingly random numerical values are the result of the incremental tuning that is necessary to obtain the desired performance and accuracy.

  • There is no benefit in isolating the magic number, because it is used only once.
  • The actual numerical value is typically more insightful than the variable name. It helps the developer to get a sense of the behavior of the algorithm.

Still, it remains good practice to add a lot of inline comments to justify the purpose of the magic numbers, and to explain how they have been optimized.
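As a hedged illustration, the Python sketch below shows what such commented, inlined magic numbers might look like. The function damp, its arguments and the values 0.35, 0.65 and 3.0 are all made up for the example; they are not tuned on any real data.

```python
def damp(x, m):
    """Damp a raw uplift estimate x toward the group mean m (hypothetical)."""
    # 0.35 / 0.65: shrinkage weights. Illustrative values only; in real code
    # this comment would say which backtest they were tuned on and which
    # neighboring values were tried.
    z = 0.35 * x + 0.65 * m
    # 3.0: cap at three times the mean, assumed here to guard against
    # outliers in short histories.
    return min(z, 3.0 * m)
```

Each value appears exactly once, right where it acts, so the reader sees both the number and its justification without jumping to a constants file.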

If your code is super fast, you're probably getting it wrong. For most machine learning problems, it's better to try to take advantage of the outrageously large amount of processing power available nowadays to improve results. I am not saying that super fast code is bad in itself, but if your code is super fast, then it means that you've got room to go for more complex methods that would consume more resources in exchange for better, more accurate, results.

Unit tests are very handy to validate small blocks of pure mathematical operations, and yet quasi-useless for the bulk of the machine learning logic. Indeed, when it comes to statistical accuracy, there is no black and white, only shades of gray. As long as performance is acceptable, the overall accuracy is the only metric that matters. In particular, it happens, from time to time, that a bug (aka a piece of code that does not reflect the original intent of the developer) turns out to behave well over the data. On average, bugs tend to degrade accuracy, but sometimes one just stumbles upon an interesting (and counter-intuitive) behavior.
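The contrast can be sketched in a few lines of Python. The helper names (mae, last, avg, backtest) and the tiny series are hypothetical, chosen only for this example. The error metric has an exact expected value and can be unit tested; the two toy forecasters can only be compared through the score they obtain on the data.

```python
def mae(y, f):
    """Mean absolute error between observations y and forecasts f."""
    return sum(abs(u - v) for u, v in zip(y, f)) / len(y)

def last(h):
    """Forecast the next value as the last observed one."""
    return h[-1]

def avg(h):
    """Forecast the next value as the mean of the history."""
    return sum(h) / len(h)

def backtest(y, m):
    """Score model m by its one-step-ahead MAE over the series y."""
    f = [m(y[:i]) for i in range(1, len(y))]
    return mae(y[1:], f)

# Black and white: a pure mathematical block has one right answer.
assert mae([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]) == 2.0 / 3.0

# Shades of gray: neither model is "correct"; they are only ranked by score.
y = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0]  # made-up series
e_last = backtest(y, last)
e_avg = backtest(y, avg)
```

No assertion can say which of e_last or e_avg ought to be lower in general; the ranking depends entirely on the data, which is why a buggy variant can occasionally score better than the intended one.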

Finally, Object-Oriented Programming is still around, but seldom used; Functional Programming is king. This pattern reflects the fact that the machine learning problem itself, either classification or regression, is nothing but trying to build a big complex function to tackle real-world data.
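A minimal Python sketch of this functional style follows. The helpers compose, scale, shift and clip are hypothetical names invented for the example, and the coefficients are arbitrary; the point is only that a model ends up being a function built out of smaller functions.

```python
from functools import reduce

def compose(*fs):
    """Compose functions left to right: compose(f, g)(x) == g(f(x))."""
    return lambda x: reduce(lambda v, f: f(v), fs, x)

def scale(k):
    """Return a function multiplying its input by k."""
    return lambda x: k * x

def shift(c):
    """Return a function adding c to its input."""
    return lambda x: x + c

def clip(lo, hi):
    """Return a function bounding its input to the range [lo, hi]."""
    return lambda x: max(lo, min(hi, x))

# A toy "regression model": scale, shift, then bound the output.
model = compose(scale(2.0), shift(1.0), clip(0.0, 10.0))
```

Calling model(3.0) applies the three steps in order. Swapping, adding or removing a step changes the model without touching any class hierarchy.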