Author

I am Joannes Vermorel, founder at Lokad. I am also an engineer from the Corps des Mines who initially graduated from the ENS.

I have been passionate about computer science, software matters and data mining for almost two decades.

Friday, May 08, 2015

Nearly all web APIs get paging wrong

Data paging, that is, the retrieval of a large amount of data through a series of smaller data retrievals, is a non-trivial problem. Through Lokad, we have implemented about a dozen extensive API integrations, and reviewed a few dozen other APIs as well.

The conclusion is that as soon as paging is involved, nearly all web APIs get it wrong. Obviously, rock-solid APIs like the ones offered by Azure or AWS get it right, but those outstanding APIs are the exception rather than the norm.

The obvious pattern that doesn't work

I have lost count of the APIs that propose the following broken pattern to page through their data, a purchase order history for example:

https://example.com/api/purchaseorders?page=2&pagesize=25

Where page is the page number and pagesize the number of orders to be retrieved. This pattern is fundamentally unsafe. Any order deleted while the enumeration is in progress will shift the indices, which, in turn, is likely to cause another order to be skipped.
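
To make the failure concrete, here is a minimal runnable sketch in C# (an in-memory list stands in for the server, and all the names are hypothetical) where a single deletion causes an order to be skipped:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class PageSizeSkipsOrders
    {
        // The server exposes page/pagesize over a mutable list of order ids.
        static List<int> GetPage(List<int> source, int page, int pageSize) =>
            source.Skip((page - 1) * pageSize).Take(pageSize).ToList();

        static void Main()
        {
            var orders = Enumerable.Range(1, 100).ToList(); // orders 1..100
            var retrieved = new List<int>();

            retrieved.AddRange(GetPage(orders, page: 1, pageSize: 25)); // 1..25

            orders.Remove(10); // concurrent write: order 10 gets deleted

            retrieved.AddRange(GetPage(orders, page: 2, pageSize: 25)); // 27..51

            // Order 26 shifted into page 1 after the deletion and is never seen.
            Console.WriteLine(retrieved.Contains(26)); // False
        }
    }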

There are many variants of the pattern, and every time the problem boils down to the same thing: the "obvious" paging pattern leads to a flawed implementation that fails whenever concurrent writes are involved.

The "updated_after" filter doesn't work either

Another popular approach for paging is to leverage a filter on the update timestamp of the elements to be retrieved, that is:

https://example.com/api/purchaseorders?updated_after=2015-04-29

Then, in order to page the request, the client is supposed to take the most recent updated_at value from the response and to feed this value back to the API to further enumerate the elements.

However, this approach does not (really) work either. Indeed, what if all the elements have been updated at once? This can happen because of a system upgrade, or because of any kind of bulk operation. Even if the timestamp can be narrowed down to the microsecond, if there are 10,000 elements to be served, all having the exact same update timestamp, then the API will keep sending a response where max(updated_at) is equal to the request timestamp.

The client is not enumerating anymore, the pattern has failed.
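
Here is a minimal runnable sketch of this failure mode in C# (an in-memory list stands in for the API, all names hypothetical), with the server applying an inclusive updated_after filter:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class UpdatedAfterStalls
    {
        record Order(int Id, DateTime UpdatedAt);

        static void Main()
        {
            // A bulk operation stamped all 10,000 orders with the same timestamp.
            var stamp = new DateTime(2015, 4, 29);
            var server = Enumerable.Range(1, 10_000)
                .Select(id => new Order(id, stamp)).ToList();

            // Server side: inclusive filter, at most 100 elements per response.
            List<Order> GetOrders(DateTime updatedAfter) =>
                server.Where(o => o.UpdatedAt >= updatedAfter).Take(100).ToList();

            // Client side: feed max(updated_at) back into the next request.
            var cursor = DateTime.MinValue;
            var seen = new HashSet<int>();
            for (var i = 0; i < 5; i++) // a real client would loop forever here
            {
                var batch = GetOrders(cursor);
                foreach (var o in batch) seen.Add(o.Id);
                cursor = batch.Max(o => o.UpdatedAt); // stuck at 'stamp'
            }

            // The cursor never advances: only the first 100 orders are ever seen.
            // (An exclusive '>' filter fails differently: the remaining 9,900
            // orders are silently skipped instead.)
            Console.WriteLine(seen.Count); // 100
        }
    }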

Sure, it's possible to tweak the timestamps to make sure that all the elements spread gracefully over distinct values, but it's a very non-trivial property to enforce. Indeed, a datetime column isn't really supposed to carry a uniqueness constraint in your database. It's feasible, but odd and error-prone.

The fallacy of the "power" APIs

Some APIs provide powerful filtering and sorting mechanisms. Through those mechanisms, it is possible to implement paging correctly, for example by combining two filters: one on the update datetime of the items and one on the item identifier. A correct implementation is far from trivial, however.
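
For the record, here is a sketch of what such a correct implementation looks like - the classic "keyset" paging over the compound key (updated_at, id), simulated in memory with hypothetical names:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class KeysetPaging
    {
        record Order(long Id, DateTime UpdatedAt);

        static void Main()
        {
            // 1,000 orders, all sharing the same update timestamp.
            var stamp = new DateTime(2015, 4, 29);
            var server = Enumerable.Range(1, 1000)
                .Select(i => new Order(i, stamp)).ToList();

            // Server side: the generic filtering + sorting that a "power" API offers.
            List<Order> Query(DateTime lastTime, long lastId, int limit) =>
                server.Where(o => o.UpdatedAt > lastTime
                              || (o.UpdatedAt == lastTime && o.Id > lastId))
                      .OrderBy(o => o.UpdatedAt).ThenBy(o => o.Id)
                      .Take(limit).ToList();

            // Client side: page on the compound key, not on the timestamp alone.
            var (cursorTime, cursorId) = (DateTime.MinValue, 0L);
            var count = 0;
            while (true)
            {
                var batch = Query(cursorTime, cursorId, 100);
                if (batch.Count == 0) break;
                count += batch.Count;
                var last = batch[^1];
                (cursorTime, cursorId) = (last.UpdatedAt, last.Id);
            }
            Console.WriteLine(count); // 1000, despite the identical timestamps
        }
    }

Every client has to rediscover this compound-cursor logic and implement it without mistake, which is exactly the trap discussed below.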

Merely offering the possibility to do the right thing is not sufficient: doing the right thing should be the only possibility. This is something that Lokad learned the hard way early on: web APIs should offer one, and only one, way to do each intended operation.

If the API offers a paging mechanism, but the only way to implement paging correctly is not to use it, then rest assured that the vast majority of client implementations will get it wrong. From a design viewpoint, it's like baiting developers into a trap.

The "continuation token" as the only pattern that works

To my knowledge, there is only one pattern that works for paging: the continuation token pattern.

https://example.com/api/purchaseorders?continue=token

Where every request to a paged resource, like the purchase orders, may return a continuation token on top of the elements themselves, whenever not all the elements could be returned in one batch.

On top of being correct, that pattern has two key advantages:

  • It's very hard to get it wrong on the client side. There is only one thing to do with the continuation token: feed it back to the API.
  • The API is not committed to returning any specific number of elements (in practice, a high upper bound can still be documented). Then, if some elements are particularly heavy, or if the server is already under a heavy workload, smaller chunks can be returned.

This enumeration should not provide any guarantee that the same element won't be enumerated more than once. The only guarantee that should be provided by paging through tokens is that, ultimately, all elements will be enumerated at least once. Indeed, you don't want to end up with tokens that embed some kind of state on the API side; in order to keep things smooth and stateless, it's important to lift this constraint.
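
Below is a minimal runnable sketch of the pattern (hypothetical response shape, in-memory server). The token opaquely encodes the last identifier served, which keeps the API side stateless, and the client handles the at-least-once semantics with an idempotent upsert:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class ContinuationTokens
    {
        // Hypothetical response shape: elements plus an optional token.
        record Page(List<int> Orders, string ContinuationToken);

        static void Main()
        {
            var server = Enumerable.Range(1, 260).ToList(); // order ids

            // Server side: stateless; the token encodes the last id served.
            Page GetOrders(string token)
            {
                var lastId = token == null ? 0 : int.Parse(token);
                var batch = server.Where(id => id > lastId)
                                  .OrderBy(id => id).Take(100).ToList();
                var next = batch.Count == 100 ? batch[^1].ToString() : null;
                return new Page(batch, next);
            }

            // Client side: the only thing to do with a token is to feed it back.
            var seen = new HashSet<int>(); // idempotent upsert: duplicates are fine
            string continuation = null;
            do
            {
                var page = GetOrders(continuation);
                foreach (var id in page.Orders) seen.Add(id);
                continuation = page.ContinuationToken;
            } while (continuation != null);

            Console.WriteLine(seen.Count); // 260: everything enumerated at least once
        }
    }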

Then, continuation tokens should not expire. This property is important in order to give the client the possibility to perform incremental updates on a daily, weekly or even monthly schedule, depending on what makes sense from a business viewpoint.

No concurrency but data partitions

The continuation token does not support concurrent data retrieval: a response has to be retrieved before the next request can be posted. Thus, in theory, this pattern somewhat limits the amount of data that can be retrieved.

Well, it's somewhat true, and yet mostly irrelevant. First, Big (Business) Data is exceedingly rare in practice, as the transaction data of even the largest companies tends to fit on a USB key. For all the APIs that we have integrated, putting aside the cloud APIs (i.e. Azure or AWS), there was not a single integration where the amount of data was even close to justifying concurrent data accesses. Slow data retrieval is merely a sign of non-incremental data retrieval.

Second, if the data is so large that concurrency is required, then partitioning the data is typically a much better approach. For example, if you wish to retrieve all the data from a large retail network, then the data can be partitioned per store. Partitioning makes things easier both on the API side and on the client side.
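
A sketch of what this looks like (hypothetical per-store endpoint, simulated in memory): one sequential continuation-token loop per store, with distinct stores retrieved concurrently:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Threading.Tasks;

    class PartitionedRetrieval
    {
        record Page(List<int> Orders, string ContinuationToken);

        // Hypothetical per-store endpoint serving 250 orders in chunks of 100.
        static Task<Page> FetchStoreOrdersAsync(string store, string token)
        {
            var offset = token == null ? 0 : int.Parse(token);
            var orders = Enumerable.Range(offset, Math.Min(100, 250 - offset)).ToList();
            var next = offset + orders.Count < 250
                ? (offset + orders.Count).ToString() : null;
            return Task.FromResult(new Page(orders, next));
        }

        static async Task Main()
        {
            var stores = new[] { "store-001", "store-002", "store-003" };

            // Each store is enumerated sequentially; the stores run in parallel.
            await Task.WhenAll(stores.Select(async store =>
            {
                string token = null;
                do
                {
                    var page = await FetchStoreOrdersAsync(store, token);
                    Console.WriteLine($"{store}: {page.Orders.Count} orders");
                    token = page.ContinuationToken;
                } while (token != null);
            }));
        }
    }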

Wednesday, Mar 04, 2015

Buying software? You should ignore references

Being a (small) software entrepreneur, it is still amazing to witness how hell breaks loose when certain large software vendors start deploying their “solution”. Even more fascinating is that, after causing massive damage, the vendor just signs another massive deal with another large company, and hell breaks loose again. Repeat this one hundred times, and you witness a world-wide verticalized software leader crippling an entire industry with half-baked technology.

Any resemblance between the characters in this post and any real retail company is purely coincidental.

I already pointed out that Requests For Quotes (RFQ) were a recipe for disaster, but RFQs alone do not explain the scale of the mess. As I become more and more familiar with selling to large companies, I now tend to think that one heavyweight driver behind these epic failures is a banal flaw of the human mind: we massively overvalue other people’s opinion on a particular subject instead of relying on our own judgment.

In B2B software, a reference is usually a person who works in a company similar to the one you are trying to sell to, and who, when called by your prospects, conveys exceptionally positive feelings about you and extremely vague information about your solution. Having tested this approach myself, I can say that the results are highly impressive: the reference call is an incredibly efficient sales method. Thus, it is pretty safe to assume that any sufficiently large B2B software vendor is acutely aware of this pattern as well.

At this point, for the vendor, it becomes extremely tempting not to merely stumble upon happy customers who happen to be willing to act as referees, but to manufacture these references directly, or even to fake them if that’s what it takes. How hard could this be? It turns out that it’s not hard at all.

As a first-hand witness, I have observed that there are two main paths to manufacturing such references, which I would refer to as the non-subtle path and the subtle path. My observations indicate that both options are routinely leveraged by most B2B software vendors once they reach a certain size.

The non-subtle path is, well, not subtle: you just pay. Don’t get me wrong, there is no bribery involved or anything that would be against the law. Your “reference” company gets paid through a massive discount on its own setup fee, and is under a strict agreement to play its part in acting as a reference later on. Naturally, it is difficult to include this in the official contract, but it turns out that you don’t need to. Once a verbal agreement is reached, most business executives stick to the spirit of the agreement, even if they are not bound by a written contract to do so. Some vendors go even a step further by directly offering a large referral fee to their flagship references.

The subtle path takes another angle: you overinvest in order to make your “reference” client happy. Indeed, usually, even the worst flaws of enterprise software can be fixed given unreasonable efforts, that is, efforts that go well beyond the budget of your client. As a vendor, you still have the option to pick a few clients where you decide to overinvest and make sure that they are genuinely happy. When the time comes and a reference has to be provided, the reference is naturally chosen among those “happy few” clients who benefit from an outstanding service.

While one can be tempted to argue that the subtle path is morally superior to the non-subtle path, I would argue that they are both equally deceptive, because the prospect gets a highly distorted view of the service actually provided by the vendor. The subtle path has the benefit of not being a soul-crushing experience for the vendor’s staff, but many people accommodate themselves to the non-subtle path as well.

If you happen to be in a position of buying enterprise software, you should treat all such hand-picked references with downright mistrust. While it is counter-intuitive, the rational option is to refuse any discussion with these references, as they are likely to distort your imperfect (but so far unbiased) perception of the product to be acquired.

Refusing calls with references? Insanity, most will say. Let’s step back for one second, and have a look at what can be considered the “gold standard” [1] of rational assessment: the paper selection process of international scientific publications. The introduction of blind, and now double-blind, peer reviews was precisely motivated by the need to fight the very same kind of mundane human flaws. Nowadays, if a research team were to try to get a paper published on the grounds that they have buddies who think that their work is “cool”, the scientific community would laugh at them, and rightly so. Only the cold examination of the work itself by peers stands its ground.

And that is what references are: they are buddies of the vendor.

In addition, there is another problem with references that is very specific to the software industry: time is of the essence. References are a reflection of the past, and by definition, when looking at the past, you are almost certain to miss recent innovations. However, software is an incredibly fast-paced industry. Since I first launched Lokad, the software business for commerce has been disrupted by three major tech waves: cloud computing, multichannel commerce and mobile commerce; and that is not even counting “minor” waves like Big Data. Buying software is like buying a map: you don’t want an outdated version.

Software that is used to run large companies is typically between one and two decades behind what would be considered “state of the art”. Thus, even a vendor selling technology that is one decade behind the rest of the market can still manage to be perceived as an “upgrade” by players who were two decades behind the market. It is a fallacy to believe that because the situation improved somewhat, the move to purchase a particular piece of software was a good one. The opportunity to get up to speed with the market has been wasted, and the company remains uncompetitive.

No matter which approach is adopted by the vendor to obtain its references, one thing is certain: it takes a tremendous amount of time to obtain references, typically years. Thus, by the time references are obtained, chances are high that the technology assessed by the referee has become outdated. At Lokad, it happened to us twice: by the time we obtained references for our “classic” forecasting technology, we had already released our “quantile” forecasting technology, and our former “classic” forecasting software was already history. And three years later, history repeated itself as we released “quantile grids” forecasting, which is vastly superior to our former “quantiles”. If companies were buying iPhones based on customer references, they would just be starting to buy the iPhone 1 now, not trusting the iPhone 2 yet because it would still lack customer references; and it would be unimaginable to even consider all the different versions from the iPhone 3 to the iPhone 6 that have not yet been time-tested.

The need for references emerges because the software buyer is vulnerable and insecure, and rightly so, as epic failures are extremely frequent when buying enterprise software. While the need to obtain security during the buying process is real, references, as we have seen, are a recipe for major failures.

A much better approach is to carry out a thorough examination of the solution being proposed, and yes, this usually means becoming a bit of an expert in this field in order to perform an in-depth assessment of the solution being presented by the vendor. Don’t delegate your judgment to people you have no reason to trust in the first place.

[1] The scientific community is not devoid of flaws; it is still a large bunch of humans, after all. Peer reviewing is a work in progress: publication protocols are still being improved, always seeking to uphold higher standards of rationality.

Monday, Feb 16, 2015

Super-fast flat file parsing in C# and Java with a perfect hash function

At Lokad, (almost) all we do is crunch flat text files. It's not that we haven't tried anything else - we did, many times - and it went poorly. Flat files are ubiquitous, well understood, and they yield very good performance, both on the write side and the read side, when working under tight budgets.

Keep in mind that the files we crunch are frequently generated by our clients, so while ProtoBuf or Cap'n Proto are very cool, asking our clients to deliver such formats would be roughly equivalent to asking them to reimplement their in-house Java ERP in Haskell. To preserve the sanity of our clients, we keep it simple and we stick to flat files.

However, we have decided to make flat file reads fast, really fast. Thus, one of us decided to tackle the challenge head-on, and came up with a very nice pattern: file parsing starts with a perfect hash function preprocessing. Simply put, the flat file gets tokenized, and then each token gets replaced by an integer uniquely identifying this piece of string. Not only does this save a tremendous amount of string object instantiation, but afterward, all the complex parsing operations, such as parsing a date, can be performed only once, even if the token is encountered hundreds of times in the file. Performance-wise, it works because flat files tend to be very denormalized and very redundant.
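
The actual Lokad.FlatFiles internals are more involved (and a plain dictionary stands in for the perfect hash function below), but here is a minimal sketch of the idea, assuming a small comma-separated input:

    using System;
    using System.Collections.Generic;

    class TokenInterning
    {
        static void Main()
        {
            // Flat files are redundant: the same dates and references repeat.
            var lines = new[]
            {
                "PO-1,2015-02-16,WIDGET",
                "PO-2,2015-02-16,WIDGET",
                "PO-3,2015-02-16,GADGET",
            };

            // Interning: each distinct token gets a small integer identifier.
            var tokenIds = new Dictionary<string, int>();
            var tokens = new List<string>();
            int Intern(string token)
            {
                if (!tokenIds.TryGetValue(token, out var id))
                {
                    id = tokens.Count;
                    tokens.Add(token);
                    tokenIds.Add(token, id);
                }
                return id;
            }

            // Pass 1: tokenize, replacing every cell by its integer identifier.
            var rows = new List<int[]>();
            foreach (var line in lines)
            {
                var cells = line.Split(',');
                var row = new int[cells.Length];
                for (var i = 0; i < cells.Length; i++) row[i] = Intern(cells[i]);
                rows.Add(row);
            }

            // Pass 2: costly conversions run once per *distinct* token only.
            var dateByToken = new Dictionary<int, DateTime>();
            foreach (var row in rows)
                if (!dateByToken.ContainsKey(row[1]))
                    dateByToken[row[1]] = DateTime.Parse(tokens[row[1]]);

            // Three rows, but "2015-02-16" was parsed exactly once.
            Console.WriteLine(dateByToken.Count); // 1
        }
    }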

We have released a tiny open source package codenamed Lokad.FlatFiles for C#/.NET (and a Java version too) under the MIT license. This library takes care of generating the perfect hashes out of a flat file. Our (unfair) benchmarks indicate that we typically reach about 30MB/second on a single CPU. Then, when the subsequent parsing operations take advantage of the token hashing, the speed-up is so massive that this initial perfect hashing tends to completely dominate the total CPU cost - so we stay at roughly 30MB/second.

Monday, Dec 15, 2014

A few lessons about pricing B2B apps

My own SaaS company has always struggled with its own pricing. For a company now selling pricing optimization technology for commerce, this is a bit ironic. Well, pricing software is unfortunately very unlike pricing goods in a store, and the experience we acquired helping our retail clients improve their own prices provided little insight into the pricing of Lokad.

Since the creation of the company, Lokad has been offering metered pricing, charging according to the number of forecasts consumed. However, over the last two years, we signed only a handful of contracts where the pay-as-you-go pricing was actually preserved. In practice, the usage observed during the trial period was used as the starting point of the negotiation; and then the negotiation invariably converged toward a flat monthly fee.

Starting from today, we have extensively revised the pricing of Lokad toward a very simple list of packages, differentiated only by the maximal size of the client companies.

For SaaS companies selling to businesses, the (almost) ubiquitous pricing pattern consists of charging per user; that's the approach of Salesforce, Google Apps, Office 365, Zoho and many more. However, sometimes charging per user doesn't make sense, because the number of users can be made arbitrarily low and does not reflect at all the usage of the service. All cloud computing platforms fall into this category.

Metered pricing only works with Über-geek clients

The cloud computing example is misleading because it gives the false impression that metered pricing is just fine. Metered pricing works for cloud computing platforms because their clients are very technical and can digest pricing logic 100x more complex than what "non-tech" businesses would accept.

At Lokad, we have observed many times that the fear of making a mistake and increasing the invoice tenfold was generally considered a deal-breaker. Most companies don't trust their employees nearly as much as software companies trust their software developers. Metered pricing puts an implicitly high level of trust in the employees operating the metered service.

Flat monthly / quarterly / yearly fees are the way to go

Through dozens of negotiations with clients, some large, some small, and across many countries, we have always converged toward periodic fees to be paid every month, quarter or year. Sometimes, we did add a setup fee to reflect some extra effort to be delivered by Lokad to set up the solution, but in 7 years of business, we had only a handful of contracts more complex than a flat setup fee followed by a flat periodic fee.

The lesson here is that anything more complex than a setup fee plus a periodic fee is very prone to accidental complexity, providing little or no business value for the software company or its client.

Don't cripple your software by restricting access to features

The "freemium" vision consists of offering a free version with limited features, and restricting the access to the more advance features to paying clients. Again, if you consider a software where it's natural to charge per user this approach might work; however, when the software is not user-driven, not granting access to all features just drags down your small clients - who have mostly the same needs than your bigger clients.

We learned that crippling our own apps was just bad. At the end of most negotiations, we nearly always ended up granting access to all features - like the highest paying plan - for most companies. Naturally, the price point was adjusted accordingly, but nevertheless, we observed many times that crippling the software was just a lose-lose approach.

It's fine to trust your clients by default

For years, at Lokad, we had relied on the implicit assumption that whatever metric was going to be used to define the boundaries between the subscription plans, this metric had to be tracked by the software itself. However, by narrowing our vision to the sole metrics that our software could track, we had eliminated the one metric which truly made sense: charging according to the company's turnover.

Our new plans are differentiated based on turnover, and yet we have no automated way to measure turnover. However, is it really a problem? I don't think so. Over the years, we have seen very (very) few companies try to game our terms. Moreover, my observations indicate that the larger the company, the less likely it is to even consider the possibility of cheating.

The logical conclusion is then to grant access to everything by default, and to gently remind companies of your pricing terms when the opportunity arises. B2B isn't B2C: for the vast majority of B2B software, even if you don't put any protection in place, the service isn't going to be swarmed by corporate freeloaders.

If it does, well, that's a rich man's problem.

Tuesday, Oct 21, 2014

How we ended up writing our own programming language

About one year ago, my company had the opportunity to expand into an area which was very new for us at the time: pricing optimization for commerce. Pricing optimization is quite different from demand forecasting, the latter being the original focus of Lokad at the beginning of the company’s existence. While demand forecasting fits rather nicely into quantitative frameworks that allow you to decide which forecasting methods are the most suitable for any given task, pricing is a much more evasive problem. The idea that profits can be maximized by carrying out a simple analysis of demand elasticity is deceptive. Indeed, pricing is a signal sent to the market; and as with any marketing ingredient, there is no single valid answer to the problem.

One year ago, most of the companies that helped merchants manage their pricing were consulting companies, but I wanted to build a pricing webapp that would go beyond the classic consulting services on pricing. I quickly ruled out the idea of offering a template list of “pricing recipes”. Some competitors were already offering such “pricing recipe” services, and they were desperately inflexible. Merchants needed to be able to “tweak” their pricing logic in many ways. Thus, I started to consider more elaborate user interfaces that would allow merchants to compose their own pricing strategies. After putting some effort into mockups, I ended up with something oddly similar to the Microsoft Access “visual” query designer.

This was not a good thing. My limited interactions with this query designer, a decade prior, had left me with the lasting impression of it being just about the worst user experience I had ever had with the “normal” behavior of a product released by Microsoft. It was supposedly a visual query editor with plenty of very visual buttons, but unless you had some knowledge of SQL or experience in programming, you weren’t going very far with this tool. In the end, anyone using Access fell back on the non-visual query editor, which, quite unfortunately, was a second-class citizen.

Gradually, I came to consider the possibility of going for a programming language instead. With a programming language, we could provide the desirable expressiveness, but also a powerful data environment. Lokad would be hosting the data along with offering a cloud-based execution environment for the pricing logic. This environment would not be constrained by the client’s desktop setup which can be too old, too weak, too messy or downright corrupted.

At first, I considered reusing an existing programming language such as JavaScript or Python. However, this presented two very specific challenges. The first challenge was security. Running client code server-side seemed like a giant vector for entire classes of injection attacks. In theory, it should be possible to sandbox the execution of any program, but my instincts were telling me that the attack surface was so great that we would never be confident about not having leaks in our sandbox. So, we would have to leverage disposable VMs for every execution, and it seemed that an endless stream of technical problems was heading our way if we were to implement this.

The second, and in fact bigger, problem is that JavaScript and Python, being full-fledged programming languages, are also complex languages, which include truckloads of features downright irrelevant to the pricing use cases I was considering: loops, objects, exceptions, nil references. No matter how much we tried to steer the usage of our future product away from these elements, I felt that they would invariably resurface again and again, because some of our future users would be familiar with just these languages, and as a result, they would do things the way they were used to doing them before. It is tough to debug generic source code, so the tooling would necessarily end up being complex as well.

This left me with the prospect of inventing a new programming language, and yet this idea was accompanied by all the red flags possible in my mind. Usually, for a software company, inventing its own programming language is a terrible idea. Actually, I had witnessed quite closely three companies who had rolled out their own respective programming languages, and for each one of these companies, the experience was very poor, to say the least. Two of them managed to achieve a certain level of success nonetheless, but the ad hoc nature of the language had been a huge hindrance in the process. Moreover, about every single experience I ever had with niche programming languages (hello X++) confirmed that an ad hoc language was a terrible idea.

However, as far as pricing and commerce were concerned, a generic programming language was not required, and we were hopeful that through extreme specialization, we could produce a highly specialized language that would, all being well, compare favorably with the mainstream languages, at least for the narrow scope of commerce.

And thus, the Envision programming language was born at Lokad.

Unlike a generic programming language, Envision is designed around a relatively rigid and relatively simple data model that we knew worked well for commerce. My intent was to be able to reproduce all the domain-specific calculations that nearly all merchants were doing in Microsoft Excel, while putting some distance between the logic and the data - but not too much distance either. Indeed, Envision, as the name suggests, is also heavily geared toward data visualization; again, not generic data visualization, but the things that matter for commerce.

Envision has no loops, no branches, no nulls, no exceptions, no objects … and it does just fine without them. It is not Turing-complete either, so we do not end up with indefinite execution delays.

Less than one year after writing the first line of code for the Envision compiler, we have now secured over a million Euros through multi-year contracts that are to be (almost) completely implemented in Envision. To be transparent, it is not the language that clients are buying, but the service to be built with it. Nevertheless, over the last couple of months, we have been able to deliver all kinds of quantitative optimizations with Envision - not just pricing, actually - and within timeframes that we would never have achieved in the past.

There are two takeaway lessons to be learned from this initiative. First, ultra-specialized languages are still a valid option for vertical niches. While it is a very rough estimate, I would say that with Envision, when dealing with a suitable challenge, we end up with about 50 times fewer lines of code than when the same logic is implemented in C#, the primary programming language used at Lokad. Yes, using a functional language like F# would already make the code more compact than C#, but it would still be far from being that compact. Also, with Envision, we get more concise code not because we leverage highly abstract operators, but merely because the language itself is geared towards the exact problem to be addressed.

Second, when introducing a programming language, it should not be half-baked. Not only does the language itself need the good ingredients known to computer science - a context-free grammar written in Backus-Naur form, for example - but a good integrated programming environment with code auto-completion and meaningful error messages is also needed. The language is only the first ingredient; it is the tooling around the language that makes the difference in terms of productivity. From the very beginning, we invested in the environment as much as we invested in the programming language itself.

Also, having a webapp as the primary execution environment for your language opens up a lot of new possibilities. For example, unless you spend years polishing your syntax before releasing anything, you are bound to make design mistakes. We certainly did. But since all Envision scripts were also hosted by Lokad, it was possible for us to rectify those mistakes, first by fixing the language, and second by upgrading all the impacted scripts across the entire user base. Sure, it takes time, but better to spend a few days on this early on than to end up with a broken language forever.

I have not delved much into the details of the Envision language itself in this post, but in case you are interested, I have just published a book about it. The preview gives you access to the entire book too.

PS: while I credit myself for initiating the Envision project at Lokad, it is actually a colleague of mine, Victor Nicollet, presently the CTO of Lokad, who came up with nearly all the good ideas for the design of this language and who carried out about 90% of the implementation effort.