<?xml version="1.0" encoding="UTF-8"?>
<!--Generated by Squarespace V5 Site Server v5.13.159 (http://www.squarespace.com) on Fri, 24 May 2013 07:40:21 GMT--><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0"><channel><title>Joannes Vermorel's blog</title><link>http://vermorel.com/journal/</link><description>Cloud computing, machine learning and startups</description><lastBuildDate>Mon, 29 Apr 2013 15:08:55 +0000</lastBuildDate><copyright></copyright><language>en-US</language><generator>Squarespace V5 Site Server v5.13.159 (http://www.squarespace.com)</generator><item><title>Big Data: choosing the problem before choosing the solution</title><category>Software Business</category><category>Software Engineering</category><category>bigdata</category><dc:creator>Joannes Vermorel</dc:creator><pubDate>Wed, 03 Oct 2012 10:17:31 +0000</pubDate><link>http://vermorel.com/journal/2012/10/3/big-data-choosing-the-problem-before-choosing-the-solution.html</link><guid isPermaLink="false">450038:5042156:29617638</guid><description><![CDATA[<p><em>My <a href="http://www.lokad.com/">company</a> has started several important <a href="http://www.lokad.com/big-data-consulting.ashx">big data</a> missions, and I am taking the opportunity here to publish some insights that are relevant to all those initiatives.</em></p>
<p>A major (and frequent) pitfall of Big Data projects consists of <strong>starting with a solution</strong>&nbsp;instead of starting with a <em>problem</em>. In particular, software vendors (Lokad included) are pushing their own Big Data recipe which will randomly involve:</p>
<ul>
<li>Hadoop</li>
<li>SAP HANA</li>
<li>HBase</li>
<li>Amazon EC2</li>
<li>Cassandra</li>
<li>Windows Azure</li>
<li>Storm</li>
<li>Node.js</li>
<li>...</li>
</ul>
<p>However, the notion of "Big" data is very relative: <strong>cheap 1TB hard-drives are now available</strong> at your nearest supermarket, and very <em>very </em>few problems faced by companies, even <em>very&nbsp;</em>large ones, require more than 100 GB of data to process.&nbsp;</p>
<p>Usually, even the largest data sources of the largest companies do <a href="http://blog.lokad.com/journal/2012/6/7/running-a-very-large-retail-network-on-a-smartphone.html">fit on a smartphone</a>&nbsp;when properly represented.&nbsp;</p>
<h3>Impedance mismatch of BIG frameworks</h3>
<p>The performance achieved by well-known Big Data frameworks is mind-blowing: Facebook claims to process <a href="http://en.wikipedia.org/wiki/Apache_Hadoop#Yahoo.21">100PB of data</a> over Hadoop. That's <strong>massive</strong>, and massively impressive as well.</p>
<p>However, before jumping on Hadoop (or any similar Big Data framework), one has to really estimate the friction costs involved. While Hadoop is certainly simpler than, say, <a href="http://en.wikipedia.org/wiki/Message_Passing_Interface">MPI</a>, it's still a complicated distributed framework which requires a lot of skill to be properly and efficiently operated.</p>
<p>If the very same goal can be achieved on a single machine within a very acceptable timeframe, then, in my experience, the <em>dumb</em>&nbsp;solution is going to be about 100x cheaper (*) and easier to run and to maintain compared to the "distributed" variant.</p>
<p><em>(*) I am not referring to hardware costs, but to wetware costs (aka people) which represent 99% of the cost anyway for virtually every company, minus a few social networks and search engines.</em></p>
<p>The&nbsp;<em>untold</em>&nbsp;story about Hadoop (and its peers) is that it works only <em>if,</em>&nbsp;<em>and only if,</em>&nbsp;the data is very meticulously organized to be made suitable for processing through the framework. If the data is incorrectly partitioned, then Hadoop plus thousands of servers are no faster than a single machine.</p>
<h3>Enterprise Big Data starts at 100MB</h3>
<p>Facebook is facing Petabytes of data, that's millions of Gigabytes, but is your company really facing that much data? Do you need to plug that much data <em>in</em>&nbsp;to solve the problem at hand?&nbsp;Unless you work for a short list of about 100 companies on Earth, I seriously doubt it.</p>
<p>I observe that for most enterprises, "Big Data" starts at 100MB when:</p>
<ul>
<li>Excel is no longer a solution.</li>
<li>SQL is no longer a solution (*)</li>
</ul>
<p><em>(*) Yes, you can have a lot more than 100MB in a SQL database.&nbsp;However, reading the entire dataset through SQL needs to be done with care to avoid re-scanning the data thousands of times. In practice, in 90% of the data crunching situations, I observe that it's easier to remove&nbsp;the SQL database than to improve&nbsp;the performance of the queries over the relational database.</em></p>
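<p><em>To illustrate the "scan once" idea from the note above, here is a minimal sketch, assuming the data has been exported as a flat CSV file. The column names </em>product<em> and </em>quantity<em>, and the function name, are hypothetical and only serve the illustration.</em></p>

```python
import csv
import io
from collections import defaultdict


def total_sales_per_product(lines):
    """Aggregate quantities per product in one sequential pass over the rows."""
    totals = defaultdict(float)
    for row in csv.DictReader(lines):
        # Each row is read exactly once; no re-scan of the dataset is needed.
        totals[row["product"]] += float(row["quantity"])
    return dict(totals)


# Hypothetical sample; in practice `lines` would be an open file handle.
sample = io.StringIO("product,quantity\nA,2\nB,1\nA,3\n")
print(total_sales_per_product(sample))
```

<p><em>The point of the sketch is only that the whole dataset is traversed a single time, whatever the number of products.</em></p>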
<h3>Facing the problems</h3>
<p>Thus, whenever data is involved, the initiative should start by facing the problems that are the true roadblocks to delivering a "solution". Those problems are typically:</p>
<ul>
<li><strong>Collecting and servicing the data:</strong>&nbsp;About every single company I visit has problems collecting and servicing the data. The most obvious symptom is typically the&nbsp;<em>lack of documentation </em>concerning the data itself, and all the nitty-gritty insights needed to make anything of it. No technology is going to solve <em>that</em>&nbsp;problem, only people and process.</li>
<li><strong>Choosing the metrics to be optimized</strong>:<strong>&nbsp;</strong>There are so many parts of the business that could be improved through a smart exploitation of the data that it is extremely tempting to think that some (hyped) technology might be THE answer to everything. <em>This</em> is not going to happen. Solving a problem through data is tough, and without&nbsp;<em>metrics</em>, you don't even know for sure that you're moving in the right direction. Frequently, defining the <em>metric</em>&nbsp;- that is the problem to be solved - is <em>harder</em> than implementing the solution.&nbsp;</li>
</ul>
<p>Thus, before jumping to the next cool vendor solution, I urge you to start by facing the very <em>uncool</em>&nbsp;aspects of the problem. Frequently, the "solution" consists of <a href="http://en.wikipedia.org/wiki/Anti-pattern">removing an ingredient</a> of the previous solution.</p>]]></description><wfw:commentRss>http://vermorel.com/journal/rss-comments-entry-29617638.xml</wfw:commentRss></item><item><title>A few tips for Big Data projects</title><category>Software Business</category><category>Software Engineering</category><category>bigdata</category><dc:creator>Joannes Vermorel</dc:creator><pubDate>Mon, 25 Jun 2012 10:16:23 +0000</pubDate><link>http://vermorel.com/journal/2012/6/25/a-few-tips-for-big-data-projects.html</link><guid isPermaLink="false">450038:5042156:16981380</guid><description><![CDATA[<p><img style="float: left; margin-right: 10px;" src="http://vermorel.com/storage/floppy-disk.png" alt="Floppy disk illustration" />At Lokad, we are routinely working on&nbsp;<a href="http://www.lokad.com/big-data-consulting.ashx">Big Data projects</a>, primarily for retail, but with occasional missions in energy or biotech companies. Big Data is probably going to remain one of the big buzzwords of 2012, along with a <strong>big trail of failed projects</strong>. A while ago, I was offering&nbsp;<a href="http://vermorel.com/journal/2010/12/22/a-few-tips-for-web-api-design.html">tips for Web API design</a>; today, let's cover some&nbsp;<strong>Big Data</strong>&nbsp;lessons (learned the hard way, as always).</p>
<h3>1. Small Data trump Big Data</h3>
<p>There is one area that captures most of the community interest: <strong>web data</strong> (pages, clicks, images). Yet, the web-scale, where you have to deal with petabytes of data, is <strong>completely unlike 99% of the real-world problems</strong> faced by about every other vertical besides&nbsp;<em>consumer internet</em>.&nbsp;</p>
<p>For example, at Lokad, we have found that the largest datasets found in retail could still be <a href="http://blog.lokad.com/journal/2012/6/7/running-a-very-large-retail-network-on-a-smartphone.html">processed on a smartphone</a>&nbsp;if the data is correctly represented. In short, for the overwhelming majority of problems, the relevant data, once properly partitioned, take <strong>less than 1GB</strong>.</p>
<p>With datasets smaller than 1GB, you can <strong>keep experimenting on your laptop</strong>. Map-reducing stuff on the cloud is cool, but compared to local experiments on your notebook, cloud productivity is abysmal.</p>
<h3>2. Smarter problems trump smarter solutions</h3>
<p>Good developers love finding good solutions. Yet, when facing a Big Data problem, it is just too tempting to <em>improve</em>&nbsp;stuff, as opposed to <strong>challenging the problem</strong>&nbsp;in the first place.</p>
<p>For example at Lokad, as far as <em>inventory optimization</em> was concerned, we have been <a href="http://blog.lokad.com/journal/2012/3/12/quantiles-inventory-optimization-20.html">pushing years of efforts at solving the wrong problem</a>. &nbsp;Worse, our competitors have been spending <em>hundreds</em>&nbsp;of man-years of effort making the same mistake ...</p>
<p><em>Big Data</em> means being capable of processing large quantities of data while keeping computing resource costs negligible. Yet, most <em>problems</em>&nbsp;faced in the real world were <em>defined</em>&nbsp;more than 3 decades ago, at a time when <em>any</em>&nbsp;calculation (no matter how trivial) was a challenge to automate. Thus, those <em>problems</em>&nbsp;come with a strong bias toward <em>solutions</em>&nbsp;that were conceivable at the time.</p>
<p>Rethinking those <em>problems</em>&nbsp;is long overdue.</p>
<h3>3. Being non-intrusive is scalability-critical</h3>
<p>The scarcest resource of all is human time. Letting a CPU chew 1 million numbers is nothing. Having people <em>reading</em>&nbsp;1 million numbers takes an army of clerks.&nbsp;</p>
<p>I have already posted that manpower requirements of Big Data solutions were the most frequent <a href="http://blog.lokad.com/journal/2012/1/3/big-data-in-retail-a-reality-check.html">scalability bottleneck</a>. Now, I believe that if any human has to read numbers from a Big Data solution, then the solution won't scale. <em>Period</em>.</p>
<p>Like AntiSpam filters, Big Data solutions need to tackle problems from an angle that does not require any attention from anyone. In practice, it means that problems have to be <em>engineered</em>&nbsp;in a way so that they can be solved without user attention.&nbsp;</p>
<h3>4. Too big for Excel, treat as Big Data</h3>
<p>While the community is frequently distracted by multi-terabyte datasets, anything that does not conveniently fit in Excel <em>is</em>&nbsp;Big Data as far as practicalities go:</p>
<ul>
<li>Nobody is going to have a look at that many numbers.</li>
<li>Opportunities exist to solve a&nbsp;<em>better</em>&nbsp;problem.</li>
<li>Any non-quasi-linear algorithm will fail at processing data in a reasonable amount of time.</li>
<li>If data is poorly architected / formatted, even sequential reading becomes a pain.</li>
</ul>
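<p>To put rough numbers on the quasi-linear point above, here is a back-of-the-envelope estimate, assuming (purely for illustration) a machine that performs about $10^9$ elementary operations per second and a dataset of $n = 10^6$ rows, roughly the size where Excel gives up:</p>
<p>$$n \log_2 n \approx 2 \times 10^{7}\ \text{operations} \approx 0.02\ \text{s}, \qquad n^2 = 10^{12}\ \text{operations} \approx 1000\ \text{s} \approx 17\ \text{min}$$</p>
<p>With $n = 10^8$ rows under the same assumption, the quasi-linear algorithm still finishes in about 3 seconds, while the quadratic one needs $10^{16}$ operations, that is about $10^7$ seconds or close to 4 months.</p>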
<p>Then comes the question: <em>how should one handle Big Data?</em>&nbsp;However, the answer is typically very domain-specific, so I will leave that to a later post.</p>
<h3>5. SQL is not part of the solution</h3>
<p>I won't enter (here) the debate <em>SQL vs NoSQL</em>, instead let's outline that whatever persistence approach is adopted, it won't help:&nbsp;</p>
<ul>
<li>figuring out if the problem is the proper one to be addressed,</li>
<li>assessing the usefulness of the analysis performed on the data,</li>
<li>blending Big Data outputs into user experience.</li>
</ul>
<p>Most of the discussions around Big Data end up <strong>distracted&nbsp;by persistence strategies</strong>. Persistence is a very <em>solvable</em>&nbsp;problem, so engineers <em>love</em> to think about it. Yet, in Big Data, it's the <a href="http://en.wikipedia.org/wiki/Wicked_problem">wicked parts</a> of the problem that need the most attention.</p>]]></description><wfw:commentRss>http://vermorel.com/journal/rss-comments-entry-16981380.xml</wfw:commentRss></item><item><title>Happy talk detector</title><category>Social best practices</category><category>Software Business</category><category>Website Usability</category><category>writing</category><dc:creator>Joannes Vermorel</dc:creator><pubDate>Mon, 21 May 2012 12:02:26 +0000</pubDate><link>http://vermorel.com/journal/2012/5/21/happy-talk-detector.html</link><guid isPermaLink="false">450038:5042156:16366940</guid><description><![CDATA[<p>Over the last couple of months, I have been pushing a lot of content on my company website (<a href="http://www.lokad.com/">Lokad.com</a>), and proofreading a lot of texts produced by colleagues too. The more I write, the more I realize that fighting our <a href="http://vermorel.com/journal/2009/4/18/9-steps-to-make-sure-your-startup-exists.html">innate instinct to produce happy talk</a>&nbsp;is a tough battle.</p>
<p>Recently, I came up with a simple rule to detect most&nbsp;<em>happy talk</em>&nbsp;content:</p>
<blockquote>
<p>When, by replacing a sentence with its negation, the resulting message seems totally out of place, then odds are that the sentence was not carrying much of a message in the first place.</p>
</blockquote>
<p>For example, it might be tempting to write down on a company website&nbsp;<em>We strive for excellence</em>; however, if you think of the opposite, <em>We strive for mediocrity</em>, it becomes clear that nobody would claim the latter version. Hence, since the latter&nbsp;<em>is</em> obvious, the former has to be too.</p>
<p>The trick is purely psychological though. When producing an assertion of some kind, our mind - at least mine for sure - seems better at spotting&nbsp;<em>oddities</em>&nbsp;than at recognizing <em>the obvious</em> as such.&nbsp;</p>]]></description><wfw:commentRss>http://vermorel.com/journal/rss-comments-entry-16366940.xml</wfw:commentRss></item><item><title>Bizarre pricing, does it matter? (B2B)</title><category>Lokad</category><category>Lokad</category><category>Software Business</category><category>pricing</category><dc:creator>Joannes Vermorel</dc:creator><pubDate>Mon, 19 Mar 2012 20:08:45 +0000</pubDate><link>http://vermorel.com/journal/2012/3/19/bizarre-pricing-does-it-matter-b2b.html</link><guid isPermaLink="false">450038:5042156:15427349</guid><description><![CDATA[<p>My company has just released a <a href="http://blog.lokad.com/journal/2012/3/12/quantiles-inventory-optimization-20.html">quantile forecasts</a>&nbsp;upgrade. It's no less than a small revolution for us; however, unless you've got some inventory to manage, it's probably not too relevant to your business.</p>
<p>Another salient aspect is our&nbsp;<a href="http://www.lokad.com/pricing.ashx">new pricing for quantiles</a>&nbsp;(the old pricing for classic forecasts remains untouched). Lokad is selling a monthly subscription, and if $q_i$ represents one of the actual <em>quantile values</em> retrieved by the client during the month, then the monthly cost $C$ is given by:</p>
<p>$$C = \$0.15 \times \left(\sum_{i=0}^n q_i^{2/3} \right)^{2/3}$$</p>
<p><del>We hesitated to round 0.15 as $\frac{\pi}{2}$ because formulas look better with Greek letters.</del>&nbsp;Obviously, it's not <em>simple</em>, and most people would go as far as saying it's <strong>downright obscure</strong>, but is it really <strong>good pricing, or just plain insanity?</strong></p>
<p>To understand a bit where Lokad is coming from, let's start with the fact that we are a <strong>B2B software</strong> company. About 95% of competitors don't have any kind of <em>public</em> pricing: you can only ask for a <em>quote</em>, and then a talented sales guy will contact you to figure out your <em>maximum budget</em>, only to get back to you with a quote at 120% of the figure you gave him.</p>
<p>However, I strongly favor public pricing, not because it's more transparent, honest, fair, whatever, but because it's a massive&nbsp;<strong>time saver</strong>. At Lokad, we don't enter into time-consuming pricing negotiations except for the largest clients, where it does make sense to spend time negotiating.</p>
<p>The cardinal rule of software pricing is that it should capture the <strong>willingness to pay</strong>&nbsp;of the client, which, in B2B, is typically related to the economic gains generated by the usage of the product. In the case of demand forecasting, benefits can be <a href="http://www.lokad.com/accuracy-gains-(inventory).ashx">accurately computed</a>. However, turning this&nbsp;<em>forecasting&nbsp;benefits</em>&nbsp;formula into a pricing formula is insanely complex in the general case.</p>
<p>Hence, we decided to settle for&nbsp;<strong>heuristics</strong>&nbsp;that somehow mimic this theoretical willingness to pay, ran many simulations over our existing customer base, and finally figured out the formula. I do not claim that this pricing formula is <em>optimal</em>&nbsp;in any way: it is not. However, it does bring a very reasonable pricing for clients ranging from 1-man companies to 100,000+ employee companies.</p>
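<p>For readers who want to try the formula on their own numbers, here is a minimal sketch of the computation. The function name <em>monthly_cost</em> and the sample quantile values are hypothetical; only the 0.15 rate and the two 2/3 exponents come from the formula above.</p>

```python
import math


def monthly_cost(quantiles, rate=0.15):
    """Monthly cost in USD: rate * (sum of q_i^(2/3))^(2/3)."""
    inner = sum(q ** (2 / 3) for q in quantiles)
    return rate * inner ** (2 / 3)


# Hypothetical clients retrieving identical quantile values of 8 units.
small = monthly_cost([8] * 100)       # 100 quantile values in the month
large = monthly_cost([8] * 100_000)   # 100,000 quantile values in the month
print(round(small, 2), round(large, 2))
```

<p>With identical quantile values, the cost grows like the number of values to the power 2/3: retrieving 1,000 times more values multiplies the bill by 100, not by 1,000, and there is no step where the price suddenly jumps.</p>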
<p>Pros:</p>
<ul>
<li><em>(As far as we can judge)</em> It's aligned with the value Lokad creates for clients.</li>
<li>It's still simple enough to be memorized in 20s.</li>
<li>It does not create an incentive to <em>game</em>&nbsp;the pricing by excluding slow movers (i.e. products with low sales) from the forecasting process.</li>
<li>There is no <em>threshold</em>&nbsp;effect, where the pricing jumps to a much larger number just because the company has 1 more product than what the license would support.</li>
</ul>
<p>Cons:</p>
<ul>
<li>It certainly falls into the category of <em>bizarre</em>&nbsp;pricing.</li>
<li>The only way to know <em>for sure</em> the real monthly cost is to give it a try (1).&nbsp;</li>
<li>Some prospects try the pricing formula on their own, and get it wrong (2).</li>
</ul>
<p>(1) This statement applies to most metered SaaS, even if the pricing is linear. For example, at Lokad we had very little clue about our exact bandwidth consumption until we migrated toward the cloud (with dedicated servers, bandwidth was part of the package).</p>
<p>(2) I believe this partly explains why 95% of our competitors don't put any public price on display. That, and the fact that a very expensive pricing is likely to scare away prospects, before getting the chance of cornering them into the sales process.</p>
<p><em>I would be interested to see if other B2B niches have designed their own bizarre pricing formulas. Don't hesitate to submit them in comments.</em></p>]]></description><wfw:commentRss>http://vermorel.com/journal/rss-comments-entry-15427349.xml</wfw:commentRss></item><item><title>Cloud questions from Syracuse University, NY</title><category>Cloud Computing</category><category>Lokad</category><category>Lokad</category><category>cloudcomputing</category><dc:creator>Joannes Vermorel</dc:creator><pubDate>Wed, 22 Feb 2012 16:45:58 +0000</pubDate><link>http://vermorel.com/journal/2012/2/22/cloud-questions-from-syracuse-university-ny.html</link><guid isPermaLink="false">450038:5042156:15143223</guid><description><![CDATA[<p>A few days ago, I received a couple of questions from a student of Syracuse University, NY who is writing a paper about cloud computing and virtualization. Questions are relatively broad, so I am taking the opportunity to directly post here the answers.</p>
<blockquote>
<p>What was the actual technical and business impact of adopting cloud technology?</p>
</blockquote>
<p>The technical impact was a complete rewrite of our codebase. It has been the largest upgrade ever undertaken by Lokad, and it spanned 18 months, more or less mobilizing the entire dev workforce during the transition.</p>
<p>As far as business is concerned, it meant that most of Lokad's business was stalled for a year or so during 2010 (the peak of our cloud migration). For a young company, 1 year of delay is a <em>very</em> long time.&nbsp;</p>
<p>On the upside, before the migration to the cloud, Lokad was stuck with SMBs. Serving any mid-large retail network was beyond our technical reach. With the cloud, processing super-large retail networks became feasible.&nbsp;</p>
<blockquote>
<p>What, if any, negative experience did Lokad encounter in the course of migrating to the cloud?</p>
</blockquote>
<p>Back in 2009, when we started to ramp up our cloud migration efforts, the primary problem was that none of us at Lokad had any in-depth experience of what the cloud <em>implies</em>&nbsp;as far as software architecture is concerned. Cloud computing is not just <em>any kind</em>&nbsp;of distributed computing, it comes with a rather <a href="http://vermorel.com/journal/2011/4/5/a-few-design-tips-for-your-nosql-app.html">specific mindset</a>.</p>
<p>Hence, the first obstacle was to figure out by ourselves patterns and practices for enterprise software on the cloud. It has been a tedious journey to end up with <a href="http://lokad.github.com/lokad-cqrs/">Lokad.CQRS</a>&nbsp;which is roughly the 2nd generation of native cloud apps. We rewrote everything for the cloud once, and then we did it again to get something simpler, leaner, more maintainable, etc.</p>
<p>Then, at present time, most of our recurring cloud problems come from integrations with legacy <em>pre-Web</em>&nbsp;enterprise software. For example, operating through&nbsp;<a href="http://en.wikipedia.org/wiki/Virtual_private_network">VPNs</a>&nbsp;from the cloud tends to be a huge pain. In contrast, modern apps that offer REST APIs are a much more natural fit for cloud apps, but those are still rare in the enterprise.</p>
<blockquote>
<p>From your current perspective, what, if anything, would you have done differently?</p>
</blockquote>
<p>Tough question, especially for a data analytics company such as Lokad where it can take 1 year to figure out the <a href="http://vermorel.com/journal/2011/10/14/oddities-of-machine-learning-software-code.html">100 <em>magic</em> lines of code</a> that will let you outperform the competition. Obviously, if we had to rewrite Lokad from scratch <em>again</em>, it would take us much less time. However, that would dismiss the fact that the bulk of the effort has been the R&amp;D that made our forecasting technology <em>cloud native</em>.</p>
<p>The two technical aspects where I feel we have been hesitating for too long were SQL and SOAP.</p>
<ul>
<li>It took us too long to decide to ditch SQL entirely in favor of some native cloud storage (basically the Blob Storage offered by Windows Azure).</li>
<li><a href="http://vermorel.com/journal/2010/12/22/a-few-tips-for-web-api-design.html">SOAP was a somewhat similar case</a>. It took us a long time to give up on SOAP in favor of REST.</li>
</ul>
<p>In both cases, the problem was that we had (or maybe it was <a href="http://www.lokad.com/aboutus.ashx">just me</a>) not fully accepted the <em>extent</em>&nbsp;of the implications of a migration toward the cloud. We remained stuck for months with older paradigms that caused a lot of unneeded friction. Giving up on those from Day 1 would have saved a lot of effort.</p>]]></description><wfw:commentRss>http://vermorel.com/journal/rss-comments-entry-15143223.xml</wfw:commentRss></item><item><title>Goodbye Subversion, you served me well</title><category>Software Business</category><category>svn</category><dc:creator>Joannes Vermorel</dc:creator><pubDate>Mon, 23 Jan 2012 14:43:39 +0000</pubDate><link>http://vermorel.com/journal/2012/1/23/goodbye-subversion-you-served-me-well.html</link><guid isPermaLink="false">450038:5042156:14695401</guid><description><![CDATA[<p>I had been a <strong>long time Subversion</strong> user even before I started my company. Since 2006, the data analytics core of Lokad had been managed over SVN which proved to be a very robust piece of software (combined with TortoiseSVN).</p>
<p>We had a few hiccups where the easiest way forward was to delete the local version and check-out again, but otherwise, our <a href="http://hosted-projects.com/">SVN hoster</a>&nbsp;has been <strong>operating flawlessly over 5 years</strong>, which is a <em>long time</em>&nbsp;as far as software technology goes.</p>
<p>After more than <strong>13,000 commits over SVN</strong>, we have finally migrated the forecasting core, the 2nd most complex software part, right after <a href="http://abdullin.com/cqrs">accounting and billing</a> :-) toward Git.</p>
<p>Internally, after <strong>hesitating a lot between Mercurial and Git</strong>, we finally opted for Git primarily because of&nbsp;GitHub&nbsp;where we now host our <a href="https://github.com/lokad">open source projects</a>.&nbsp;</p>
<p>There is a bit of nostalgia looking at good old tools depart. I am wondering whether Git will last for the next half-decade, or whether it will be supplanted by something that will make it look pale in comparison.&nbsp;</p>
<p>My personal take for the next 5 years:&nbsp;<strong>Git will stay but the technology battle will displace itself toward the collaborative tools</strong> that operate over Git (or Hg).</p>]]></description><wfw:commentRss>http://vermorel.com/journal/rss-comments-entry-14695401.xml</wfw:commentRss></item><item><title>MathJax, at last a decent way to post maths on the web</title><category>Algorithms</category><category>Website Usability</category><category>mathml</category><dc:creator>Joannes Vermorel</dc:creator><pubDate>Wed, 11 Jan 2012 09:53:39 +0000</pubDate><link>http://vermorel.com/journal/2012/1/11/mathjax-at-last-a-decent-way-to-post-maths-on-the-web.html</link><guid isPermaLink="false">450038:5042156:14533244</guid><description><![CDATA[<p>For a long time, <strong>posting something as simple as a square root on the web has been a major pain</strong>. Despite&nbsp;<a href="http://www.w3.org/Math/">MathML</a>&nbsp;having been around for years,&nbsp;Firefox is still the only browser (that I know of) to <a href="http://www.w3.org/Math/XSL/csmall2.xml">render MathML</a> correctly.</p>
<p>$$p=\Phi\left(\sqrt{2\ln\left(\frac{1}{\sqrt{2\pi}}\frac{M}{H}\right)}\right)$$</p>
<p>Recently, I did stumble upon <a href="http://www.mathjax.org/">MathJax</a>, an <strong>outstanding JavaScript rendering engine for mathematics</strong> that works for all major recent browsers. The syntax is derived from the one of LaTeX, and the output is either MathML (if you have Firefox) or plain HTML/CSS otherwise.</p>
<p>Thanks to MathJax, I have been able to post a long-delayed analysis about <a href="http://www.lokad.com/service-level-definition-and-formula.ashx">optimal service levels</a>&nbsp;(that's the illustrating formula above) and <a href="http://www.lokad.com/economic-order-quantity-eoq-definition-and-formula.ashx">economic order quantity</a>. Kudos to the MathJax team!</p>]]></description><wfw:commentRss>http://vermorel.com/journal/rss-comments-entry-14533244.xml</wfw:commentRss></item><item><title>Instant transfer with Bitcoin but without 3rd parties</title><category>Algorithms</category><category>Software Business</category><category>bitcoin</category><dc:creator>Joannes Vermorel</dc:creator><pubDate>Tue, 20 Dec 2011 15:01:35 +0000</pubDate><link>http://vermorel.com/journal/2011/12/20/instant-transfer-with-bitcoin-but-without-3rd-parties.html</link><guid isPermaLink="false">450038:5042156:14193327</guid><description><![CDATA[<p><em>Update 2012-05-17: Double spending can be made extremely difficult through quasi-instant double spending attempt detection. See <a href="http://transactionradar.com/">TransactionRadar.com</a> as an illustration. I now believe that the ideas posted below are moot, because early double spending detection is just the way to go.</em></p>
<p>Bitcoin is a crypto-currency (check out my previous post for some more <a href="http://vermorel.com/journal/2011/8/3/bitcoin-thoughts-on-a-nascent-currency-system.html">introductory thoughts</a>) that provides many desirable properties such as decentralization, very low transaction fee, digital-native, ... However enabling<strong>&nbsp;instant payment</strong>&nbsp;has not been a <em>forte</em>&nbsp;of Bitcoin so far. It's very noticeable that people did even <a href="https://www.bitinstant.com/">raise funds</a> to address this problem with a trusted 3rd party setup.</p>
<blockquote>
<p>In this post, I will try to describe <strong>a convention that would offer instant (1) secure (2) decentralized (3) transactions with Bitcoin (4)</strong>.</p>
</blockquote>
<p>Let's start by clarifying the scope of this claim:</p>
<ol>
<li><strong>Instant</strong>. There is no such thing as <em>real-time</em>&nbsp;on the Internet, if only because of speed of light. Here, I am considering as <em>instant</em>&nbsp;anything <em>below 10 seconds</em>, which would be sufficient for the vast majority of the mundane use of a currency such as shopping.</li>
<li><strong>Secure</strong>. With Bitcoin, a transaction can be propagated in the network within seconds, yet, the transaction only becomes <em>secured</em>&nbsp;- aka with no further possibility of double spending - once the transaction has been included in the blockchain (6 blocks inclusion being the default of the Bitcoin client). Obviously, this requirement <em>somewhat conflicts with the previous one</em>, because 6 blocks represent about 1h on average (10min per block being the target speed of Bitcoin).</li>
<li><strong>Decentralized</strong>. The solution to reconcile 1 and 2 should not rely on a trusted 3rd party. I hold no grudge against BitInstant, but if a solution exists to do the same thing without middlemen, then I believe it will only make Bitcoin stronger.</li>
<li><strong>Bitcoin</strong>. The solution should preserve the Bitcoin protocol as it exists today, requiring no upgrade of the community, <em>except for those who would like to leverage instant payments</em>. It's a <em>convention</em>&nbsp;in the usage of Bitcoin that I am referring to: it fits into the existing protocol spec. Those who don't want to follow this convention can safely ignore the whole thing.</li>
</ol>
<p><em>Disclaimer: I am neither a cryptographer nor a security expert, merely an enthusiastic Bitcoin user.</em></p>
<p>The core idea of my proposal is to introduce a twist in the notion of <em>security</em>: <strong>instead of a strict prevention of double spending, let's make double-spending more expensive than the expected benefit.&nbsp;</strong>Indeed, if double-spending becomes possible but only at a steep cost (cost being expressed in Bitcoin too) then there is no incentive to actually make any <strong>widespread</strong> use of the double-spending trick for instant payments. With this twist, we accept the <em>possibility</em>&nbsp;of double spending, but only because it's highly inefficient for the attacker. It will not prevent a&nbsp;<em>crazy</em>&nbsp;attacker from doing some damage, but from a global perspective, the overall damage <em>through this twist&nbsp;</em>should stay insignificant (because there are so many <a href="http://www.forbes.com/sites/andygreenberg/2010/09/13/chinese-botnet-sells-point-and-click-cyberattacks/">better ways</a> to wreak havoc anyway if you're willing to spend money on the case).</p>
<p>For the convention that reconciles 1, 2, 3 and 4, I use two ingredients:</p>
<ul>
<li>A Bitcoin address that is <em>provably expensive</em>: the setup cost of the address is X BTC.&nbsp;</li>
<li>A mechanism that guarantees that no double-spending attack took place for the address in the past (blockchain-wise).</li>
</ul>
<p>Usual Bitcoin addresses are&nbsp;<em>quasi-free</em>&nbsp;(the CPU cost to generate a new address is negligible), but it's not difficult to produce a Bitcoin address that comes with a <em>provable</em>&nbsp;cost. The easiest way is to go for <em>monetary destruction</em> with a transaction that targets <em>/null</em>.&nbsp;Yet, destroying coins is not entirely satisfying.&nbsp;</p>
<p>Thus, in order to <em>prove</em>&nbsp;the value of the address AX,&nbsp;I propose to have <strong>a transaction, originating from a single address 1A only (only 1 input) that <em>by convention</em>&nbsp;redistributes its value to the <em>coinbase</em>&nbsp;address (*) of&nbsp;10 consecutive blocks</strong>&nbsp;that are less than 1 month old (at the time of the proof).</p>
<p><em>(*) It's the address of the first transaction of the block used by the miner himself to capture its reward.</em></p>
<p>Indeed, we cannot rely on transaction fee alone to <em>prove</em>&nbsp;the cost of address, because a miner could decide to create a ficticious high-fee transaction in a block - fictictious in the sense that the fee would cost nothing to the miner, who would immediately recover the fee through the ownership of the block.</p>
<p>Yet, by targeting 10 <em>consecutive</em>&nbsp;blocks, we prevent any miner to <em>fully</em> self-reward itself with the transaction. Indeed, blocks are assigned based on a lottery where the odds are proportional to the processing power injected in the process. A "smart" miner would be able to target one (**) of his block, lower the cost by 10% which does not compromise the pattern (the cost remains very real).</p>
<p><em>(**) Some super-heavy mining pool, like deepbit, could push the leverage further; but having a single mining operator representing more than 1/2 of the total hashing power of Bitcoin is&nbsp;a big problem for Bitcoin anyway; so I am assuming here that no operator has more than a fraction of the total computing power available</em>.</p>
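<p>To make the 10% figure concrete, here is a minimal Python sketch of the arithmetic. The function name and the assumption that the proof is split evenly across the targeted blocks are illustrative only; the convention described here does not fix either.</p>

```python
from fractions import Fraction

def net_proof_cost(x_btc, owned_blocks, total_blocks=10):
    """Net cost of a proof of x_btc split evenly over total_blocks coinbase addresses.

    owned_blocks is the number of targeted blocks mined by the issuer of the
    proof, who simply gets that share of the coins back.
    Hypothetical helper: the even split is an assumption of this sketch.
    """
    if owned_blocks not in range(total_blocks + 1):
        raise ValueError("owned_blocks must be between 0 and total_blocks")
    share = Fraction(x_btc) / total_blocks
    return share * (total_blocks - owned_blocks)
```

<p>With a 10 BTC proof, a miner who owns one of the 10 targeted blocks still pays a net 9 BTC, and anyone who owns none of them pays the full 10 BTC.</p>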
<p>Then, the <em>1 month old</em>&nbsp;restriction is just there to <strong>increase the odds that the coins do not get lost</strong>. Indeed, since the owners of the targeted addresses do not <em>expect</em>&nbsp;further funds to be pushed on those addresses, they may not even monitor them once they have been <em>emptied</em>. Yet, with the 1 month delay, the lucky reward will not go unnoticed.</p>
<p>Another argument in favor of rewarding the <em>coinbase</em>&nbsp;addresses is that it <strong>increases the incentive on mining efforts</strong>, hence strengthening Bitcoin as a whole.</p>
<p>Based on the convention established here above, we now have a way to prove that a Bitcoin address did cost at least X BTC to its owner. Yet, we still need a way to be sure that no double-spending attack has already been done.</p>
<p>Here, the intuition is the following: you cannot prevent double-spending with instant payment (aka without block validation), but you can <strong>expose <em>afterward&nbsp;</em>the double-spending attack </strong>which will destroy the trust invested in the <em>provably</em><em>&nbsp;expensive</em> address.</p>
<p>Let Alice be the honest merchant who offers instant Bitcoin payment; let Bob be the bad guy who is trying a double-spending attack on Alice.</p>
<p>At the moment of the transaction, Bob gives to Alice the content of the transaction Tx1 that has 1B as input (the address of Bob, proven to be expensive) and 1A as output (the address of Alice). Yet, at the very same time, Bob is issuing another transaction Tx2 that empties the address 1B. As a result, after a while, Alice realizes that Tx1 has been rejected.</p>
<p>It's now time for Alice to <em>retaliate</em>&nbsp;by exposing Bob. In order to do that, Alice produces a small dummy transaction to herself where the <strong>transaction Tx1 is recursively embedded as data</strong>&nbsp;through a convention based on <a href="https://en.bitcoin.it/wiki/Script">OP_DROP</a>. (***) Once the transaction Tx1 is exposed, the community of merchants who, like Alice, accept instant transactions witnesses that 1B cannot be trusted any more, because the cumulative effect of the transaction Tx2 going out of 1B and of the <em>exposed</em>&nbsp;transaction Tx1 (which never made its way to the block chain) leads to a negative coin amount on 1B.</p>
<p><em>(***) For the sake of concision I am leaving out the tiny specifics of how exactly this recursive transaction embedding should be implemented. Anyway, based on my understanding of Script, it's perfectly possible to recursively embed a transaction (treated as data) into another transaction.</em></p>
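<p>The "negative coin amount" check can be sketched in a few lines of Python. This is only an illustration of the bookkeeping: the function name and the flat lists of amounts are hypothetical, and a real check would also have to verify that each exposed transaction is validly signed by the owner of the address, which is left out here.</p>

```python
def is_address_exposed(funds_received, confirmed_spends, exposed_spends):
    """Return True when an address has signed away more coins than it held.

    funds_received   -- total amount ever received by the address (in satoshis)
    confirmed_spends -- amounts of the spends found in the block chain (e.g. Tx2)
    exposed_spends   -- amounts of the signed spends that were exposed by a
                        merchant but never made it to the block chain (e.g. Tx1)
    Hypothetical helper; signature verification is deliberately omitted.
    """
    balance = funds_received - sum(confirmed_spends) - sum(exposed_spends)
    return 0 > balance
```

<p>For instance, if 1B received 5 BTC, Tx2 spent those 5 BTC and Alice exposes a Tx1 worth 1 BTC, the cumulative amount is -1 BTC and 1B is flagged. An address whose spends all reached the block chain can never go negative this way.</p>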
<blockquote>
<p>At this point, we have a system where Bob, the bad guy, cannot hurt Alice, the merchant (recipient), without getting some retaliation. Yet, what if Alice is a bad merchant and Bob the honest client? Could Alice hurt Bob just for the sake of breaking the community's trust in his <em>provably expensive</em>&nbsp;address 1B?</p>
</blockquote>
<p>We need one final touch to the convention <strong>to protect Bob the sender from a false accusation by Alice</strong>. In order to achieve that, Bob should make sure each emitted transaction Tx1 from 1B, his <em>provably expensive</em>&nbsp;address, is broadcast to the network, and not just given to Alice. By doing this, Bob ensures that Tx1 will make its way to the blockchain and prevents Alice from reporting 1B as dishonest (to be safe, Bob is better off putting some transaction fee in Tx1 that guarantees a speedy chain inclusion).</p>
<h3>Implementing the convention</h3>
<p>As far as I can tell, the proposal does not involve any breaking change. Ideally, the convention would make its way to the Bitcoin client (or a dedicated fork) to support 3 extra features:</p>
<ul>
<li>Spending BTC to increase the trust level on a particular Bitcoin address.</li>
<li>Performing instant transactions channelled through the "expensive" Bitcoin address.</li>
<li>Reporting the "cost" of the address for the incoming transactions.&nbsp;</li>
</ul>
<p>Then, there are many small details that would need to be polished, such as the delay for the community to decide whether trust is lost on an address after being reported. The convention as a whole can probably be polished further too.</p>
<h3>Anonymous payments</h3>
<p>This convention would be one step further in making Bitcoin less anonymous than it is today. Considering the scope of application of instant payments, it does not seem (to me) too much of a problem. If you really want to stay anonymous, then, entering a retail store isn't top notch anyway. Alternatively, for eCommerce, the 1h payment delay is mostly a non-issue (except maybe for pizza delivery).</p>
<h3>In real life</h3>
<p>Instant payments are needed for small purchases: you typically don't need to transfer <strong>both a big amount AND to do it instantly</strong>, it's either one or the other. Whether an instant payment of X BTC made from a proved Y BTC address should go through instantly should be left to the merchant.</p>
<p>With a 10 BTC proof, it would be reasonable to accept instant payments up to 10 BTC (maybe a bit less assuming a self-serving miner scenario). Coordinating triple-spending (or more)&nbsp;<em>in real life</em>&nbsp;seems complicated (but not impossible), but I seriously doubt people would actually bother with such a complex scheme except to demonstrate its feasibility. Indeed, the stakes would be very limited anyway, as anything large would go the usual route of non-instant payments.&nbsp;</p>
<p>Then, looking at <strong>recurring customer payments with the same address</strong> would also be a way to gradually increase the confidence cap (from the merchant viewpoint) for instant payments, even without asking the client to increase its proof.</p>
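<p>The merchant-side decision described above could look like the following Python sketch. Everything in it is an assumption made for illustration: the function name, the 10% discount for a self-serving miner and the <em>trust_bonus_btc</em> parameter standing for the extra confidence a merchant grants to a recurring customer. Each merchant remains free to pick its own rule.</p>

```python
def accept_instantly(amount_btc, proof_btc, miner_discount=0.1, trust_bonus_btc=0.0):
    """Decide whether a merchant lets a payment go through without confirmation.

    amount_btc      -- amount of the instant payment
    proof_btc       -- proven cost of the sending address
    miner_discount  -- share of the proof a self-serving miner may have recovered
    trust_bonus_btc -- extra cap granted by this merchant to a recurring customer
    All parameters and default values are hypothetical.
    """
    cap = proof_btc * (1 - miner_discount) + trust_bonus_btc
    return cap >= amount_btc
```

<p>With the defaults, a 10 BTC proof lets a 5 BTC payment through instantly but not a 10 BTC one, unless the merchant has granted the customer some extra cap.</p>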
<p>Compared to a rough <strong>2% middleman fee</strong> (based on pricing of BitInstant), I feel that the <em>provably expensive</em> address would be amortized in less than 1 year considering weekly purchases. Not a deal breaker, but still an option probably worth having a look at considering the positive side-effect on the mining side.</p>]]></description><wfw:commentRss>http://vermorel.com/journal/rss-comments-entry-14193327.xml</wfw:commentRss></item><item><title>Lokad.Cloud vs Lokad.CQRS, tiny insights about the future</title><category>.NET</category><category>Cloud Computing</category><category>Lokad</category><category>Lokad</category><category>Lokad.CQRS</category><category>Lokad.Cloud</category><dc:creator>Joannes Vermorel</dc:creator><pubDate>Wed, 23 Nov 2011 22:11:38 +0000</pubDate><link>http://vermorel.com/journal/2011/11/23/lokadcloud-vs-lokadcqrs-tiny-insights-about-the-future.html</link><guid isPermaLink="false">450038:5042156:13845229</guid><description><![CDATA[<p>Among the (small) community interested in the software practices of Lokad to develop enterprise software over Windows Azure, <a href="http://lokad.github.com/lokad-cloud/">Lokad.Cloud</a> vs <a href="http://lokad.github.com/lokad-cqrs/">Lokad.CQRS</a> comes as a&nbsp;<a href="https://groups.google.com/forum/#!topic/lokad/LCLUoM2KY78">recurring question</a>.</p>
<p><em>It's a good question, and to be entirely honest, the case is not 100% solved even at Lokad</em>.&nbsp;</p>
<p>One of the core difficulties in addressing this question is that Lokad.Cloud and Lokad.CQRS come:</p>
<ul>
<li>from different <strong>backgrounds</strong>:     
<ul>
<li>Lokad.Cloud originates from the hard-core data analytics back-end.</li>
<li>Lokad.CQRS originates from our<em>&nbsp;</em>behavioral apps.</li>
</ul>
</li>
<li>with different <strong>intents</strong>:     
<ul>
<li>Lokad.Cloud wants to simplify hard-core distributed algorithmics.</li>
<li>Lokad.CQRS wants to provide flexibility, auditability, extensibility (*).</li>
</ul>
</li>
<li>and different <strong>philosophies</strong>:     
<ul>
<li>Lokad.Cloud is a sticky framework, it defines pretty much how your app is architected.</li>
<li>Lokad.CQRS is more a <a href="http://abdullin.com/journal/2011/6/18/lokadcqrs-framework-that-is-designed-for-farewells.html">NoFramework</a>, precisely designed to minimally impact the app.</li>
</ul>
</li>
</ul>
<p><em>(*) without compromising scalability, however scalability is not the primary purpose.</em></p>
<p>Then, historically, <strong>Lokad.Cloud has been developed first</strong>&nbsp;(which is a mixed blessing), and, as we have been moving forward, we have started to <strong>partition into standalone sub-projects</strong>:</p>
<ul>
<li><a href="https://github.com/Lokad/lokad-cloud-storage">Lokad.Cloud.Storage</a>, the O/C mapper (object to cloud), dedicated to the interactions with the Azure Storage.</li>
<li><a href="https://github.com/Lokad/lokad-cloud-apphost">Lokad.Cloud.AppHost</a>, an AppDomain isolation layer to enable dynamic assembly loading within Azure Worker roles (aka reboot a VM with new assemblies in 5s instead of 5min). (**)</li>
<li><a href="https://github.com/Lokad/lokad-cloud-provisioning">Lokad.Cloud.Provisioning</a>, a toolkit for the Windows Azure Management API.</li>
</ul>
<p><em>(**) Lokad.Cloud does not leverage Lokad.Cloud.AppHost yet, it still relies on a very similar component (which was developed first, and, as such, is not as properly decoupled as AppHost)</em></p>
<p>Those sub-projects end up combined into <em>Lokad.Cloud</em>&nbsp;but they can be used independently. Both <em>Lokad.Cloud.AppHost</em> and <em>Lokad.Cloud.Provisioning</em> are <strong>fully compatible with Lokad.CQRS</strong>.</p>
<p>The case of <em>Lokad.Cloud.Storage</em> is a bit more complicated because <strong>Lokad.CQRS already has its own <a href="https://github.com/Lokad/lokad-cqrs/tree/next/Framework/Lokad.Cqrs.Azure">Azure Storage layer</a></strong>&nbsp;which focuses on CQRS-style storage abstractions. In particular, Lokad.CQRS emphasizes <em>interoperable storage abstractions</em>&nbsp;where the local file storage can be used in place of the cloud storage.</p>
<h3>The Future</h3>
<p>As far as I can speak for Lokad.CQRS (see the <a href="http://abdullin.com/cqrs">project boss</a>), the project will keep evolving with a focus on <strong>enterprise software&nbsp;practices</strong>, aka not so much what the framework delivers, but rather how it's intended to structure the app. Then, Lokad.CQRS might be completed by:</p>
<ul>
<li>tools at some point such as a&nbsp;<em>maintenance console</em>.</li>
<li>refined storage abstractions (probably <em>event-centric</em> ones).</li>
</ul>
<p>In contrast, Lokad.Cloud will continue its <strong>partitioning process</strong> to become decoupled and more flexible. In particular,</p>
<ul>
<li>the cloud runtime</li>
<li>the service execution strategy</li>
</ul>
<p>are still very heavily coupled to other concepts within the execution framework, and likely candidates for sub-projects of their own.</p>
<h3>Combining Lokad.Cloud and Lokad.CQRS?</h3>
<p>I would not advise combining Lokad.Cloud (execution framework) with Lokad.CQRS <em>within the same app</em>. At Lokad, we don't have any project that adopts this pattern, and the resulting architecture seems fuzzy.</p>
<p>However, if we consider the sub-projects of Lokad.Cloud, then the combination&nbsp;<em>Lokad.CQRS + Lokad.Cloud.AppHost + Lokad.Cloud.Provisioning </em>does make a lot of sense.</p>
<p>Then, it's possible to adopt an SOA architecture where some <a href="http://vermorel.com/journal/2011/10/14/oddities-of-machine-learning-software-code.html">heavy-duty functional logic</a>&nbsp;gets isolated, behind an API, into the Lokad.Cloud execution framework, while the bulk of the app adopts CQRS patterns through Lokad.CQRS. This pattern has been adopted to some extent at Lokad.</p>]]></description><wfw:commentRss>http://vermorel.com/journal/rss-comments-entry-13845229.xml</wfw:commentRss></item><item><title>Oddities of machine learning software code</title><category>Algorithms</category><category>Lokad</category><category>Software Engineering</category><category>datamining</category><dc:creator>Joannes Vermorel</dc:creator><pubDate>Fri, 14 Oct 2011 13:24:05 +0000</pubDate><link>http://vermorel.com/journal/2011/10/14/oddities-of-machine-learning-software-code.html</link><guid isPermaLink="false">450038:5042156:13257534</guid><description><![CDATA[<p><span class="full-image-float-left ssNonEditable"><span><img src="http://vermorel.com/storage/melting-wall-clock.jpg?__SQUARESPACE_CACHEVERSION=1318597564826" alt="" /></span></span>Developing machine learning software is <em>special</em>. I have already described <a href="http://blog.lokad.com/journal/2009/5/9/machine-learning-company-whats-so-special.html">how it feels to be in a machine learning company</a>, but let's be a bit more specific concerning the code itself.</p>
<p>One of the most <em>shocking</em>&nbsp;aspects of machine learning code is that it tends to be <strong>full of super-short cryptic 1-letter or 2-letter variable names</strong>. This goes completely against the general naming conventions which emphasize <a href="http://msdn.microsoft.com/en-us/library/ms229045.aspx">readability over brevity</a>. Yet, over the years, I have found that those compact names were best <em>for mathematical / statistical / numerical algorithms</em>.</p>
<p>Indeed,</p>
<ul>
<li>Logic is typically overly <em>intricate</em>, with tons of nested loops and seemingly random stopping conditions. Hence, even if the variables were perfectly readable, the logic would remain a show-stopper for any fast-reading attempt.</li>
<li>Variables typically hold <em>intermediate computational results</em>, which cannot be associated with 2 or 3 English words without being extremely ambiguous at best. It's not <em>a = OnButtonClick</em>&nbsp;but rather <em>a = InterpolatedDetrendedDeseasonalizedQuantile90PercentsOfPromotionEffects</em>.</li>
</ul>
<p>As a result, extreme variable name brevity makes the code much more compact, which in turn makes it easier to understand the logic. It forces the coder digging into the code to<strong> learn by heart&nbsp;the semantics of the variables</strong>&nbsp;(because names are cryptic), but this effort is only marginal compared to the amount of effort to grasp the logic itself anyway.</p>
<p>Then, <strong><a href="http://en.wikipedia.org/wiki/Magic_number_(programming)">magic numbers</a> are all over the place</strong>, frequently inlined with the rest of the code. Again, for non-machine-learning code, magic numbers are a big NO-NO, and a cardinal rule of sane software design consists of a <em>clear separation between data and logic</em>. Yet, in statistical algorithms, those seemingly random numerical values are the result of the incremental tuning that is necessary to obtain the desired performance and accuracy.</p>
<ul>
<li>There is no benefit in isolating the <em>magic number</em>, because it is used only once.</li>
<li>The actual numerical value is typically more insightful than the variable name. It helps the developer to get a <em>sense</em>&nbsp;of the behavior of the algorithm.</li>
</ul>
<p>Then, it remains a good practice to add a lot of <strong>inline comments</strong> to justify the purpose of the magic numbers, and how they have been optimized.</p>
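<p>To give a feel for the style, here is a tiny hypothetical Python example (not taken from any real code base): a one-step-ahead exponential smoothing with one-letter names and an inlined, commented magic number. The value 0.3 is made up for the illustration.</p>

```python
def f(y):
    """One-step-ahead forecast of the series y by exponential smoothing."""
    s = y[0]  # s: current smoothed level
    for v in y[1:]:
        # 0.3: smoothing factor, a made-up value for this illustration;
        # in real code the comment would say how and on which data it was tuned.
        s += 0.3 * (v - s)
    return s
```

<p>Pulling 0.3 out into a named constant would buy very little here, as it is used only once, while reading the value in place tells right away how reactive the smoothing is.</p>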
<p><strong>If your code is super fast, you're probably getting it wrong</strong>. For most machine learning problems, it's better to try to take advantage of the outrageously large amount of processing power available nowadays to improve results. I am not saying that super fast code is bad in itself, but if your code is super fast, then it means that you've got room to go for more complex methods that would consume more resources in exchange for better, more accurate, results.</p>
<p><a href="http://en.wikipedia.org/wiki/Unit_testing">Unit tests</a> are both very handy to validate small block pure-mathematical operations, and yet,&nbsp;<strong>quasi-useless</strong>&nbsp;for the bulk of the machine learning logic. Indeed, when it comes to statistical accuracy, there is no black &amp; white, but only <em>shades of gray</em>. As long as performance is acceptable, the overall accuracy is the only metric that matters. In particular, it happens, from time to time, that a <em>bug</em>&nbsp;- aka a piece of code that does not reflect the original&nbsp;<em>intent</em>&nbsp;of the developer - turns out to behave well over the data. On average, <em>bugs</em> tend to degrade accuracy, but sometimes a bug just stumbles upon an interesting (and counter-intuitive) behavior.</p>
<p>Finally, <strong>Object-Oriented Programming is still around, but seldom used</strong>:&nbsp;<a href="http://www.nicollet.net/2011/10/functional-programming/">Functional Programming</a> is king. This pattern reflects the fact that the machine learning problem itself, either <a href="http://en.wikipedia.org/wiki/Classifier_(mathematics)">classification</a> or <a href="http://en.wikipedia.org/wiki/Regression_analysis">regression</a>, is nothing but trying to build a big complex function to tackle real-world data.</p>]]></description><wfw:commentRss>http://vermorel.com/journal/rss-comments-entry-13257534.xml</wfw:commentRss></item></channel></rss>