Author

Portrait of Joannes Vermorel

I am Joannes Vermorel, founder at Lokad. I am also an engineer from the Corps des Mines who initially graduated from the ENS.

I have been passionate about computer science, software matters and data mining for almost two decades. (RSS - ATOM)

Meta
Tags

Entries in algorithm (6)

Saturday
Feb272010

MapReduce as burstable low-cost CPU

About two months ago, when Mike Wickstrand setup a UserVoice instance for Windows Azure, I immediately posted my own suggestion concerning MapReduce. MapReduce is a distributed computing concept initially published by Google late 2004.

Against all odds, my suggestion, driven by the needs of Lokad, made it into the Top 10 most requested features for Windows Azure (well, 9th rank and about 20x times less voted than the No1 request for scaled down hosting).

Lately, I had the opportunity to discuss more with folks at Microsoft gathering market feedback on this item. In software business, there is frequent tendency for users to ask for features they don't want in the end. The difficulty being that proposed features may or may not correctly address initial problems.

Preparing the interview, I realized that, to some extend, I had fallen for the same trap when asking for MapReduce. Actually, we have already reimplemented our own MapReduce equivalent, which is not that hard thanks to the Queue Storage.

I care very little about framework specifics, may it be MapReduce, Hadoop, DryadLinq or something not-invented-yet. Lokad has no cloud legacy calling for a specific implementation.

What I do care about is much simpler. In order to deliver truckloads of forecasts, Lokad needs :

  1. large scale CPU
  2. burstable CPU
  3. low cost CPU

Windows Azure is already doing a great job addressing Point 1. Thanks to the massive Microsoft investments on Azure datacenters, thousands of VMs can already be instantiated if needed.

When asking for MapReduce, I was instead expressing my concern for Point 2 and Point 3. Indeed,

  • Amazon MapReduce offers 5x cheaper CPU compared to classical VM-based CPU.
  • VM-based CPU is not very burstable: it takes minutes to spawn a new VM, not seconds.

Then, low-cost CPU is somehow conflicting with burstable CPU, as illustrated by the Reserved Instances pricing of Amazon.

As far low-level cloud computing components are concerned, lowering costs usually mean giving up on expressiveness as a resulting trade-off:

  • Relational DB at $10/GB too expensive? Go for NoSQL storage at $0.1/GB, much cheaper, but much weaker as far querying capabilities are concerned.
  • Guaranteed VMs too expensive? Go for Spot VMs, price is lower on average but you've have no more certainties about either the price or the availability of VMs.
  • Latency of cloud storage too high? Go for CDN, latency is much better for reads, yet, much worse for writes.

Seeking large scale burstable CPU, here are the list of items that we would be very willing to surrender in order to lower the CPU pricing:

  • No need for local storage. VM comes with 250GB hard-drive, which we typically don't need.
  • No need for 2GB of memory. Obviously, we still need a bit of memory but 512MB would be fine.
  • No need for any level of access to the OS
  • Runtime could be made .NET only, and restricted to safe IL (which would facilitate code sandboxing).
  • No need for generic network I/O. Contrained accesses to specific Tables / Queues / Containers would be fine. This would facilitate colocation of storage and CPU.
  • No need for geolocalized resources. Cloud can push wherever CPU is available. Yet, we would expect no to be charged from bandwidth that happens between cloud data centers (if the transfer is caused by offsite computations).
  • No need for fixed pricing. Prioritization of requests based on a variable pricing would be fine (considering that the CPU price could be lowered in average).

Obviously, options are plenty to drag the price down in exchange of a more constrained framework. Since Azure has the unique opportunity to deliver some very .NET oriented features, I am especially interested by approaches that would leverage sandboxed code executions - giving up entirely on the OS itself to purely focus on the .NET Runtime.

I am very eager to see how Microsoft will be moving forward on this request. Stay tuned.

Tuesday
May192009

FIPFO - First In Probably First Out

The FIFO (First In First Out) is a very well known concept in computer science. In one of my previous post, I used the word FIPFO to refer to First In Probably First Out to refer to the cloud equivalent of the FIFO.

Indeed, the basic idea behind that term is that you can't scale much pure FIFOs due to synchronization constraints. Yet, if you just loosen a little bit the semantic, that is to say, FIPFO, then you have an infinitely scalable data structure.

Considering its simplicity and usefulness, I believe that FIFPO will be ubiquitous in future cloud computing applications.

Matthieu, a colleague at Lokad, was asking me if FIPFO was a well-known concept or yet another wacky term that I had just made-up on my blog. Well, sorry, I can't really remember. FIPFO seems just to be a very appropriate way to describe data structure such as the Windows Azure Queue, but I am not sure if I read about it elsewhere now.

According to Google, there is just a single other result at the time for FIPFO by another person who came up with this term a few days before my initial post while describing the Drupal API. Yet, I had never read the Drupal forums before, so I guess we did separately end up with the same idea and terminology.

Friday
May152009

Machine learning company, what’s so special?

Developing a machine learning startup is a very particular venture. Check out: Machine learning company, what’s so special? (based on my experience at Lokad)

Monday
Mar232009

High-perf SelectInParallel in 120 lines of C#

A few months ago at Lokad, we started working on 8-core machines. Multi-core machines need adequate algorithmic design to leverage their processing power; and such a design can be more or less complicated depending of the algorithm that you are trying to parallelize.

In our case, there were many situations where the parallelization was quite straightforward: large loops, all iterations being independents. At that time, PLinq, the parallelization library from Microsoft wasn't still available as a final product (it will be shipped with Visual Studio 2010). Thus, since we were quite in a hurry, we decided to code our own SelectInParallel method (code being provided below). Basically, it's just Select but with a parallel execution for each item being selected.

Although, being surprisingly simple, we found out that, at least for Lokad, SelectInParallel alone was fitting virtually 99% of our multi-core parallelization needs.

Yet, when we did start to try to speed-up algorithms with our first SelectInParallel implementation, we did end-up stuck with poor speed-up ratio at 3x or even 2x where I was expecting near 8x speed-up.

At first, I thought it was an illustration of the Amdahl's law. But a more detailed performance investigation did show I was just plain wrong. The harsh reality was: threads, when not (very) carefully managed, involve a (very) significant overhead.

Our last SelectInParallel implementation is now 120 lines long with a quasi-negligible overhead, i.e. bringing a near linear speed-up with the number of CPU cores on your machine. Yet, this performance wasn't easy to achieve. Let's review two key aspects of the implementation.

Keep your main thread working: In the first implementation, we did follow the naive pattern: start N-threads (N being the number of CPUs), wait for them to finish, collect the results and proceed. Bad idea, if the amount of work happens to be small, then, simply waiting for your threads to start is going to be a performance killer. Instead, you should start N-1 threads, and get your calling thread working right away.

Avoid synchronization altogether: At first, we were using a Producer - Consumer threading pattern. Bad idea again: it produces a lot of locking contention, the work queue becoming the main bottleneck of the process. Instead, an arithmetic trick can be used to let the workers tackle disjoint workset right from the beginning and without any synchronization.

So far, we have been quite satisfied by our 120-lines ersatz to PLinq. Hope this piece of code can help a few other people to get the most of their many-core machines. If you have ideas to improve further the performance of this SelectInParallel implementation, just let me know.

using System;
using System.Threading;

namespace Lokad.Threading
{
///<summary>
/// Quick alternative to PLinq.
///</summary>
public static class ParallelExtensions
{
static int _threadCount = Environment.ProcessorCount;

/// <summary>Get or sets the number of threads to be used in
/// the parallel extensions. </summary>
public static int ThreadCount
{
get { return _threadCount; }
set
{
_threadCount = value;
}
}

/// <summary>Fast parallelization of a function over an array.</summary>
/// <param name="input">Input array to processed in parallel.</param>
/// <param name="func">The action to perform (parameters and all the members should be immutable!!!).</param>
/// <remarks>Threads are recycled. Synchronization overhead is minimal.</remarks>
public static TResult[] SelectInParallel<TItem, TResult>(this TItem[] input, Func<TItem,TResult> func)
{
var results = new TResult[input.Length];

if (_threadCount == 1 || input.Length == 1)
{
for(int i = 0; i < input.Length; i++)
{
results[i] = func(input[i]);
}

return results;
}

// perf: no more thread than items in collection
int threadCount = Math.Min(_threadCount, input.Length);

// perf: start by syncless process, then finish with light index-based sync
// to adjust varying execution time of the various threads.
int threshold = Math.Max(0, input.Length - (int) Math.Sqrt(input.Length) - 2*threadCount);
int workingIndex = threshold - 1;

var sync = new object();

Exception exception = null;

int completedCount = 0;
WaitCallback worker = index =>
{
try
{
// no need for lock - disjoint processing
for(var i = (int) index; i < threshold; i += threadCount)
{
results[i] = func(input[i]);
}

// joint processing
int j;
while((j = Interlocked.Increment(ref workingIndex)) < input.Length)
{
results[j] = func(input[j]);
}

var r = Interlocked.Increment(ref completedCount);

// perf: only the terminating thread actually acquires a lock.
if (r == threadCount && (int)index != 0)
{
lock (sync) Monitor.Pulse(sync);
}
}
catch (Exception ex)
{
exception = ex;
lock (sync) Monitor.Pulse(sync);
}
};

for (int i = 1; i < threadCount; i++)
{
ThreadPool.QueueUserWorkItem(worker, i);
}
worker((object) 0); // perf: recycle current thread

// waiting until completion or failure
while(completedCount < threadCount && exception == null)
{
// CAUTION: limit on wait time is needed because if threads
// have terminated
// - AFTER the test of the 'while' loop, and
// - BEFORE the inner 'lock'
// then, there is no one left to call for 'Pulse'.
lock (sync) Monitor.Wait(sync, 10.Milliseconds());
}

if(exception != null)
{
throw exception;
}

return results;
}
}
}
Monday
Nov192007

Delete-proof data paging

In order to retrieve a large amount of data from a SQL table, you need to resort to a data paging scheme. Conceptually, a typical paged SQL query looks like (the syntax is approximate and vaguely inspired from MS SQL Server 2005)

SELECT Foo.Bar
FROM Foo
WHERE RowNumber() BETWEEN @Index AND @Index + @PageSize
ORDER BY Foo.Id;

The queries are made iteratively until no rows get returned any more. Yet, this approach fails both if rows are added or deleted in the table Foo during the iteration.

If rows are inserted, then the RowNumbers() will be impacted. Some rows will see their number to be incremented.

  • The newly inserted rows maybe missed.

  • Certain rows are going to be retrieved twice.

In the overall, the situation isn't bad, because, after all, all rows that were present when the retrieval iteration started will be retrieved.

In the other hand, if rows are deleted, then some rows will get their RowNumbers() decremented. As a consequence, certain rows will never get retrieved. This issue is quite troublesome, because you would not expect deleted rows to prevent the retrieval of other (valid) rows.

One workaround is to this situation would be to add some IsDeleted flag to the table (instead of actually deleting rows, flags would be changed from false to true); and to purge the database only once a while. Yet, this solution has many drawbacks: table schema must be changed and all existing queries that target this table must add the extra condition IsDeleted = false.

A more subtle approach to the problem consists in changing the definition of the iterating index. Instead of using an index that correspond to a generate RowNumbers(), let directly use @IndexId, the greatest identifier retrieved so far. Thus, the paged query becomes

SELECT TOP @PageSize
FROM Foo
WHERE Foo.Id > @IndexId
ORDER BY Foo.Id

With this new approach, we are now certain that no rows will be skipped during the retrieval process. Plus, we haven't made any change to the table schema.