Author

Portrait of Joannes Vermorel

I am Joannes Vermorel, founder at Lokad. I am also an engineer from the Corps des Mines who initially graduated from the ENS.

I have been passionate about computer science, software matters and data mining for almost two decades,

Meta

Entries in patterns (7)

Friday
Sep182009

Azure Management API concerns

Disclaimer: this post is based on my (limited) understanding of the Azure Management API, I did start reading the docs only a few hours ago.

Microsoft has just released the first preview of their Management API for Windows Azure.

As far I understand the content of the newly released API (check the MSDN reference), this API just let you automates what was done manually through the Windows Azure Console so far.

At this point, I have two concerns:

  1. No way to adjust your instance count for a given role.

  2. Auto-management (*) involves loads of quirks.

(*) Auto-Management: the ability for a cloud app to scale itself up and down depending on the workload.

I am not really satisfied by this Management API as it does not seem to address basic requirements to easily scale up or down my (future) cloud app.

Being able to deploy a new azure package programmatically is nice, but we were already doing that in Lokad.Cloud. Thanks to the AppDomain restart trick, I suspect we will keep deploying that way, as the deployment through Lokad.Cloud is likely to be still 100x faster.

That being said, the Management API is powerful, but it does not seem to address auto-management, at least not in a simple fashion.

The single feature I was looking forward was being able to adjust the number of instances on-demand through a very very simple API that would have let me do three things:

  1. Create new instance for the current role.

  2. Shut down current instance.

  3. Get the status of instances attached to the current role.

That's it!

Notice that I am not asking here to deploy a new package, or to change the production/staging status. I just need to be able tweak the instance count.

In particular, I would expect a Non-SSL REST API to do those limited operations, much like the other REST API available for the cloud storage.

Indeed, security concerns related to the instance count management are nearly identical to the ones related to the cloud storage. Well, not really, as in practice securing your storage is way much more sensitive.

Monday
Sep142009

Thinking the Table Storage of Windows Azure

Disclaimer: I am not exactly a Table Storage expert. In this post, I am just trying to sort out my own thoughts about this service offered with Windows Azure. Check my follow-up post.

Soon after the release announcement of the release of our new O/C mapper (object to cloud) named Lokad.Cloud, folks on the Azure Forums raised the question of the Table Storage.

Although it might be surprising, Lokad.Cloud does not provide - yet - any support for Table Storage.

At this point, I feel very uncertain about Table Storage, not in the sense that I do not trust Microsoft to end-up with finely tuned product, but rather at the patterns and practices level.

Basically, the Table Storage is an entity storage that features three special system properties:

  • PartitionKey: a grouping criterion - data having the same PartitionKey being kept close.

  • RowKey: the unique identifier for the entity.

  • Timestamp: the equivalent of Blob Storage ETag.

So far, I got the feeling that many developers feel attracted toward the Table Storage for the wrong reasons. In particular, Table Storage is not a substitute of your old plain SQL tables:

  • No support for transactions.

  • No support for keys (let alone foreign keys).

  • No possible refactoring (properties are frozen at setup).

If you are looking for those features, you're most likely betting on the wrong horse. You should be considering SQL Azure instead.

Then, some might argue that SQL Azure won't scale above 10GB (at least considering the current pricing plans offered by Microsoft). Well, the trick is Table Storage won't scale either, at least not unless you're not very cautious with your queries.

AFAIK, the only indexed column of the Table Storage is the RowKey. Thus, any filtering criterion based on custom entity properties is likely to get abyssal performance as soon your Table Storage get large.

Well, sort of, the most probable scenario is like to to be worse as your queries are just going to timeout after exceeding 60s.

Again, my goal here is not to bash the Table Storage, but it must be understood that the Table Storage is clearly not a magically scalable equivalent of the plain old SQL tables.

Back to Lokad.Cloud, we did not consider adding Table Storage because we did not feel the need either although our forecasting back-end is probably very high in the currently complexity spectrum of the cloud apps.

Indeed, the Blob Storage is surprisingly powerful with very predicable performance too:

  • Storing complex objects is a non-issue with a serializer at hand.

  • A blob name prefix is a very efficient substitute to the PartitionKey.

Basically, it seems to me that any Table Storage operation can be executed with the same performance with the Blob Storage for now. Later on, when the Table Storage will start supporting secondary indexes, this situation is likely to evolve, but meantime I still cannot think a single situation that would definitively support Table Storage over Blob Storage.

Monday
Sep142009

O/C mapper - object to cloud

When we started to port our forecasting technology toward the cloud, we decided to create a new open source project called Lokad.Cloud that would isolate all the pieces of our cloud infrastructure that weren't specific of Lokad.

The project has been initially subtitled Lokad.Cloud - .NET execution framework for Windows Azure, as the primary goal of this project was to provide some cloud equivalent of the plain old Windows Services. We did quickly end-up with QueueServices which happens to be quite handy to design horizontally scalable apps.

But more recently, the project has taken a new orientation, becoming more and more an O/C mapper (object to cloud) inspired by the terminology used by O/R mappers. When it comes to horizontal scaling, a key idea is that data and data processing cannot be considered in isolation anymore.

With classic client-server apps, persistence logic is not supposed to invade your core business logic. Yet, when your business logic happens to become so intensive that it must be distributed, you end-up in a very cloudy situation where data and data processing becomes closely coupled in order to achieve horizontal scalability.

That, being said, close coupling between data and data processing isn't doomed to be an ugly mess. We have found that obsessively object-oriented patterns applied to Blob Storage can made the code both elegant and readable.

Lokad.Cloud is entering its beta stage with the release of the 0.2.x series, check it out.

Monday
Nov192007

Delete-proof data paging

In order to retrieve a large amount of data from a SQL table, you need to resort to a data paging scheme. Conceptually, a typical paged SQL query looks like (the syntax is approximate and vaguely inspired from MS SQL Server 2005)

SELECT Foo.Bar
FROM Foo
WHERE RowNumber() BETWEEN @Index AND @Index + @PageSize
ORDER BY Foo.Id;

The queries are made iteratively until no rows get returned any more. Yet, this approach fails both if rows are added or deleted in the table Foo during the iteration.

If rows are inserted, then the RowNumbers() will be impacted. Some rows will see their number to be incremented.

  • The newly inserted rows maybe missed.

  • Certain rows are going to be retrieved twice.

In the overall, the situation isn't bad, because, after all, all rows that were present when the retrieval iteration started will be retrieved.

In the other hand, if rows are deleted, then some rows will get their RowNumbers() decremented. As a consequence, certain rows will never get retrieved. This issue is quite troublesome, because you would not expect deleted rows to prevent the retrieval of other (valid) rows.

One workaround is to this situation would be to add some IsDeleted flag to the table (instead of actually deleting rows, flags would be changed from false to true); and to purge the database only once a while. Yet, this solution has many drawbacks: table schema must be changed and all existing queries that target this table must add the extra condition IsDeleted = false.

A more subtle approach to the problem consists in changing the definition of the iterating index. Instead of using an index that correspond to a generate RowNumbers(), let directly use @IndexId, the greatest identifier retrieved so far. Thus, the paged query becomes

SELECT TOP @PageSize
FROM Foo
WHERE Foo.Id > @IndexId
ORDER BY Foo.Id

With this new approach, we are now certain that no rows will be skipped during the retrieval process. Plus, we haven't made any change to the table schema.

Tuesday
Jun122007

Few traps for former C++ developers coding in C#

I have been proofreading a couple of time C# source code written by former C++ developers. Also C# and C++ have a lot in common at the syntax level, the conceptual framework underlying the two languages is really different. This post is a negligible attempt to point out the most common issues.

C# is garbage collected. You should not need any destructor. If you are using destructors, then it's probably wrong 99% of the time. If you have doubts, ignore destructors, it's the way to go with garbage collected languages.

Pre-processor statements are deprecated. Although C# does have statements like #if #else #endif, the pre-processor statements are de-facto a deprecated construct in C#. Like destructors, they are some very specific cases where those statements can be used. Yet, 99% of the time, using those statements would just be a design mistake.

C# comes with native guidelines. Microsoft has published very extensive guidelines about variable, method, class naming. Every single C# line you write is expected to be compliant with those guidelines.