Monday, February 02, 2009

Experiments with Azure - changing the structure of live data

Before I start......

Sorry if this blog post is a bit too brief and too technical. I could write about this topic all day and still not get it covered - so I've gone brief and techie. You will probably need at least a little experience in Azure (or GAE or EC2) to understand this...

On with the post....

The problems...

I wanted to update my http://www.stacka.com and http://www.clouddotnet.com Azure apps - especially adding recent comments and recent ratings to the front page.

And so the problems started.... it was very revealing about my data structure....

Firstly, I had organised my comments and my ratings (partition key and row key) so that they were easily accessible in time order from an individual "stacka" (or site in the clouddotnet case):
- I had the PartitionKey as StackId
- I had the RowKey as a reverse time index (a bit like Steve Marx's blog examples - see http://blog.smarx.com)
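
As a quick illustration of that reverse time index idea, here's roughly what the original key scheme looks like. This is a simplified sketch rather than the production class - in the real code CommentRow derives from the table entity base class provided by the StorageClient sample library, and it has more properties than shown here:

```csharp
using System;

// Simplified sketch of the original key scheme (not the production class).
public class CommentRow
{
    public string PartitionKey { get; set; }   // StackId - all comments for one stack share a partition
    public string RowKey { get; set; }         // reverse time index - newest comment sorts first
    public string Comment { get; set; }

    // Ticks remaining until DateTime.MaxValue, zero-padded to 19 digits
    // so that string ordering matches numeric ordering.
    public static string ReverseTimeRowKey(DateTime whenUtc)
    {
        return (DateTime.MaxValue.Ticks - whenUtc.Ticks).ToString("d19");
    }
}
```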

Problem 1: Because there is no "order by" allowed in the LINQ for Azure table storage, results come back in partition key, row key order... this presented a challenge in getting all my data back in the right order

Secondly, I had not stored simple data like Stack title (site title for clouddotnet) in the rating and comment entries

Problem 2: Because there is no "join" allowed in the LINQ for Azure table storage (nor is Contains allowed), listing comments/ratings alongside their stack/site names was going to be really slow.

Thirdly, I wanted to present a random set of Stacks (or sites) - rather than just the latest

Problem 3: How do you pick a random set when there's no order by, no count, etc. available?


To solve these problems....

I had to go back to my data schema - which in Azure, of course, is just the public properties of my data classes.

For problem 1:
  • I've bodged it....
  • While my data size is so small, I'm actually pulling back all the comments and ratings into app memory and sorting them there. To take the edge off the slowness I'm caching the results in HttpContext.Current.Cache with a one minute absolute expiration (there's a rough sketch after this list).
  • What I need to do in the longer term is to change the PartitionKey and RowKey so that I can get results returned in time order while still being able to search quickly by stack - I think this will be simple enough to do, but it will probably require a change of table name.
  • (An alternative in the long term would be to add another table to act as an index - but I think in this case that is not needed)
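
For what it's worth, the bodge looks something like this. It's a sketch rather than the real code: getAllComments stands in for whatever LINQ query pulls the whole table back, the cache key is made up, and ratings work the same way as comments:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Web.Caching;

public static class RecentComments
{
    // getAllComments stands in for the LINQ-to-table-storage query that
    // pulls every CommentRow back from Azure table storage.
    public static List<CommentRow> Latest(int howMany,
                                          Func<IEnumerable<CommentRow>> getAllComments)
    {
        const string cacheKey = "RecentComments";

        var sorted = HttpContext.Current.Cache[cacheKey] as List<CommentRow>;
        if (sorted == null)
        {
            // Sort in app memory because table storage won't do "order by" for us.
            // RowKey is the reverse time index, so ascending RowKey = newest first,
            // regardless of which partition (stack) each comment belongs to.
            sorted = getAllComments().OrderBy(c => c.RowKey).ToList();

            // One minute absolute expiration - cheap front page, not too stale.
            HttpContext.Current.Cache.Insert(
                cacheKey, sorted, null,
                DateTime.UtcNow.AddMinutes(1),
                Cache.NoSlidingExpiration);
        }

        return sorted.Take(howMany).ToList();
    }
}
```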

For problem 2:
  • I had to add a public property StackTitle to the CommentRow class - and similarly to the RatingRow class.
  • I then had to write some code to update the existing data to fill in these StackTitle properties - I did this in a simple Windows Forms app that I ran on my local PC (sketched just after this list) - it used the same class libraries as the real ASP.Net app and upgraded the data live without any users noticing - very easy.
  • Then I changed the ASP.Net code to use the new structures and deployed the new code :)
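
The upgrade itself was just a loop over the existing rows. Something like the sketch below - hedged heavily, because the real thing went through my shared class libraries and the StorageClient sample library; here I'm talking to the underlying ADO.NET Data Services context directly, and the table name and the stackTitles dictionary are stand-ins:

```csharp
using System.Collections.Generic;
using System.Data.Services.Client;
using System.Linq;

public static class StackTitleUpgrader
{
    // context is the ADO.NET Data Services context that the table storage
    // client wraps; stackTitles maps StackId -> stack title and is loaded
    // once from the stacks table before this runs.
    public static void Run(DataServiceContext context,
                           IDictionary<string, string> stackTitles)
    {
        var comments = context.CreateQuery<CommentRow>("CommentRow").ToList();

        foreach (var comment in comments)
        {
            // PartitionKey is the StackId, so filling in the newly added
            // StackTitle property is a dictionary lookup here on the desktop,
            // not a per-row query back out to the cloud.
            comment.StackTitle = stackTitles[comment.PartitionKey];
            context.UpdateObject(comment);
        }

        // Push the updates back up (over https, so secure in transit).
        context.SaveChanges();
    }
}
```
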
For problem 3:
  • I've bodged it....
  • While my data size is so small, I'm actually pulling back all the stacks and selecting a random set in memory (sketch after this list).
  • What I need to do in the longer term is to create separate lists of random items - I think I would do this in a worker role - maybe creating a new random list once every minute - and storing up to 10 random lists in the Azure table storage at any one time. This would be simple enough to do - and not too bad on processing or redundancy.
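
The random-set bodge is about as basic as it sounds - pull everything back, shuffle in memory, take a few (again just a sketch; the shuffle-by-random-key trick is crude but fine at this size):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RandomPicker
{
    private static readonly Random Rng = new Random();

    // 'all' is the full list already pulled back from table storage -
    // fine while the data set is tiny, wasteful once it isn't.
    public static List<T> PickRandom<T>(IEnumerable<T> all, int howMany)
    {
        return all.OrderBy(item => Rng.Next())
                  .Take(howMany)
                  .ToList();
    }
}
```

The worker role idea is really just the same shuffle moved out of the request path, with the results parked back in table storage for the web role to read.
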
Some side notes:
  • Azure storage is a bit of a mindf*&k - you have to stop thinking relational - you definitely have to stop thinking normalised.
    • Do not read this and think - "this is horribly inefficient"!
    • Azure storage (like Amazon's SimpleDB and like Google's BigTable) is very very efficient
    • It's just that you have to keep thinking about efficiency in terms of speed of retrieving data for the user - not in terms of number of bytes stored or in terms of duplicated data.
    • It's a classic MIPS versus memory trade-off - in the world of the cloud, MIPS and memory are both cheap, but when it comes to responding to the user you cut MIPS at the expense of memory.
  • Changing the schema was really simple
  • Working out what is actually now stored in the data store is a bit tricky - at one point I added some rows to my live data store with an extra column... those cells will live on for all time now - but I haven't yet worked out how to find which rows actually have those cells.
  • The use of a desktop tool to upgrade the data was beautifully simple (and it used https - so it was secure)
  • As for the bodges, you could argue them either way:
    • They're a good programming approach - as long as I know they work and I know they can be changed, it's actually a solution that is scalable in development effort terms
    • They're a p1ss poor programming approach... there is definitely an argument that it's better to do it right first time...
And the conclusion:

You can see the results on http://www.stacka.com and http://www.clouddotnet.com.
Well, I think it was worth it anyway :)
