Sunday, March 29, 2009

The challenge of scale

After the dotcom bubble burst, we gradually got a new one: web 2.0. This time, however, it is more fun. From a brief history of web 2.0, we can see that the birth of Google marked the infancy of this new age of the Internet. The most notable feature of this age is collective wisdom. You might also name the long tail, large-scale collaboration, and so on. The point is that users play the leading role on the stage. So what are the implications of this trend for technology? More users mean more page views and site traffic, and beyond that, a growing scale of traffic and data. How can we deal with it? The question is the same everywhere, but the answer varies from one company to another. highscalability.com has made a great contribution by helping the community learn from each other.

Since Google published papers on its secret weapons, many companies have disclosed their technology architectures and shared their experience in a variety of talks. Here is a simple classification of these architectures:

1. cloud computing
features: homegrown solutions built from scratch for large-scale data processing; distributed, fault-tolerant, highly available file system; distributed schemaless database/document store; computing grid/distributed job scheduler
example: Google, Amazon
technology: GFS, Bigtable, MapReduce, Chubby, Dynamo, EC2, S3, SimpleDB

2. LAMP
features: customized LAMP, some homegrown solutions, some clones of class 1
example: Yahoo, LiveJournal, YouTube, Flickr, Facebook
technology: Linux, LVS, Apache, MySQL, PHP, Squid, memcached, MogileFS, Perlbal, DJabberd, TheSchwartz, Spread, Hadoop, HBase, ZooKeeper, Hypertable

3. Java EE
features: classic N-tier architecture, 2PC transactions, application server clustering, DB replication, caching/in-memory data grid
example: eBay (note: eBay may not be a good example of this class because it does not use 2PC transactions), many banks and securities companies
technology: JSP, web frameworks, Java EE application server, messaging middleware, commercial relational DB

4. MS suite
features: N-tier architecture, partitioning, caching
example: MySpace
technology: ASP.NET, SQL Server, Windows Server
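
Two techniques recur across the classes above: partitioning and caching. As a rough illustration only, here is a minimal sketch in Python. It is not taken from any of the companies listed. The class names ShardedStore and CachedStore, the MD5-based shard choice, and the in-memory dicts standing in for database servers and a cache tier are all assumptions made for this example.

```python
import hashlib


class ShardedStore:
    """Toy key-value store split across N in-memory shards.

    Each dict is a hypothetical stand-in for one database server.
    """

    def __init__(self, num_shards):
        self.shards = [dict() for _ in range(num_shards)]

    def shard_for(self, key):
        # Hash the key so the same key always maps to the same shard.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.shards)

    def put(self, key, value):
        self.shards[self.shard_for(key)][key] = value

    def get(self, key):
        return self.shards[self.shard_for(key)].get(key)


class CachedStore:
    """Cache-aside wrapper: read from the cache first, fall back to the store."""

    def __init__(self, store):
        self.store = store
        self.cache = {}  # hypothetical stand-in for a cache tier
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        value = self.store.get(key)
        if value is not None:
            self.cache[key] = value  # populate the cache for later reads
        return value

    def put(self, key, value):
        self.store.put(key, value)
        self.cache.pop(key, None)  # invalidate so the next read is fresh


store = CachedStore(ShardedStore(num_shards=4))
store.put("user:1", "alice")
first = store.get("user:1")   # miss: loaded from the shard, then cached
second = store.get("user:1")  # hit: served from the cache
```

A simple modulo scheme like this remaps most keys whenever the number of shards changes, which is one reason real deployments often prefer consistent hashing or a lookup directory.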

It is clear that the first two classes of architecture draw much attention these days, partly because adoption of open source software is accelerating. In contrast, commercial solutions are more likely to be adopted by large companies that can afford to spend heavily. Each class of architecture can solve the scaling problem in one way or another, but it is hard to estimate how cost-effective each one is. On one hand, homegrown solutions may solve the problem more effectively and provide more flexibility, but they may take more effort to build. On the other hand, commercial solutions may solve the problem with equivalent efficiency, but they cost more money. The key is that the architecture must be extensible for new functional requirements and scalable for increasing user traffic.
