Monday, March 30, 2009

the lessons of scaling

In the previous post, I made a simple classification of current website architectures according to their technology features. So what can we learn from these architectures? Before any elaboration, I will point out that most of these principles are about scaling out rather than scaling up, because scaling up just doesn't do well; commodity PC clusters, in contrast, are entering the mainstream. Here are my notes.

1. partition

Partitioning seems to be a concept more familiar to DBAs than to developers: almost any serious database supports table partitioning, fully or partially, which means a big table can be sliced into multiple small, manageable ones. Sharding goes one step further; we can call it a cross-database partitioned table, or horizontal partitioning. Sharding is almost the standard approach adopted by today's web 2.0 companies. But note that sharding is not an out-of-the-box component provided by database vendors. Instead, anyone who wants to use it may end up building a sharding solution specific to the use case. Although there are partial solutions like Hibernate Shards and MySQL Proxy out there, in most cases you have to customize or tweak them for your needs.
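To make this concrete, here is a minimal sketch of shard routing in Java. The ShardRouter class and the modulo-on-user-id strategy are my own illustrative assumptions, not any vendor's API; real shard maps are configurable and often use consistent hashing so shards can be added without remapping everything.

    import java.util.List;

    // Minimal shard router: picks a database URL from the user id.
    public class ShardRouter {
        private final List<String> shardUrls;

        public ShardRouter(List<String> shardUrls) {
            this.shardUrls = shardUrls;
        }

        // The same (non-negative) user id always maps to the same shard.
        public String shardFor(long userId) {
            int index = (int) (userId % shardUrls.size());
            return shardUrls.get(index);
        }
    }

All rows for user 42 then live on the shard returned by shardFor(42); the price is that cross-shard queries must be split and aggregated in the application.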

The essence of partitioning is divide and conquer, and the principle applies across the whole software stack. At the application layer, we can partition a monolithic system by function into independent units, or design such an architecture from the start: in SOA terms, well-defined, self-contained services. Because each unit has its own constraints and characteristics, each can be developed and optimized for performance and scalability independently. In a classic online shop, for example, there may be a sign-in service, an item view service, an item search service, an ordering service, and so on. Each has a different user-experience tolerance and a different IO profile, so we can implement each with a different strategy to meet its functional and nonfunctional needs. One important assumption of functional partitioning is that the system is stateless; in other words, each service is self-contained, so we keep server-side HTTP sessions and stateful session beans to a minimum.
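For illustration, a stateless service boundary might look like the following sketch (the names are made up): the caller passes the full context as arguments and durable state lives in the database, so any instance behind the load balancer can serve any request.

    // Illustrative sketch of a self-contained, stateless service:
    // no HTTP session or stateful session bean is consulted.
    public interface OrderService {
        // All context arrives in the call; returns the new order's id.
        long placeOrder(long userId, long itemId, int quantity);
    }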

2. caching

Caching is a well-known hammer for cracking performance problems, and it can be applied in many places. Databases have their own query result caches: if your server has abundant memory, allocating more of it to the query result cache can have a significant impact on response time. Beyond the database, developers are most familiar with caching at the application layer. There are many caching solutions for popular web programming languages; the most notable is memcached, which is almost standard configuration at web 2.0 companies.
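The usual pattern with memcached is cache-aside: check the cache first, fall back to the database on a miss, then populate the cache. Here is a minimal sketch in Java, in which a ConcurrentHashMap stands in for the memcached client and loadUserFromDb is a hypothetical loader, so take it as the shape of the pattern rather than a real integration.

    import java.util.concurrent.ConcurrentHashMap;

    // Cache-aside sketch; the map stands in for a memcached client.
    public class UserCache {
        private final ConcurrentHashMap<Long, String> cache =
                new ConcurrentHashMap<Long, String>();

        public String getUser(long userId) {
            String user = cache.get(userId);      // 1. try the cache
            if (user == null) {
                user = loadUserFromDb(userId);    // 2. miss: query the database
                cache.put(userId, user);          // 3. populate for next time
            }
            return user;
        }

        // Hypothetical loader; in reality a SQL query against the users table.
        private String loadUserFromDb(long userId) {
            return "user-" + userId;
        }
    }

With a real memcached client the put would also carry an expiry time, and invalidating entries when the underlying data changes is the hard part.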

3. avoid distributed transactions

This one is the most provocative and controversial. It has been regarded as an important principle in eBay's scalable architecture, but it is not a new idea at all: the logic behind it was proposed by Eric Brewer around 2000 as the CAP theorem, also called Brewer's conjecture, and was later proved by Gilbert and Lynch in 2002, so we can make architectural decisions based on it. For most of today's web services, availability and partition tolerance are fixed requirements, so we have to sacrifice strong consistency for availability. That usually means the ACID properties provided by a relational database are no longer available; instead we end up with a different model, BASE, which settles for eventual consistency. Eventual consistency does not come for free, though: we have to introduce compensation and correction mechanisms to converge on a consistent state, and the concrete implementation strategy can be tricky.
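As one concrete shape of compensation, consider a transfer across two shards: run a local transaction on each side and record a correction when the second step fails, instead of holding a distributed 2PC lock. Every name in this sketch (debit, credit, recordCompensation) is hypothetical.

    // Sketch: compensation instead of a distributed transaction.
    public class TransferService {

        public void transfer(long fromAccount, long toAccount, long amount) {
            debit(fromAccount, amount);        // local transaction on shard A
            try {
                credit(toAccount, amount);     // local transaction on shard B
            } catch (Exception e) {
                // No global rollback exists; record a compensating credit
                // so a background job can restore the debited amount.
                recordCompensation(fromAccount, amount);
            }
        }

        private void debit(long account, long amount) { /* local SQL update */ }
        private void credit(long account, long amount) { /* local SQL update */ }
        private void recordCompensation(long account, long amount) { /* insert into a fix-up table */ }
    }

The system is inconsistent for a window of time but converges eventually; that is exactly the BASE trade.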

4. asynchronous processing

Another proven approach to scaling is to identify time-insensitive processes and run them asynchronously. The point is to decouple one process from the others so that each stays simple, scales independently, and, most importantly, does not block the rest. The asynchronous principle is sometimes called the event-driven model, and indeed asynchronous processing always involves some kind of notification about the result. Messaging middleware is widely used for this purpose; order processing, billing, and BI all fall in this spectrum. The queue mechanism has been touted as the weapon for XTP (Extreme Transaction Processing): when a high-volume load is queued for later processing, the system scales. But beware: the pressure is really just transferred from the application to the messaging middleware, so if the middleware itself can't scale, you are in for a nightmare.
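Here is a toy version of the queue idea, using an in-process BlockingQueue in place of real messaging middleware; the order ids and the process method are invented for the sketch.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // In-process stand-in for a message queue: the request thread enqueues
    // and returns immediately; a background worker drains the queue later.
    public class OrderQueue {
        private final BlockingQueue<String> queue =
                new LinkedBlockingQueue<String>();

        // Called on the request path: a cheap, non-blocking hand-off.
        public void submit(String orderId) {
            queue.offer(orderId);
        }

        public void startWorker() {
            Thread worker = new Thread(new Runnable() {
                public void run() {
                    try {
                        while (true) {
                            String orderId = queue.take(); // blocks until work arrives
                            process(orderId);              // billing, fulfilment, BI...
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
            worker.setDaemon(true);
            worker.start();
        }

        private void process(String orderId) { /* the slow, time-insensitive work */ }
    }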

In addition, some operating systems and languages provide asynchronous abstractions. At the OS level, asynchronous IO has been developed on Windows (IO completion ports) and Linux (AIO) for better scalability. Some language libraries provide such abstractions too: Java's concurrency package offers the future concept, and the Boost library ships Asio, a network programming module that implements the Proactor pattern. Event-based IO on Linux, such as epoll, has prevailed with the wide adoption of web servers like lighttpd. In the context of network programming, however, threads versus events remains an open question; two provocative papers worth a read are "Why Events Are a Bad Idea" and "Why Threads Are a Bad Idea". Don't be confused by the titles; read them in context.
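For instance, the future concept from java.util.concurrent in its simplest form; the sleeping task is a stand-in for slow IO or computation.

    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class FutureDemo {
        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(4);

            // submit() returns immediately with a Future for the pending result.
            Future<Integer> result = pool.submit(new Callable<Integer>() {
                public Integer call() throws Exception {
                    Thread.sleep(1000);   // stand-in for slow IO or computation
                    return 42;
                }
            });

            // ...the caller can overlap other work here, unblocked...

            System.out.println(result.get()); // blocks only when the value is needed
            pool.shutdown();
        }
    }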

5. failure-oriented design

The eight fallacies of distributed computing have been well known in the distributed systems field for years. All of them are false assumptions we are likely to make when designing a distributed system, and although some seem naive to today's architects, most still apply. In other words, when a system gets distributed, things get ugly, and we have to be prepared to deal with them. A good start is to list every component of the system, assume each will fail in some cases, and figure out how to handle those failures. The worse the cases you think of and prepare for, the more stable the system will be. There is plenty of research and practice on the issue: hardware redundancy, software instance replication for availability, automatic failover, backup and recovery, data replication for fault tolerance, and so on. Among programming languages, Erlang did a really good job here: it was designed with "failure is everywhere" in mind, and each process can have a monitor process that watches whether it is healthy and restarts it on failure. There remains a "who monitors the monitor" problem, though, which is why we can only build systems with several nines of availability, never a guarantee.
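Erlang gives you that supervision for free; in Java the idea can be approximated with a watchdog thread, as in this rough sketch. The restart policy is deliberately naive (real supervisors also limit restart rates), and all the names are invented.

    // A crude supervisor in the Erlang spirit: watch a worker and
    // restart it when it dies.
    public class Supervisor {

        public static void supervise(final Runnable work) {
            Thread watchdog = new Thread(new Runnable() {
                public void run() {
                    while (true) {
                        Thread worker = new Thread(work);
                        worker.start();
                        try {
                            worker.join();  // returns when the worker exits or crashes
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                            return;
                        }
                        System.err.println("worker died, restarting");
                    }
                }
            });
            watchdog.setDaemon(true);
            watchdog.start();
            // ...and who watches the watchdog? Exactly the problem noted above.
        }
    }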

6. virtualization

Virtual machines are well understood for testing purposes: it is easy to install several virtual machine images on one physical machine and test programs against different OS environments. But today virtualization is being leveraged to build large-scale cloud computing infrastructure, because it provides a better abstraction of computing power: users can consume it instantly without worrying about networking, power, or disk failures. That is the essence of utility computing as well, and enterprises get a better utilization rate out of their computing resources. For now cloud computing is offered by IT giants like Amazon, but it will be interesting to see how cloud infrastructure fits into the enterprise scope.

Note: I will update this post as new ideas come up.

references:

1. http://highscalability.com
2. http://queue.acm.org/detail.cfm?id=1394128
3. http://queue.acm.org/detail.cfm?id=1466448
4. http://www.infoq.com/articles/ebay-scalability-best-practices
5. http://www.manageability.org/blog/stuff/about-ebays-architecture
6. http://www.manageability.org/blog/stuff/cache-tier-architecture
7. http://www.ccs.neu.edu/groups/IEEE/ind-acad/brewer/sld009.htm
8. http://www.atomikos.com/downloads/articles/TransactionsForXTP-WhitePaper.pdf

Sunday, March 29, 2009

the challenge of scale

After the dotcom bubble burst, we gradually got a new one: web 2.0. This time, though, it is more fun. From a brief history of web 2.0, we can see that the birth of Google marked the infancy of this new age of the Internet. Its most notable feature is collective wisdom; call it the long tail, large-scale collaboration, or whatever you like, the point is that users now play the leading role on the stage. So what does this trend imply for technology? Users mean page views and site traffic, and beyond that, scale, in both traffic and data. How do we deal with it? The question is the same everywhere; the answer varies from one company to another. highscalability.com has made a great contribution by helping the community learn from each other.

Since Google published some papers on its secret weapons, many companies have disclosed their technology architectures and shared their experience in a variety of talks. Here is a simple classification of these architectures:

1. cloud computing
features: homegrown solutions built from scratch for large-scale data processing; a distributed, fault-tolerant, highly available file system; a distributed schemaless database/document store; a computing grid/distributed job scheduler
example: Google, Amazon
technology: GFS, Bigtable, MapReduce, Chubby, Dynamo, EC2, S3, SimpleDB

2. LAMP
features: a customized LAMP stack, some homegrown solutions, some clones of class 1
example: Yahoo, LiveJournal, YouTube, Flickr, Facebook
technology: Linux, LVS, Apache, MySQL, PHP, Squid, memcached, MogileFS, Perlbal, DJabberd, TheSchwartz, Spread, Hadoop, HBase, ZooKeeper, Hypertable

3. Java EE
features: classic N-tier architecture, 2PC transactions, application server clustering, DB replication, caching/in-memory data grid
example: eBay (note: eBay may not be a good example of this class, because eBay doesn't use 2PC transactions), many banks and securities companies
technology: JSP, web frameworks, JEE application servers, messaging middleware, commercial relational databases

4. MS suite
features: N-tier architecture, partitioning, caching
example: MySpace
technology: ASP.NET, SQL Server, Windows Server

It is clear that the first two classes draw the most attention these days, partly because open source software has seen accelerating adoption. Commercial solutions, by contrast, are more likely to be adopted by tycoons who can throw money at everything. Each class of architecture may solve the scaling problem in one way or another, but it is hard to estimate how cost-effective each might be. On one hand, homegrown solutions may solve the problem more effectively and offer more flexibility, but they take more effort to build. On the other hand, a commercial solution may solve the problem with equivalent efficiency, but it certainly costs more money. The key is that the architecture must be extensible for new functional requirements and scalable for growing user traffic.