Wednesday, April 22, 2009

operation dimension of system architecture

In software architecture, a given system usually has several stakeholders, and each of them might have different architectural requirements. The product department often submits functional requirements. The operations department often submits system management or monitoring requirements. The accounting department may submit billing requirements. And in some cases the system has its own inherent non-functional requirements such as performance, availability and other SLA guarantees. In short, a system architecture always involves quite a lot of dimensions, and we have to think about all of them to get a full picture of the system. However, developers are often myopic and rarely think about the other dimensions. Yet once the system rolls out, developers have to work closely with operations people to get feedback about the production system. If developers are not well prepared, they may end up getting nothing. Even worse, they may get entangled in operations work themselves. Here are some points developers can consider and prepare for in advance.

The first question is: how do we get the status of the production system?

The common approach is to log extensively in the system itself and send a notification email when things become abnormal. Simple! But it doesn't work when the system is down. Another disadvantage is that application-level logging only covers the system itself. What about a machine powering off, a disk failure or a network outage?
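The log-and-notify approach can be sketched roughly as follows. This is only an illustration: `AlertHandler`, `build_logger` and the `notify` callback are hypothetical names, and the callback stands in for whatever actually sends the email (Python's standard library also ships `logging.handlers.SMTPHandler` for that purpose).

```python
import logging


class AlertHandler(logging.Handler):
    """Hypothetical handler: forwards ERROR-and-above records to a notifier."""

    def __init__(self, notify, level=logging.ERROR):
        super().__init__(level)
        self.notify = notify  # assumed callable, e.g. a function that sends mail

    def emit(self, record):
        try:
            self.notify(self.format(record))
        except Exception:
            # Never let a failing notifier break the application itself.
            self.handleError(record)


def build_logger(name, notify):
    """Log everything from INFO up to stderr, and alert on errors."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(logging.StreamHandler())
    logger.addHandler(AlertHandler(notify))
    return logger
```

Note that everything here runs inside the application process, which is exactly the weakness described above: if the process or the machine is gone, no alert is ever sent.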

So we should have an independent, fully functional health management system. Usually this system is maintained by the operations department. That leaves a gap, both social and technical. The social one is that the two departments have to cooperate to make the system work. The technical one is how to make the existing health management system aware of the new system. It depends on both sides. The health management system should be extensible so that it can adapt to any kind of new system. Luckily, some full-featured monitoring systems qualify. And the new system itself should provide a health-checking interface that the health management system can call. So far so good. When the system goes wrong, the health management system gets the notification first. If the operations people can deal with it, developers can sleep well. Otherwise, developers will get busy.
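As a minimal sketch of such a health-checking interface, assuming the monitoring system polls over HTTP: the `/health` path, the port, and the two placeholder checks below are assumptions made for illustration, not something any particular monitoring product requires.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


# Hypothetical checks; a real system would probe its own dependencies
# (database, message queue, disk space, and so on).
def check_database():
    return True


def check_disk_space():
    return True


CHECKS = {"database": check_database, "disk_space": check_disk_space}


def run_checks(checks):
    """Run every check and return (overall_ok, per_check_results)."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises counts as a failure, not a crash.
            results[name] = False
    return all(results.values()), results


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        ok, results = run_checks(CHECKS)
        body = json.dumps({"healthy": ok, "checks": results}).encode("utf-8")
        # 200 when every check passes, 503 otherwise, so a poller
        # can decide from the status code alone.
        self.send_response(200 if ok else 503)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port=8080):
    """Build (but do not start) the health-check server."""
    return HTTPServer(("127.0.0.1", port), HealthHandler)
```

Calling `make_server().serve_forever()` would start it. The point of the design is that the check is pulled from outside: when the new system is down, the health management system notices because the call fails, which in-process logging cannot do.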

Another important point is that trust should be built between the operations department and the development department. Developers should add to their toolbox a lens for viewing the operations dimension. Likewise, system administrators should add to their toolbox a lens for viewing the development dimension, because a full understanding of the new system helps them monitor it more thoroughly.

The reason I care about the operations aspect of system architecture is that it is getting more and more important today. Service has been a buzzword for years: SOA, SaaS, PaaS, Web Services and so on. So how can we measure the quality of a service? With an SLA (Service Level Agreement), for example four nines of availability and a 1s response time. That's it. But how can we reach that SLA? It is closely related to operations. So keep an eye on it.
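To make those SLA numbers concrete, here is a small back-of-the-envelope sketch. The function names are made up for illustration, and the 365-day year and the idea of measuring availability from periodic health checks are assumptions.

```python
def allowed_downtime_minutes(availability, period_days=365):
    """Downtime budget implied by an availability target over a period."""
    return (1.0 - availability) * period_days * 24 * 60


def measured_availability(total_checks, failed_checks):
    """Availability estimated from periodic health-check results."""
    return (total_checks - failed_checks) / total_checks


def fraction_within(response_times_s, limit_s=1.0):
    """Share of requests that answered within the response-time limit."""
    within = sum(1 for t in response_times_s if t <= limit_s)
    return within / len(response_times_s)
```

Under these assumptions, four nines (`allowed_downtime_minutes(0.9999)`) leaves a budget of roughly 52.56 minutes of downtime per year, which is why detecting and handling failures quickly is an operations concern as much as a development one.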

UPDATE: Here is a good post on the same topic, though more specific: monitoring java system.
