Tuesday, January 27, 2009

Caching on the web

When you build a web application intended to run at large scale, there are two separate performance issues when it comes to generating the HTML the server spits out.

One is server load. How many visitors can the server cope with?

The other thing is response time as seen by each visitor. How long must I wait for the page to load?

When using a relatively slow dynamic web framework (such as Python + Django), the issue with server load is mostly related to CPU time. If you read the Django docs, the recommended solution is caching, i.e. instead of generating the HTML on each hit, we generate it once, store it for a while, and hand out the stored snippet until it times out. It's a simple idea that works incredibly well.
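To make the idea concrete, here is a minimal sketch of generate-once, store-for-a-while caching in plain Python. It is not Django's cache API; the class name TimedCache, the method get_or_generate and the injectable clock are made up for illustration.

```python
import time


class TimedCache:
    """Stores generated HTML snippets and reuses them until they time out."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout = timeout_seconds
        self.clock = clock      # injectable so the timeout can be tested
        self.entries = {}       # key -> (expiry time, html)

    def get_or_generate(self, key, generate):
        now = self.clock()
        entry = self.entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]     # still fresh: hand out the stored copy
        html = generate()       # missing or timed out: do the slow work once
        self.entries[key] = (now + self.timeout, html)
        return html
```

With a 60 second timeout, a page that is hit a thousand times a minute is generated roughly once a minute instead of a thousand times.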

The first question is where we do the caching. To reduce the workload as much as possible, we want it to happen as close to the browser as possible. For a low-cost solution that works for everyone, however, we need to stick to the same physical location as the web server.

With Django on Apache, you've got some relatively fast C code serializing the requests to separate Python processes running the Django code. The process model limits the concurrency level possible because each process gobbles up a fair amount of memory, and Python is just generally an order of magnitude slower than C. So while putting the cache in the web framework is arguably very convenient, it's also pretty slow. In my experiments, by one or two orders of magnitude.

Ideally, we'd just run a simple process written with speed in mind in front of Apache to do the caching. After all, the idea is relatively straightforward. Enter Varnish, a web cache written in C.

With Varnish in place, you know that the server only has to generate a page once per timeout. Some of the pages on YayArt take about 1 second to generate because they need to do multiple complex database queries. Let's say we have a beefy server that can do 10 of these in parallel. Then we could support 10 requests/s. With caching in Varnish, we can scale up to whatever it takes Varnish to retrieve a string from its cache and serve it. You can easily reach 6000 requests/s. For comparison, 100 requests/s is 8.6 million hits/day.
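The arithmetic behind these numbers can be checked with a few lines of Python. This is only a back-of-the-envelope sketch; the function names are made up for illustration, and it assumes the load is spread evenly over the day.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def uncached_requests_per_second(parallel_workers, seconds_per_page):
    # Each worker finishes one page every seconds_per_page seconds.
    return parallel_workers / seconds_per_page


def hits_per_day(requests_per_second):
    # Sustained rate multiplied by the number of seconds in a day.
    return requests_per_second * SECONDS_PER_DAY


print(uncached_requests_per_second(10, 1.0))  # 10.0
print(hits_per_day(100))                      # 8640000
print(hits_per_day(6000))                     # 518400000
```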

However, as it turns out, while simple whole-page caching can solve a large part of the server load problem in one blow without giving up the convenience of a dynamic web framework, it's not necessarily the solution to the response time problem. If the page is in the cache, the response time is close to perfect. But what if it's a slow day with few visitors and it's not?

The general solution is to precompute the answer. Unfortunately, how to do this is more application-specific.

You might be wondering how to combine caching with personalized pages. On YayArt I use JavaScript to integrate the per-user data into the page through AJAX. So the per-user server load is reduced to processing the actual per-user data instead of a whole page.
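As a rough server-side sketch of that split (not the actual YayArt code; the function names, the user-box element and the fields are all hypothetical): the page itself contains nothing user-specific, so it can be cached whole, and the per-user part is a small JSON payload that the browser fetches separately.

```python
import json


def render_shared_page():
    # Identical for every visitor, so a cache in front of the server can store it.
    return '<html><body><div id="user-box"></div><h1>Gallery</h1></body></html>'


def per_user_json(username, unread_messages):
    # Small per-user payload; the browser requests it after the page loads
    # and fills in the placeholder element with JavaScript.
    return json.dumps({"username": username, "unread_messages": unread_messages})
```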

Sunday, January 25, 2009

Market economy

For Christmas, my parents bought me a 3-month subscription to Information, a small Danish newspaper with a focus on analysis and criticism of the daily politics. The audience is mostly intellectual (and also mostly left-wing).

Information is interesting for several reasons, one of them being that major politicians in Denmark, the people who actually make the rules for the rest of us, regularly post in the newspaper, and thus presumably also read it. For an old internet-addict, that's a whole new experience.

There's been a lot of talk lately about capitalism in the face of the current financial crisis. The thing is that over the past 20 years, almost everyone in the political landscape has been pursuing a the-more-market-economy-and-deregulation-the-better strategy to some degree. And now it turns out that too few rules for the gamblers in the financial sector have put us into a global recession. So people are, again, beginning to question whether capitalism is such a hot idea after all.

When I went to high school, I always thought it was possible to do better. Because there is such an obvious waste in our market economy. For example, go to your nearest supermarket and look at the shelves with shampoo and hair products. Or the shelves with soft drinks. Or the shelves with morning cereals for children. And ask yourself, how much value do all these colourful and overly expensive things bring to our society? Or take the really classy ads that never mention any factual qualities of their products, but instead try to instill in us the irrational idea that the products, because of the packaging so to speak, bring improvements to our lives.

I ask you, how can drinking water with added sugar and various brown chemicals make you cool? Because a large company has invested enormous efforts in convincing everyone that it is so.

Surely reason can do better than that.

However, at university I spent a couple of years working with distributed systems. The most important lesson I learned is that centralized systems are bad, unless the scale is very small. Decentralized (peer-to-peer) systems are more adaptive to change, more robust and much more efficient - several orders of magnitude as the system scales. Yes, there's waste, redundancy, suboptimal behaviour.

But it's self-organizing. If I want something from a peer, I just ask - I don't need to contact a central authority which then has to decide how to respond to the change. Overall it just works incredibly much better.

I think that the same is true of society. With a market economy, decision-making is decentralized, in spite of the tendency for old industries to advance towards monopolies. At large scale, this decentralization is unbelievably effective compared to centralized control because of the complexity involved.

It's also a lot more free than a democracy. In a democracy, 55% can decide for the remaining 45% (that's why we haven't built any new windmills in Denmark lately). With a market economy, everyone decides for himself. Within the limits of the market, of course.

However, not all is good. One of the more peculiar problems of a market economy is the drive towards monopolies, i.e. a collapse of the market (read a treatise on Karl Marx's works to understand why). Like any game we set up, it needs rules.

And the market cannot do long-term thinking. As we have just seen, it can happily drive over the cliff edge because of a phenomenon called the tragedy of the commons. It doesn't hurt me a lot if I exploit the system: I get the whole benefit, while the downside is shared between everyone.

The tragedy is that even if some market players want to stop, they face the competition from the others. If price is the only factor, a short-term thinking competitor can drive the others out of the market. So they are forced to follow, unless a force beyond the market sets down rules that cannot be ignored. The same reasoning goes for unethical behaviour.

So the issue here is how we change the rules of the game to ameliorate the bad things without throwing the basic idea out of the window. For in spite of the waste, it's working better than the alternatives. Reason is bounded; it's not enough to deal satisfactorily with the needs of millions of humans.

Friday, January 23, 2009

Converting MySQL to UTF-8 the easy way

This is just a quick note to myself.

So you're running on the latest version of INSERT NAME OF FANCY WEB FRAMEWORK HERE? Think character set problems are a relic of the past? Not so with MySQL. The default configuration uses Latin-1. When you install MySQL, the first thing you should do is ensure MySQL is using UTF-8. This problem will go away at some point when distributions change the defaults, but until then you have to take care of it yourself.

Meanwhile, if you're like me, you might have created some tables before discovering the problem. It is, after all, difficult to see before you create the tables. So here's a recipe for converting a whole database (idea stolen from Wordpress).

First type in these commands, replacing mydb with the name of your database:

USE information_schema;
-- Generate one ALTER per char/text column. MODIFY restates the whole column definition, so attributes not listed here (NOT NULL, DEFAULT) are not carried over.
SELECT CONCAT('ALTER TABLE ', table_name, ' MODIFY ', column_name, ' ', column_type, ' CHARACTER SET utf8;') FROM columns WHERE table_schema = 'mydb' AND data_type LIKE '%char%';
SELECT CONCAT('ALTER TABLE ', table_name, ' MODIFY ', column_name, ' ', column_type, ' CHARACTER SET utf8;') FROM columns WHERE table_schema = 'mydb' AND data_type LIKE '%text%';
-- Generate one ALTER per table to change the table's default character set.
SELECT CONCAT('ALTER TABLE ', table_name, ' CHARACTER SET utf8;') FROM tables WHERE table_schema = 'mydb';

This should output the commands you need to feed into MySQL to do the change. If you start the MySQL shell with -s, it's easier to copy-paste. Then type

USE mydb;
ALTER DATABASE mydb CHARACTER SET utf8;
[... pasted commands ...]

The problem here is that the character set is stored on multiple levels: at the column, table and database level. The second line fixes the character set of the database, and the pasted-in commands fix the columns and tables.
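The three levels can also be spelled out as a small Python sketch that builds the same kind of statements from a list of tables and text columns. This is only an illustration: the function name and the example table are made up, identifiers are not quoted or escaped, and the real recipe reads the column list from information_schema as shown above.

```python
def build_utf8_statements(db_name, tables, text_columns):
    """Return ALTER statements for the database, its text columns and its tables.

    tables: list of table names
    text_columns: list of (table_name, column_name, column_type) tuples
    """
    # Database level: the default for tables created later.
    statements = ["ALTER DATABASE %s CHARACTER SET utf8;" % db_name]
    # Column level: each existing char/text column is converted explicitly.
    for table, column, column_type in text_columns:
        statements.append(
            "ALTER TABLE %s MODIFY %s %s CHARACTER SET utf8;"
            % (table, column, column_type)
        )
    # Table level: the default for columns added later.
    for table in tables:
        statements.append("ALTER TABLE %s CHARACTER SET utf8;" % table)
    return statements


for statement in build_utf8_statements(
    "mydb", ["blog_post"], [("blog_post", "title", "varchar(200)")]
):
    print(statement)
```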

This works for Django. If you have used a framework that allows you to put UTF-8 characters into the Latin-1 columns, you need to do something else. The Wordpress link has the details.