Peter Wainwright, Pro Apache

Apache has been a stalwart of the Internet for some time. Not only is it well known as a web serving platform, but it also forms a key part of the LAMP (Linux-Apache-MySQL-Perl/Python/PHP) stack and is one of the best known open source projects. Getting an Apache installation right, though, can be tricky. In Pro Apache, Peter Wainwright hopes to help readers by using a task-based, rather than feature-based, approach. I spoke to Peter about Apache, its supported platforms, the competition from IIS and his approach to writing such a mammoth tome.

Inflammatory questions first – Unix or Windows for Apache?

Unix. To be more precise, BSD, then Linux, then almost anything else (e.g., commercial Unixes), then Windows — if you must.

The usual technical arguments and security statistics against using Windows are readily available from a number of sources, so let me give a rather different perspective: it seems Microsoft was in discussion to buy Claria, creators of Gator (one of the more annoying strains of adware that infest Windows desktops). Coincidentally, Microsoft’s beta ‘AntiSpyware’ tool recently downgraded Claria’s products from quarantine to ignore. It seems that the deal fell through, but for reasons of bad PR rather than any concern for the customer. Call me cynical if you like, but I see little reason to place my faith in a closed-source operating system when the vendor is apparently willing to compromise the security of its customers for its own business purposes. Yes, plenty of us already knew that, but this is an example even non-technical business managers can grasp.

Having said that, yes, there are reasons why you might be required or find it otherwise preferable to run Apache on a Windows server. For example, you might need to make use of a Windows-specific module or extension. Apache on Windows is perfectly viable – but given a free choice, go the open source route.

Do you prefer text-based configuration or GUI-based configuration tools?

Text-based every time. I don’t object to the use of a GUI outright, but if I can’t easily understand the generated configuration files by direct inspection afterwards, or can’t modify the configuration without upsetting the tool, I’ve just built a needless dependency on a tool when I would have been better off maintaining the text-based configuration directly. Using a tool is not a substitute for understanding the underlying configuration.

Too many administrators, I think, use the default configuration file without considering whether it might be better to create a much simpler and more maintainable configuration from scratch. I find an effective strategy for maintaining an Apache configuration is to divide it into several simple configuration files according to function – virtual hosting, access control, SSL, proxies, and so on – and then include them into one master configuration file. If you know what your website (or websites) will be doing, you can configure only those features. A simpler configuration, in turn, generally means fewer security issues to deal with.

The default configuration file, if I make use of it at all, becomes just one of the files included into the master configuration file that takes its place. Customisations go into their own file or files and override the defaults as necessary. This makes it very easy to see what configuration came pre-supplied and what was applied locally. It also makes it easy to update the default configuration as new releases of Apache come out, because there are no modifications in the file to carry across.
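As a rough illustration of the layout described above, a master configuration file might look something like the following sketch. The file names and directory layout are hypothetical and not taken from the book; only the Include directive itself is standard Apache.

```apache
# httpd.conf (master file); all file names below are illustrative assumptions

# The stock configuration shipped with Apache, left unmodified
Include conf/httpd-default.conf

# Local customisations, one file per function, read after the defaults
Include conf/local/vhosts.conf
Include conf/local/access.conf
Include conf/local/ssl.conf
Include conf/local/proxy.conf
```

Because the included files are processed in order, local settings placed after the default file can generally override it, although the exact merging behaviour depends on the directive involved.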

Can you suggest any quick ways to improve performance for a static site?

There are two main strategies for performance-tuning a server for the delivery of static content: finding ways to deliver the content as efficiently as possible, and not delivering the content at all, where possible. But before embarking on a long session of tweaking, first determine whether the load on the server or the available bandwidth is the bottleneck. There’s no point tuning the server if it’s the volume of data traffic that’s limiting performance.

Simple static content performance can be improved in Apache by employing tricks like memory-mapping static files or by caching file handles and employing the operating system’s sendfile mechanism (the same trick employed by kernel HTTP servers) to efficiently transfer static data to the client. Modules like Apache 1.3’s mod_mmap_static and Apache 2’s mod_file_cache make this easy to configure.
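A minimal sketch of these techniques for Apache 2.x, assuming mod_file_cache is loaded, might look like this. The file paths are hypothetical.

```apache
# Assumes Apache 2.x with mod_file_cache loaded; paths are illustrative only

# Map a frequently requested file into memory at server startup
MMapFile /var/www/html/index.html

# Keep an open file handle cached so the file can be sent via sendfile
CacheFile /var/www/html/images/logo.png

# Let Apache use the operating system's sendfile support where available
EnableSendfile On
```

Files listed this way are read when the server starts, so the server needs to be restarted whenever one of them changes on disk.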

At the platform level, many operating systems provide features and defaults out of the box that are not useful for a dedicated webserver. Removing these can benefit performance at no cost and often improve security at the same time. For instance, always shut down the mail service if the server handles no mail. Other server performance improvements can be gained by reducing the amount of information written to log files, or disabling them entirely, or disabling last access-time updates (the noatime mount option for most Unix filesystems).

If the limiting factor is bandwidth, look to trade machine resources for reduced throughput with strategies like compressing server responses with mod_gzip. Also consider the simple but often-overlooked trick of reducing the byte size of the images that Apache is serving (which compression generally won’t help with).
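As a hedged sketch of response compression, the fragment below uses mod_deflate, the compression module bundled with Apache 2, rather than the mod_gzip module named in the interview. That substitution is an assumption made for illustration, as is the choice of content types.

```apache
# Assumes Apache 2.x with mod_deflate loaded (a stand-in for mod_gzip here)

# Compress text responses only; image formats such as JPEG and PNG are
# already compressed and gain little from a second pass
AddOutputFilterByType DEFLATE text/html text/plain text/css
```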

Arranging not to deliver the content can actually be easier, and this reduces both server loading and bandwidth usage. Decide how often the static content will change over time, then configure caching and expiration headers with mod_cache (mod_proxy for Apache 1.3) and mod_expires, so that downstream proxies will deliver content instead of the server as often as possible.

To really understand how to do this well, there is no substitute for an understanding of HTTP and the features that it provides. RFC2616, which defines HTTP 1.1, is concise and actually quite readable as RFCs go, so I recommend that all web server administrators have a copy on hand (get it from www.w3.org/Protocols/HTTP/1.1/rfc2616.pdf). That said, it is easy to set expiry criteria for different classes of data and different parts of a site even without a firm understanding of the machinery that makes it work. Doing so will enable the site to offload content delivery to proxies wherever possible. For example, tell proxies that all (or most) of the site’s images are static and can be cached, but the text can change and should never be cached. It may happen that most of the text is also static, but since images are generally far larger, marking them as static provides immediate benefits with a very small amount of configuration.
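The image-versus-text policy described above could be sketched with mod_expires roughly as follows. The one-month lifetime and the list of content types are illustrative assumptions, not recommendations from the book.

```apache
# Assumes mod_expires is loaded; lifetimes are illustrative only
ExpiresActive On

# Images rarely change, so let proxies and browsers keep them for a month
ExpiresByType image/gif  "access plus 1 month"
ExpiresByType image/jpeg "access plus 1 month"
ExpiresByType image/png  "access plus 1 month"

# Text may change at any time, so mark it as expiring immediately
ExpiresByType text/html "access plus 0 seconds"
```

An immediate expiry tells caches to revalidate the text on each request rather than strictly forbidding storage. A site that needs the stronger guarantee would have to send an explicit Cache-Control header as well.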

Security is a key issue. What are the main issues to consider with Apache?

Effective security starts with describing the desired services and behaviour of the server (which means both Apache and the hardware it is running on). Once you know that, it is much easier to control what you don’t want the server to do. It’s hard to protect a server from unwanted attention when you don’t have a clear idea of what kinds of attention are wanted.

I find it useful to consider security from two standpoints, which are also reflected in the book by having separate chapters. First is securing Apache itself. This includes not only the security-specific modules that implement the desired security policies of the server, but also the various Apache features and directives that have (sometimes non-intuitive) security implications. By knowing what features are required, you can remove the modules you don’t need.

Second, but no less important, is securing the server that Apache is running on. The security checklist in Pro Apache attempts to address the main issues with server security in a reasonably concise way, to give administrators something to start from and get them thinking in the right direction. One that’s worth highlighting is ‘Have an Effective Backup and Restore Process’ — it’s vital to know how to get your server back to a known state after a break-in, and being able to do so quickly will also stand you in good stead if a calamity entirely unrelated to security occurs, like a hard disc failure or the server catching fire (this actually happened to me). The ssh and rsync tools are very effective for making secure network backups and restores. They are readily available and already installed on most Unixes, so there’s no reason not to have this angle covered.

With the increased use of dynamic sites using PHP and Perl, how important and useful are features like SSIs and rewriting that are built into Apache?

When designing a web application, use the right tool for each part of the job. Apache is good at handling connectivity and HTTP-level operations, so abstract these details from the application as far as possible. Rewriting URLs, which is simply one of many kinds of request mapping, is just one aspect of this. Similarly, don’t make a web application handle all its own security. Use Apache to handle security up front as much as possible, because it is expert at that, and if used properly will prevent insecure or malicious requests from reaching the application. Unfortunately, rather too many web application developers don’t really understand web protocols like HTTP and so build logic into the application that properly belongs in the server. That makes it more likely that a malicious request can find a weakness in the application and exploit it. It also means the application designers are not making use of Apache to its fullest potential.

Bear in mind that it is possible, with scripting modules like mod_perl, to plug handlers into different parts of the request-response cycle. Clever use of this ability allows a flexible modular design that is easier to adapt and less likely to create hidden security issues. Apache 2 also provides new and interesting ways to construct web applications in a modular fashion using filters. These features are very powerful, so don’t be afraid to exploit them.

I’ll admit to a fondness for Server Side Includes (SSIs). Even though they have been largely superseded by more advanced technologies, they are easy to use and allow for simple templating of static and dynamic content. Apache’s mod_include also knows how to intelligently cache static includes, so SSI-based pages are a lot faster than their basic mechanic would suggest, and without requiring any complex configuration. They’re a good choice for sites that have a lot of static content and need to incorporate a few dynamic elements.

Apache is facing an increasing amount of competition from Microsoft’s IIS, especially with the improvements in IIS 6.0. Ignoring the cost implications, what are the main benefits of Apache over IIS?

Trust. One of the reasons that Apache is a reliable, secure, and high-performance web server is that the Apache developers have those qualities as end objectives. They’re not trying to sell you something. Having total flexibility to add or remove features, or inspect and modify the code if necessary, is almost a bonus by comparison.

On a more technical note, an Apache-based solution is of course readily portable to other platforms, which ties into the choice of platform we started out with. Although there are always exceptions, if you think there’s a feature that IIS provides that Apache cannot — bearing in mind you can always run Apache on Windows — chances are you haven’t looked hard enough.

Pro Apache is a mammoth title – where do you start with something as complex as Apache?

Too many books on computing subjects tend to orient themselves around the features of a language or application, rather than the problems that people actually face, which is not much help if you don’t already have some idea what the answer is in order to look it up. I try hard in Pro Apache to start with the problems, and then illustrate the various directives and configuration possibilities in terms of different solutions to those problems.

Even though there are a bewildering number of directives available, many of them are complementary, or alternatives to each other, or are different implementations of the same basic idea. For example, take the various aliasing and redirection directives, all of which are essentially variations on the same basic theme even if they come from different modules (chiefly, but not exclusively, mod_alias and mod_rewrite). Understanding how different configuration choices relate to each other makes it easier to understand how to actually use them to solve problems in general terms. A list of recipes doesn’t provide the reader with the ability to adapt solutions to fit their own particular circumstances.

I also try to present several different solutions to the same problem in the same place, or where that wasn’t practical, provide pointers to alternative or complementary approaches in other chapters. There’s usually more than one way to achieve a given result, and it is pretty unlikely, for example, that an administrator trying to control access through directives like BrowserMatch and RewriteRule will discover that SSLRequire is actually a general-purpose access control directive that could be the perfect solution to their problem. (SSLRequire is my favourite ‘secret’ directive, because no one thinks to find a directive for arbitrary access control in an SSL module.)
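To give a flavour of SSLRequire as a general-purpose access control, here is a minimal sketch that assumes mod_ssl is loaded. The directory path, working hours, and address range are all hypothetical.

```apache
# Assumes mod_ssl is loaded; the path, hours and network are illustrative

<Directory /var/www/html/intranet>
    # Allow access on weekdays during working hours,
    # or at any time from the local network
    SSLRequire (     %{TIME_WDAY} >= 1 and %{TIME_WDAY} <= 5   \
                 and %{TIME_HOUR} >= 8 and %{TIME_HOUR} <= 18 ) \
               or %{REMOTE_ADDR} =~ m/^192\.168\.1\.[0-9]+$/
</Directory>
```

Nothing in this expression refers to SSL itself, which is the point being made above. Later Apache releases (2.4 onwards) deprecate SSLRequire in favour of Require expr, so this form applies to the 1.3 and 2.0 era servers discussed here.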

Since many administrators are still happily using Apache 1.3, or have yet to migrate, the updates made to the first edition of Pro Apache (then called Professional Apache and published by Wrox) to cover Apache 2.0 do not separate coverage of the 1.3 and 2.X releases except where they genuinely diverge. The two versions are vastly more similar than they are different – at least from the point of view of an administrator – and in order to be able to migrate a configuration or understand the impact of attempting to do so, it was important to keep descriptions of the differences between the two servers tightly focused. To do this, coverage of the same feature under 1.3 and 2.X is presented on the same page wherever possible.

It seems unlikely considering the quality of the content, but was there anything you would have liked to include in the book but couldn’t squeeze in?

With a tool as flexible as Apache, there are always more problems to solve and ways to solve them than there is space to cover, but for the most part I am very happy with the coverage the book provides. Judging by the emails I have received, many people seem to agree. If there’s anything that would have been nice to cover, it would probably be some of the more useful and inventive of the many third-party modules. A few of the more important, like mod_perl, are covered by the last chapter, but there are so many creative uses to which Apache has been put that there will always be something there wasn’t the space or time to include.

What do you do to relax?

Strangely enough, even though I spend most of my working time at a computer, I’ve found that playing the odd computer game helps me wind down after a long day. I think it helps shut down the parts of my brain that are still trying to work by making them do something creative, but deliberately non-constructive. I recommend this strategy to others too, by the way; board games, or anything similar, work too.

To truly relax, I’ve found that the only truly effective technique is to go somewhere where I don’t have access to email, and determinedly avoid networks of any kind. I suspect this will cease to work as soon as mesh networks truly take hold, but for now it’s still the best option. It also helps that I have a wonderful, supportive wife.

What are you working on next?

Right now I’m gainfully employed and wielding a great deal of Perl at some interesting problems to do with software construction in the C and C++ arena. There’s been some suggestion that a book might be popular in this area, so I’m toying with that idea. I also maintain an involvement in commercial space activities, specifically space tourism, which has recently got a lot more popular in the public imagination (and about time too, some of us would say). That keeps me busy in several ways, the most obvious of which is the ongoing maintenance of the Space Future website at www.spacefuture.com.

Author Bio

Peter Wainwright is a developer and software engineer specializing in Perl, Apache, and other open-source projects. He got his first taste of programming on a BBC Micro and gained most of his early programming experience writing applications in C on Solaris. He then discovered Linux, shortly followed by Perl and Apache, and has been happily programming there ever since.

When he is not engaged in development or writing books, Wainwright spends much of his free time maintaining the Space Future website at www.spacefuture.com. He is an active proponent of commercial passenger space travel and cofounded Space Future Consulting, an international space tourism consultancy firm.