Joseph D Sloan, High Performance Linux Clusters

Getting the best performance today relies on deploying high performance clusters rather than single-unit supercomputers. Building clusters can be expensive, but using Linux is a cheaper alternative and makes it easier to develop and deploy software across the cluster. I interview Joseph D Sloan, author of High Performance Linux Clusters, about what makes a cluster, how Linux clusters compete with Grid and proprietary solutions, and how he got into clustering technology in the first place.

Clustering with Linux is a current hot topic - can you tell me a bit about how you got into the technology?

In graduate school in the 1980s I did a lot of computer-intensive modeling. I can recall one simulation that required 8 days of CPU time on what was then a state-of-the-art ($50K) workstation. So I’ve had a longtime interest in computer performance. In the early 1990s I shifted over to networking as my primary interest. Along the way I set up a networking laboratory. One day a student came in and asked about putting together a cluster. At that point I already had everything I needed. So I began building clusters.

The book covers a lot of material - I felt like the book was a complete guide, from design through to implementation of a cluster - is there anything you weren’t able to cover?

Lots! It’s my experience that you can write a book for beginners, for intermediate users, or advanced users. At times you may be able to span the needs of two of these groups. But it is a mistake to try to write for all three. This book was written to help folks build their first cluster. So I focused on the approach that I thought would be most useful for that audience.

First, there is a lot of clustering software that is available but that isn’t discussed in my book. I tried to pick the most basic and useful tools for someone starting out.

Second, when building your first cluster, there are things you don’t need to worry about right away. For example, while I provide a brief description of some benchmarking software along with URLs, the book does not provide a comprehensive description of how to run and interpret benchmarks. While benchmarks are great when comparing clusters, if you are building your first cluster, to what are you going to compare it? In general, most beginners are better off testing their cluster using the software they are actually going to use on the cluster. If the cluster is adequate, then there is little reason to run a benchmark. If not, benchmarks can help. But before you can interpret benchmarks, you’ll first need to know the characteristics of the software you are using: is it I/O intensive, CPU intensive, etc.? So I recommend looking at your software first.

What do you think the major contributing factor to the increase of clusters has been; better software or more accessible hardware?

Both. The ubiquitous PC made it possible. I really think a lot of first-time cluster builders start off looking at a pile of old PCs wondering what they can do with them. But, I think the availability of good software allowed clusters to really take off. Packages like OSCAR make the task much easier. An awful lot of folks have put in Herculean efforts creating the software we use with very little thought to personal gain. Anyone involved with clusters owes them a huge debt.

Grids are a hot topic at the moment, how do grids - particularly the larger toolkits like Globus and the Sun Grid Engine - fit into the world of clusters?

I’d describe them as the next evolutionary stage. They are certainly more complex and require a greater commitment, but they are evolving rapidly. And for really big, extended problems, they can be a godsend.

How do you feel Linux clusters compare to some of the commercially-sourced, but otherwise free cluster technology like Xgrid from Apple?

First, the general answer: While I often order the same dishes when I go to a restaurant, I still like a lot of choices on the menu. So I’m happy to see lots of alternatives. Ultimately, you’ll need to make a choice and stick to it. You can’t eat everything on the menu. But the more you learn about cooking, the better all your meals will be. And the more we learn about cluster technology, the better our clusters will be.

Second, the even more evasive answer: Designing and building a cluster requires a lot of time and effort. It can have a very steep learning curve. If you are already familiar with Linux and have lots of Linux boxes, I wouldn’t recommend Xgrid. If you are a die-hard Mac fan, have lots of Mac users and systems, Xgrid may be the best choice. It all depends on where you are coming from.

The programming side of a grid has always seemed to be the most complex, although I like the straightforward approach you demonstrated in the book. Do you think this is an area that could be made easier still?

Thanks for the kind words. Cluster programming is now much easier than it was a decade ago. I’m a big fan of MPI. And while software often lags behind hardware, I expect we’ll continue to see steady improvement. Of course, I’m also a big fan of the transparent approach taken by openMosix and think there is a lot of unrealized potential here. For example, if the transparent exchange of processes could be matched by transparent process creation through compiler redesign, then a lot more explicit parallel programming might be avoided.
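For readers who have never seen MPI, here is a minimal sketch of the explicit message-passing style Sloan mentions; it assumes an MPI implementation such as MPICH or Open MPI is installed, and the file name and process count are invented for illustration.

    /* hello_mpi.c - each process in the job reports its rank.
       Compile: mpicc hello_mpi.c -o hello_mpi
       Run:     mpirun -np 4 ./hello_mpi                         */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size;

        MPI_Init(&argc, &argv);                /* start the MPI runtime        */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* which process am I?          */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* how many processes in total? */

        printf("Hello from process %d of %d\n", rank, size);

        MPI_Finalize();                        /* shut the runtime down        */
        return 0;
    }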

What do you think of the recent innovations that put a 96-node cluster into a deskside case?

The six-digit price tag is keeping me out of that market. But if you can afford it and need it …

Go on, you can tell me, do you have your own mini cluster at home?

Nope - just an old laptop. I used to be a 24/7 kind of computer scientist, but now I try to leave computing behind when I go home. Like the cobbler’s kids who go without shoes, my family has to put up with old technology and a husband/father who is very slow to respond to their computing crises.

When not building clusters, what do you like to do to relax?

Relax? Well my wife says …

I spend time with my family. I enjoy reading, walking, cooking, playing classical guitar, foreign films, and particularly Asian films. I tried learning Chinese last year but have pretty much given up on that. Oh! And I do have a day job.

This is your second book - any plans for any more?

It seems to take me a couple of years to pull a book together, and I need a year or so to recover between books. You put so many things on hold when writing. And after a couple of years of not going for a walk, my dog has gotten pretty antsy. So right now I’m between projects.

Author Bio

Joseph D. Sloan has been working with computers since the mid-1970s. He began using Unix as a graduate student in 1981, first as an applications programmer and later as a system programmer and system administrator. Since 1988 he has taught computer science, first at Lander University and more recently at Wofford College where he can be found using the software described in this book.

You can find out more on the author’s website. More information on the book, including sample chapters, is available at O’Reilly.

Improved application development: Part 1, Collating requirements for an application

My latest Rational piece is up on the IBM site. This is an update of the series I co-wrote last year on using a suite of Rational tools for your development projects. The latest series focuses on the new Rational Application Developer and Rational Software Modeler, which are based on the new Eclipse 3.0 platform.

Developing applications using the IBM Rational Unified Process is a lot easier if you have the tools to help you throughout the process. The Rational family of software offers a range of tools that on their own provide excellent support for each phase of the development process. But you can also use the different tools together to build an entire application. By sharing the information, you can track components in the application from their original requirement specification through testing and release. This first part of a five-part series shows how to use Rational RequisitePro to manage and organize the requirements specification for a new project. Then, after you’ve developed your unified list of requirements, the tutorial shows how to use Rational Software Modeler to model your application based on those requirements.

You can read the full article.

If you’ve finished it and want more, check out Improved application development: Part 2, Developing solutions with Rational Application Developer.

Using HTTP Compression

I have a new article up at ServerWatch which looks at the benefits and configuration of HTTP compression within Apache and IIS. Here’s an excerpt from the intro:

There’s a finite amount of bandwidth on most Internet connections, and anything administrators can do to speed up the process is worthwhile. One way to do this is via HTTP compression, a capability built into both browsers and servers that can dramatically improve site performance by reducing the amount of time required to transfer data between the server and the client. The principles are nothing new — the data is simply compressed. What is unique is that compression is done on the fly, straight from the server to the client, and often without users knowing.

HTTP compression is easy to enable and requires no client-side configuration to obtain benefits, making it a very easy way to get extra performance. This article discusses how it works, its advantages, and how to configure Apache and IIS to compress data on the fly.
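To give a flavour of how little configuration is involved on the Apache side, here is a hedged sketch for Apache 2.x using mod_deflate; the module path and the list of MIME types are illustrative rather than a recommendation taken from the article.

    # httpd.conf fragment - enable on-the-fly compression (Apache 2.x, mod_deflate)
    # The module path depends on your installation.
    LoadModule deflate_module modules/mod_deflate.so

    # Compress common text formats; leave images and archives alone
    AddOutputFilterByType DEFLATE text/html text/plain text/css text/xml

    # Standard workarounds for older browsers with broken compression support
    BrowserMatch ^Mozilla/4 gzip-only-text/html
    BrowserMatch ^Mozilla/4\.0[678] no-gzip
    BrowserMatch \bMSIE !no-gzip !gzip-only-text/html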

Read on for the full article.

Interview with Tom Jackiewicz, author of Deploying OpenLDAP

My first article for LinuxPlanet is an interview with the author of Deploying OpenLDAP, Tom Jackiewicz. The book is an excellent guide to using and abusing the OpenLDAP platform.

As well as the contents of the book, I talked with Tom about the uses and best environments for LDAP solutions, as well as technical requirements for OpenLDAP. We also have a little discussion about the complexities of the LDAP system.

You can read the full interview.

Tom Jackiewicz, Deploying OpenLDAP

OpenLDAP is the directory server of choice if you want a completely free and open source solution to the directory server problem. Tom Jackiewicz is the author of Deploying OpenLDAP, a title that aims to dissolve many of the myths and cover the mechanics of using OpenLDAP in your organization. I talked to him about his book, his job (managing OpenLDAP servers) and what he does when he isn’t working on an LDAP problem.

Could you summarize the main benefits of LDAP as a directory solution?

There are many solutions to every problem. Some solutions are obviously better than others and they are widely used for that reason. LDAP was just one solution for a directory implementation. Some people insist that Sony’s BetaMax was a better solution than VHS–unfortunately for them, it just didn’t catch on. The main benefit of using LDAP as a directory solution is the same reason people use VHS now. There might be something better out there but people haven’t heard of it, therefore it gets no support and defeats the idea of having a centralized directory solution in place. Bigger and better things out there might exist but if they stand alone and don’t play well with others, they just don’t fit into the overall goals of your environment.

If you deploy any of the LDAP implementations that exist today, you instantly have applications that can tie into your directory with ease. For this reason, what used to be a large-scale integration project becomes something that can actually be accomplished. I’m way into standards. I guess LDAP was simple enough for everyone to implement and just caught on. If LDAP existed in the same form it does today but another directory solution was more accepted, maybe I’d be making arguments against using LDAP.

Please read the rest of the interview at LinuxPlanet.

Finding alternatives in developing software

My latest article over at Free Software Magazine is available. This time, I’m looking at the role of free software in development, both of free and proprietary applications. I discuss the benefits of free software and the pitfalls of proprietary solutions. Here’s an extract of the intro:

Developing software within the free software model can be achieved with all sorts of different tools, but choosing the right tools can make a big difference to the success of your project. Even if you are developing a proprietary solution, there are benefits to using free software tools to achieve it. But what free software tools are available? In this article I’m going to look at the development tools available, from languages and libraries to development environments, as well as examining the issues surrounding the use of free software tools by comparison to their proprietary equivalents.

You can read the full article.

Patrick Koetter, Ralf Hildebrandt, The Book of Postfix

Postfix is fast becoming a popular alternative to sendmail. Although it can be complex to configure, it’s easier to use Postfix with additional filtering applications, for example spam and virus filters, than with some other mail transfer agents. I spoke to Patrick Koetter and Ralf Hildebrandt about The Book of Postfix, the complexities of configuring Postfix, spam, and email security.

How does Postfix compare to sendmail and qmail?

Ralf Hildebrandt (RH): As opposed to sendmail, Postfix was built with security in mind.

As opposed to qmail, Postfix was built with real-life systems in mind, systems that have to adapt to the hardships of today’s Internet. qmail is effectively unmaintained.

Patrick Koetter (PK): That’s a tough question, because I am not one of those postmasters who have spent half their life working with Eric Allman’s Sendmail, nor did I spend too much time enlarging my knowledge of qmail, so I can’t give you a detailed answer that really tackles specific features or functionality.

Let me give it a different spin and see if that answers it:

When I set out to run my own first mail server I looked at Sendmail, qmail and Postfix.

Sendmail, to me, was too complicated to configure. My knowledge of the M4 macro language was very limited, and my fear of losing e-mail, or even configuring my server to be an open relay, was large, so I dropped it. The ongoing rally of CERT advisories about this or that Sendmail exploit back then didn’t make it a hard choice.

Then I took a look at qmail, but wasn’t really sure I wanted it, because it is more or less a series of patches if you want to use it with today’s range of features. But I gave it a try anyway and ended up asking some questions on the mailing list because the documentation would not answer what I was looking for.

To cut it short: I was under the impression you had to enter the “Church of qmail” before anyone would take the time to answer a question from a qmail novice. It might have changed since then, but back then I left and I never looked back, because all I wanted was to run an MTA.

Finally I took a look at Postfix and was very surprised by the amount of documentation that was available. I also immediately fell in love with the configuration syntax, which seemed so simple and clear to me. For a while I thought this must be a very feature-limited MTA, but the more I read the more I understood that it did almost all the same things, but was simply easier to configure.

I finally decided to stick with Postfix after I had joined the Postfix mailing list and found out that people really cared about my questions, pointed me to documentation to read again, or gave me advice on how to do this or that more efficiently.

Of course, as the Postfix community grew larger, the occasional character turned up who would rather lecture someone seeking help, but the overall impression still remains the same.

Postfix is well maintained, its security record is unbeaten up to now, and the community is what I wish every community supporting a piece of software would be. The modular software architecture Wietse Venema has chosen makes it easy to expand Postfix’s capabilities. It’s a system that can grow very well. I haven’t seen another piece of software that does the complex job of being an MTA that well.

Postfix seems a little complex to install - there are quite a few configuration files, some of which seem to contain arcane magic to get things working. Is this a downside to the application?

PK: That’s the provoking question, isn’t it? ;)

To me Postfix is as simple or complex as the process of mail transport itself. That’s why we added so many theory chapters to the book that explain the e-mail handling process before we set out to explain how Postfix does it in the follow-up chapter. If you understand the process, it’s pretty straightforward to configure Postfix to deal with it.

But basically all you need is three files, main.cf, master.cf and the aliases file. Wait! You could even remove the main.cf file and Postfix would work with reasonable defaults on this specific server.

The main.cf file carries all parameters that are applied globally. If you need options that are specific to a special daemon and should override global options from main.cf, you add them in master.cf in the context of that special daemon. That’s the basic idea of configuring Postfix.
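As an illustration of that split, here is a minimal, hedged sketch; the hostname, domain and network values are placeholders rather than settings taken from the book, and the master.cf example assumes a content filter such as amavisd-new is listening on the port shown.

    # /etc/postfix/main.cf - global parameters (values are placeholders)
    myhostname    = mail.example.com
    mydestination = $myhostname, localhost.$mydomain, localhost, example.com
    mynetworks    = 127.0.0.0/8, 192.168.1.0/24
    alias_maps    = hash:/etc/aliases

    # /etc/postfix/master.cf - per-daemon overrides are given with "-o"; for
    # example, handing mail received by the public smtpd to a content filter
    # (this assumes a filter is listening on 127.0.0.1:10024)
    smtp      inet  n       -       n       -       -       smtpd
      -o content_filter=smtp:[127.0.0.1]:10024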

Then there are a lot of tables in the /etc/postfix directory, which you usually don’t need unless you set out to configure a specific feature that isn’t part of the basic functionality.

Sure, the number of tables might frighten a novice, but they are there for the sole purpose of supporting the novice, and even advanced users, because they hold the documentation about what the specific table is for and how you would add entries to it if you wanted to use it.

The rest is complexity added by additional software, for example Cyrus SASL which is a royal pain for beginners.

Of course your mileage will vary when you set out to configure a full-blown MTA that incorporates anti-spam measures, anti-virus checking, SMTP authentication and Transport Layer Security, where Postfix looks up recipient names and other information from an LDAP server that also drives an IMAP server.

But when you begin it boils down to the two configuration files and an aliases file.

As for the “arcane magic”, I don’t exactly know what you are referring to, so let me speculate based on my own experiences.

I struggled with smtpd_*_restrictions for quite a while until I realized: “It’s the mail transport process that makes it so hard to understand.” Once you’ve understood how an SMTP dialog should be processed, it suddenly seems very simple. That is at least what happened to me. I recall hours sitting in front of these restrictions, Ralf ripping hair out of his head and looking at me as if I was from another planet.

The quote we used in the restrictions chapter alludes to that day and it also contains the answer I came up with: “To know what to restrict you need to know what what is.” I looked the “what” parts up in the RFCs, understood what smtpd_*_restrictions were all about and saved Ralf from going mad ;)

But that’s specific to smtpd_*_restrictions. For all other parameters and options it pays to read RFCs as well, but you can get very far by reading the excellent documentation Wietse has written _and_ by looking at the mere names he used for the parameters. Most of the time they speak for themselves and tell you what they will do. I think Wietse has done a great job at thinking of catchy self-explanatory parameter names.
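For readers who have not met these restrictions before, here is a conservative, hedged example of the kind of list being discussed; it is a sketch to experiment with, not a configuration taken from the book.

    # main.cf - an illustrative smtpd_recipient_restrictions list; the order
    # matters: permit trusted sources first, then reject relaying attempts
    smtpd_recipient_restrictions =
        permit_mynetworks,
        permit_sasl_authenticated,
        reject_unauth_destination,
        reject_non_fqdn_sender,
        reject_invalid_hostname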

RH: Postfix works with the default main.cf and master.cf. If you have advanced requirements, the configuration can get elaborate. But configuration files like the ones I created, and also offer at http://www.stahl.bau.tu-bs.de/~hildeb/postfix/, have evolved over several years of use (and abuse of the Internet by spammers). I never thought “that’s the way to do it”; it was rather trial and error.

Postfix seems to work exceptionally well as a mail transport agent - i.e. one that operates as an intermediate relay or relayhost (I’ve just set up a Postfix relay that filters spam and viruses, but ultimately delivers to a sendmail host, for example). Is this because of the flexible external interface Postfix uses?

RH: It also works excellently as a mailbox host :) Over the years, Wietse added features for content filtering and the ability to specify maps that tell the system which recipient addresses should be accepted and sent on further inwards.

That makes it easy to say “Instead of throwing away our old Product-X server, we simply wedge Postfix in between”

But there’s no special preference as an “intermediate relay”. It’s a universal MTA. We use it everywhere - also for the server handling the mailboxes, and for our list exploder.

Do you have a preferred deployment platform for Postfix?

PK: Basically I go for any platform that suits the needs. As for Linux, I prefer distributions that don’t patch Postfix, but that’s only because I support many people with SMTP AUTH issues on the Postfix mailing list and some maintainers have taken to doing this or that differently, which makes configuring SMTP AUTH even harder.

Personally I’d go for Red Hat Linux because I know it best and produce good results faster than on other platforms. But then I wouldn’t hesitate a second to go for something else if it suits the scenario better. That’s another side of Postfix I like very much: it runs on many, many systems.

RH: Debian GNU/Linux with Kernel 2.6.x. Patrick begs to differ on the Debian thing. Anyway, it works on any Unixoid OS. I ran it on Solaris and HP-UX back in the old days.

You cover the performance aspects of Postfix. Is it particularly taxing on hardware?

PK: That’s a question that turns up regularly on the Postfix mailing list. Read the archives… ;)

But seriously, you can run Postfix for a single domain on almost any old hardware that’s lying around. If your OS works with the hardware, Postfix will probably get along with it as well.

The more domains you add and the more mail you put through, the likelier it is, of course, that you will hit the limits. But those limits usually aren’t limits imposed by Postfix, but by the I/O performance of your hardware.

Think of it this way: mail transport is about writing, moving and copying little files in the filesystem of your computer. The MTA receives a mail from a client and writes it to a mail queue, where it waits for further processing. A scheduler determines the next job for the file and the message is moved to another queue. There it might wait another while until it gets picked up again to be delivered to another, maybe remote, destination. If the remote server is unreachable at the moment, it will be written back to the filesystem again, to another queue, and so on and so on until it can finally be removed after successful delivery.

The calculation to decide what to do with the mail doesn’t take a lot of time, but writing, moving and copying the file takes a lot longer. That’s due to the limitations of hardware. Hard discs nowadays really can store a lot of e-mail, but access speed hasn’t grown at the same rate. Still, you need to stick with them, because storing the message on a temporary device would lose the mail if the system was turned off suddenly.

So the basic rule is to get fast discs, arrays and controllers when you need to handle _a lot_ of email. Regular hardware does it quite well for private users.

Another slowdown you should be prepared for is when you integrate anti-spam and anti-virus measures. Not only do they need to read and write the files, they also examine the content, which often requires unpacking attached archives. This will temporarily eat some of your CPU. But that’s something current hardware can deal with as well.

For hard facts you will need to find somebody who is willing to come up with a real-world, well-documented test scenario. So far one person or another has posted “measurement data”, but none of them would really say much about their setup and how they tested. I also don’t know of a sophisticated comparison of Sendmail, qmail and Postfix.

Most of the “comparisons” I’ve heard weren’t able to get rid of the odor of “because you wanted it to be better”.

Such tests are not what Postfix is about and, as far as I can say without asking him, not what Wietse Venema is about either. I vividly recall him posting “Stop speculating, start measuring!” to someone who came up with a performance problem. I like that attitude a lot, because comparisons should be about facts, not belief.

I enjoyed the in-depth coverage on using certificate based security for authenticating communication between clients and servers. Do you see this as a vital step in the deployment process?

PK: Vital or not depends on your requirements and your in-house policy. Personally I like certificate-based relaying a lot and I think it should be used more widely, because you could really track spam down a lot better and would gain more secure mail transport at the same time. But certificate-based relaying simply lacks a critical mass of servers and clients supporting it.

As long as you don’t have that critical mass of servers and clients using it, there will always be a relay that does without it and that can be tricked into relaying spam one way or another, and you lose track of the sender.

It also takes more work to configure, and especially to maintain, certificate-based relaying, because you need to maintain the list of certificates. You need to remove the ones that have expired, add others, hand out new ones, this and that…

I think it’s a “good thing to do [TM]” if you use it in your company, have many mobile users, but most of all (!) have all clients and servers under your control. Then you can automate some of the work that needs to be done, and all that together can pay off in the security and simplicity you get on your network.

But I doubt any private user would be willing to pay the additional fee for maintenance, not to mention the certificate infrastructure needed to maintain the certificates themselves.

Was it Yahoo who had some certificate-based anti-spam measure in mind? So many attempts to fix the effects of spam… I think what we really need is a redesign of SMTP to cope with the current challenges. But that’s another topic and I am certainly not the one to ask how it should be done. ;)

Is it better to use files or MySQL for the control tables in Postfix?

RH: “He said Jehovah!”

Performance-wise MySQL just sucks. The latency for queries is way higher than when asking a file-based map. But then, with MySQL maps, any changes to the map become effective immediately, without the daemons that use the map having to exit and restart. If your maps change often AND you get a lot of mail: MySQL. In all other cases: file-based maps.

And: Keep it simple! If you don’t NEED MySQL, why use it?

PK: I don’t think there’s a better or worse, because either way you lose or gain something, and what you lose and gain aren’t the same things:

From a performance point of view you lose a lot of time when you use SQL or LDAP databases, because of their higher lookup latency, so you might want to stick with the files.

But then, if you host many domains, you win a lot when you maintain the data in a database. You can delegate many administrative tasks to the end user, who accesses such a database through some web frontend. So there’s the pro for databases.

If you need both performance and maintainability, you can build a chain from databases and files. The editing is done in the database, and a job on your computer checks the database on a regular basis and builds (new) files from it when the data has changed. This way you get the best of both worlds for the price of a little delay after changes have been made in the database.
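To make the trade-off concrete, here is a hedged sketch of the two lookup styles; the file names, table layout and credentials are invented for illustration, and the MySQL parameters follow the mysql_table format of recent Postfix releases.

    # File-based map: very fast lookups; rebuild it with postmap after editing
    virtual_alias_maps = hash:/etc/postfix/virtual
    #   postmap /etc/postfix/virtual

    # MySQL-backed map: changes take effect immediately, at the cost of latency
    # virtual_alias_maps = mysql:/etc/postfix/mysql-virtual.cf
    #
    # /etc/postfix/mysql-virtual.cf (illustrative values):
    #   user     = postfix
    #   password = secret
    #   hosts    = localhost
    #   dbname   = mail
    #   query    = SELECT destination FROM virtual WHERE address = '%s'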

IMAP or POP?

PK: An old couple sits in the kitchen at home.

She: “Let’s go to the movies.”
He: “But we have been to the movies just recently…”
She: “Yes, but they show movies in colour AND with sound now!”

Definitely IMAP ;)

RH: Depends on your needs. Let the user decide: go for courier-imap (which also does pop), so the user can choose.

Is there a simple solution to the spam problem?

RH: Mind control? Orbital lasers? No, but Postfix’s restrictions and the possibility of delegating policy decisions to external programs can help.

PK: No, unfortunately not. There are too many reasons why Spam works and a working solution would have to be technical, political and business oriented at the same time.

First of all, it works because the SMTP protocol as designed has little or no means of proving that a message was really sent by the sender given in the e-mail. Anybody can claim to be anybody. As long as this design problem persists it will cost a fortune to track spammers down.

Even if you know where the spam came from, the spammer might have withdrawn to a country that doesn’t mind spammers and will protect them from being pursued by foreign law.

The world simply lacks anti-spam laws that all countries agree on. You are typically forced to end your chase for a spammer the moment you cross another country’s borders, because you are not entitled to pursue the suspect there.

Still, even if you were entitled to do so, it costs a fortune to track a spammer down, and even then it might take ages to get some money for the damage they have done. Is your company willing to pay that much just to nail one spammer down, when another two emerge the moment that one goes behind bars?

And then spam works because it is so cheap. You buy a hundred thousand addresses for 250 bucks or even less, and IIRC Yahoo found out that 1/3 of their mail users read spam and VISIT the pages they promote.

If one wants to make it go away one must make it expensive for those that send or endorse spam. If you ruin the business concept no one will send spam. That’s business… ;)

To sum my position up: The problem is global and we don’t have the right tools to hinder the cause. Currently all we can do is diminish the effect, by using as many anti-spam features as we can think of.

Do either of you have a favourite comic book hero?

PK: The “Tasmanian Devil” is my all time favourite. I even have a little plastic figure sitting in front of me under my monitor, which has become some kind of talisman. It reminds me to smile about myself on days where I’d rather go out and kill somebody else for not being the way I would want them to be ;)

RH: Calvin (of Calvin and Hobbes)
or
Too much Coffee Man!

Author Bios
Ralf Hildebrandt and Patrick Koetter are active and well-known figures in the Postfix community. Hildebrandt is a systems engineer for T-NetPro, a German telecommunications company, and Koetter runs his own company consulting and developing corporate communication for customers in Europe and Africa. Both have spoken about Postfix at industry conferences and contribute regularly to a number of open source mailing lists.

Cristian Darie, Mihai Bucica, Beginning PHP 5 and MySQL E-Commerce

PHP and MySQL are common solutions in many web development situations. However, when using them for e-commerce sites, some different techniques should be employed to get the best out of the platforms. I talked to Cristian Darie and Mihai Bucica about their new book, which takes an interesting approach to demonstrating the required techniques: the book builds an entire T-shirt ordering shop.

Could you give me, in a nutshell, the main focus of the book?

When writing “Beginning PHP 5 and MySQL E-Commerce”, we had two big goals of equal importance in mind. The first goal was to teach the reader how to approach the development of a data-driven web application with PHP and MySQL. We met this goal by taking a case-study approach, and we did our best to mix new theory and practice of incremental complexity in each chapter.

The second goal was to provide the knowledge necessary to build a fully functional e-commerce website. We did our best to simulate development in a real world environment, where you start with an initial set of requirements and on a low budget, and along the way (perhaps after the company expands) new requirements show up and need to be addressed.

You can check out the website that we build in the book at http://web.cristiandarie.ro:8080/tshirtshop/.

Why use PHP and MySQL for e-commerce? Do you think it’s easier to develop e-commerce sites with open source tools like PHP and MySQL?

Generally speaking, the best technology is the one you know.

PHP and MySQL is an excellent technology mix for building data-driven websites of small and medium complexity. The technologies are stable and reliable, and the performance is good.

However, we actually don’t advocate using any particular technology, because we live in the real world where each technology has its strengths and weaknesses, and each project has its own particularities that can lead to choosing one technology over another. For example, if the client already has an infrastructure built on Microsoft technologies, it would probably be a bit hard to convince him or her to use PHP.

As many already know, for developers who prefer (or must use) ASP.NET and SQL Server, Cristian co-authored a book for them as well - “Beginning ASP.NET 1.1 E-Commerce: From Novice to Professional” - with the ASP.NET 2.0 edition coming out later this year.

You de-mystify some of the tricks of the e-commerce trade - like rankings and recommendations; do you think these tricks have a significant impact on the usability of your site?

The impact of these kinds of tricks is very important, not only from the usability point of view but also because your competitors already have these features implemented. If you don’t want them to steal your customers (or sell more than you do), read the nine-page chapter about implementing product recommendations, and add that feature to your own website as well.

You don’t use transactional database techniques in your book - is this something that you would recommend for heavy-use sites?

Yes. “Beginning PHP 5 and MySQL E-Commerce” is addressed to beginning to intermediate programmers, building small to medium e-commerce websites - as are the vast majority of e-commerce websites nowadays. The architecture we’re offering is appropriate for this kind of website, and it doesn’t require using database transactions. For a complex, heavy-traffic website, a more advanced solution would be recommended, and we may write another book to cover that scenario.

The book shows in detail the implementation of an e-commerce website - do you know of anybody using this code for their own site?

Although the book is quite new, we’ve received lots of feedback from readers, some of them showing us their customized solutions based on the code shown in this book. Some of these solutions are about to be launched to production.

Credit card transactions always seemed to be the bane of e-commerce, especially for open source technology. Is it true that this has become easier recently?

Yes. Because many more e-commerce websites are built with open source technologies than there used to be, the payment gateways have started providing APIs, documentation, and examples for PHP, just as they do for .NET and Java. This makes the life of the developer much easier.

Larger e-commerce applications may require more extensive deployment environments - are the techniques you cover here suitable for deployment in a multi-server environment?

PHP has its own limitations that make it inappropriate for extremely complex applications, but for the vast majority of cases PHP is just fine. The techniques we cover in the book aren’t meant to be used in multi-server environments; for those kinds of environments PHP may not be your best choice, but then again, it all depends on the particularities of the system.

Obviously PHP and MySQL provide the inner workings to an e-commerce site. Do you think the website design is as important as the implementation?

Of course, the website design is critical, because it reflects the “face” of your business. As we’ve mentioned in the book, it just doesn’t matter what rocket science was used to build the site, if the site is boring, hard to find, or easy to forget. Always make sure you have a good web designer to complement the programmers’ skills.

What do you do to relax?

Well, we’re both doing a good job at being 24 years old…

Author Bios

Mihai Bucica

Mihai Bucica started programming and competing in programming contests (winning many of them), all at age twelve. With a bachelor’s degree in computer science from the Automatic Control and Computers Faculty of the Politehnica University of Bucharest, Romania, Bucica works as an Outsourcing Project Manager for Galaxy Soft SRL. Even after working with a multitude of languages and technologies, Bucica’s programming language of choice remains C++, and he loves the LGPL world.

Cristian Darie

Cristian Darie, currently the technical lead for the Better Business Bureau Romania, is an experienced programmer specializing in open source and Microsoft technologies, and relational database management systems. In the last 5 years he has designed, deployed, and optimized many data-oriented software applications while working as a consultant for a wide variety of companies. Cristian co-authored several programming books for Apress, Wrox, and Packt Publishing, including Beginning ASP.NET 2.0 E-Commerce, Beginning PHP 5 and MySQL E-Commerce, Building Websites With The ASP.NET Community Starter Kit, and The Programmer’s Guide to SQL. Cristian can be contacted through his personal website, www.CristianDarie.ro.

Garrett Rooney, Practical Subversion

Subversion is having what can only be described as a subversive effect on the versioning software environment. CVS has long been the standard amongst programmers, but it has its faults, and Subversion (read sub-version) addresses those faults, both known and perceived, in CVS. I talked to Garrett Rooney about his book Practical Subversion, his contributions to the Subversion code, and where Subversion fits into the scheme of your administration and development environments.

I see from the book you are a strong believer in version control - can you summarize the main benefits of version control?

I like to think of version control as a way of communicating information between developers.

When you commit a change to a source tree you can think of it as an automated way of telling every other developer how they can fix the same problem in their source tree. The benefits go further though, since in addition to keeping everyone on the team up to date with the latest fixes, you’re also recording all of the history. This means that later on, when you want to figure out how a piece of code got the way it is, you can look at the series of changes (and hopefully the justification for the changes, if you’ve been good about writing log messages) that led to the current situation. Looking at that history is often the best way to understand why the code got the way it is, which means you’re less likely to make the same mistake twice when making new changes.

So version control is really a way to help you communicate, both with other people working on your project right now and with those working on it in the future.

There’s been a lot of discussion online about the benefits of Subversion compared to the previous preferred environment of CVS. How much better is Subversion?

I recently had to start using CVS again, after a rather long period of time where I’d only used either Subversion or Perforce, a commercial version control system. It never ceases to amaze me, whenever I go back to CVS, how irritating it is to use.

Let’s start with the basics. Lots of things in CVS are slow.

Specifically, lots of operations that I like to do fairly often (’cvs diff’ is the big one here) need to contact the repository in order to work; this means going out over a network, which means it’s pretty slow. In Subversion the equivalent command is lightning quick, since your working copy keeps a cached copy of each file, so it doesn’t have to contact the server in order to show you the difference between the file you started with and the new version you created.

There are other parts of CVS that are also quite slow when compared to Subversion. In CVS the act of tagging or branching your source tree requires you to make a small change to each and every file in the tree. This takes a lot of time for a large tree, and a noticeable amount of disk space. In Subversion the equivalent operation takes a constant, and very small, amount of time and disk space.

The other big improvement is the fact that in Subversion changes are committed to the source tree in an atomic fashion. Either the entire change makes it in or none of it does. In CVS you can get into a situation where you updated your working copy in the middle of a commit, resulting in you getting only half of the changes, and thus a broken source tree. In Subversion this doesn’t happen.

The same mechanism means that it’s much easier to talk about changes in Subversion than in CVS. In CVS, if you have a change to five separate files, in order to talk about it you need to talk about the individual change to each file: “I committed revision 1.4 of foo.c, 1.19 of bar.c, …” This means that if someone wants to look at the change you made to each file they have to go look at each individual file to do it. In Subversion you just say “I committed revision 105”, and anyone who wants to look at the diff can just say something like “svn diff -r104:105” to see the difference between revision 104 and revision 105 of the entire tree. This is also quite useful when merging changes between branches, something that’s quite difficult in CVS.

Finally, the user interface provided by the Subversion client is simply nicer than the one provided by CVS. It’s more consistent, and generally easier to use. Enough things are similar to CVS that a CVS user can easily get up to speed, but the commands generally make sense to a new user, as compared to those of CVS which can be rather confusing.

How does Subversion compare with version controls other than CVS, BitKeeper for example has been in the news a lot recently. How about commercial products, like Visual SourceSafe or ClearCase?

I’ve personally never used BitKeeper, largely because of its license. While BK was available under a “free as in beer” license for use in developing open source software, the license prohibited users from working on competing products, like Subversion. As a result I’ve never really had a chance to try it out.

I do think that BitKeeper has some interesting ideas though, and the other distributed version control systems (Arch, Darcs, Bazaar-NG, etc) are all on my radar. I don’t know if I’m convinced of their advantages over centralized systems like Subversion, but there is interesting work being done here. Personally, of the three distributed systems I just mentioned I’m most interested in Bazaar-NG (http://bazaar-ng.org/).

As for the commercial products out there, I’ve had personal experience with Perforce and SourceSafe. I wasn’t impressed with SourceSafe at all, and I really can’t think of a situation where I’d use it willingly. Perforce on the other hand is a very nice system. Its branching and merging support is superior to what Subversion provides at the moment (although the Subversion team has plans to close that gap in the future). That said, Perforce is expensive, and unless you really need specific features that can only be found there I wouldn’t see much reason to go with it.

You sound like you’ve had a lot of personal experience of where the right source control mechanism has saved your life. Any true tales that might help highlight the benefits of version control?

Personally, my most memorable experiences where version control would have been a lifesaver are those from before I started making use of it on a daily basis.

I know that back in college there were several times when I was working late at night on some project, usually due within a few hours, and I managed to screw things up badly. It’s remarkable how easy it is to go from a version of a program that’s mostly working to one that’s totally screwed up, all while trying to fix that last bug. It’s especially bad when your attempt to fix that last bug introduces more problems, and you can no longer remember exactly what you changed.

With a version control system, you never really need to be in that situation. At the absolute worst, you can always roll back to the version of the code you had at your last commit. It’s impossible to get stuck in that situation where you can’t figure out what you changed because the system will remember for you.

Now that all of my non-trivial projects (and most of my trivial ones, honestly) make use of version control, I just don’t find myself in those kinds of situations anymore.

Existing developers will almost certainly need to migrate to Subversion - how easy is this?

It’s easy to underestimate the problems that come with migrating from one version control system to another. Technically, a conversion is usually pretty straightforward. There are various different programs available to migrate your data (cvs2svn for CVS repositories, p42svn for Perforce, and others), and in many cases it can be tempting to just run the conversion, toss your users some documentation and away you go.

Unfortunately, it isn’t that simple. Version control becomes part of a developer’s day to day workflow, and changing something like that has consequences. There needs to be careful planning and most importantly you need to have buy in from the people involved.

Another Subversion developer, Brian Fitzpatrick, will actually be giving a talk about this very subject at OSCON this year, and I’m looking forward to hearing what he has to say.

http://conferences.oreillynet.com/cs/os2005/view/e_sess/6750

Some versioning systems have problems with anything other than text. What file types does Subversion support?

Subversion, by default, treats all files as binary. Over the network, and within the repository, files are always treated as binary blobs of data. Binary diff algorithms are used to efficiently store changes to files, and in a very real sense text and binary files are treated identically.

Optionally, there are various ways you can tell Subversion to treat particular files as something other than binary.

If you want end-of-line conversion to be performed on a file - for example, so it shows up as a DOS-style file when checked out on Windows but a Unix-style file when checked out on a Unix machine - all you have to do is set the svn:eol-style property on the file.

Similarly, if you want keyword substitution to be performed on a file, so that words like $Revision$ or $Date$ are replaced with the revision and date the file was last changed, you can set the svn:keywords property to indicate that.

The key fact to keep in mind is that in Subversion these are optional features that are turned off by default. By their very nature they require that Subversion make changes to your files, which can be catastrophic in some cases (changing all the \r\n’s in a binary file to \n’s isn’t likely to work very well), so you need to ask for this behavior if you want it. In systems like CVS these kinds of features are turned on by default, which has resulted in countless hours of pain for CVS users over the years.
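For reference, both behaviours are requested per file with svn propset; a minimal sketch, with the file names invented for illustration:

    # Ask for native end-of-line conversion on a text file
    svn propset svn:eol-style native README.txt

    # Ask for keyword expansion of $Revision$ and $Date$ in a source file
    svn propset svn:keywords "Revision Date" foo.c

    # Property changes are committed like any other change
    svn commit -m "Enable eol-style conversion and keyword expansion"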

From reading your book, it’s obvious that Subversion seems a little bit more application friendly than CVS, integrating with Apache, emacs and others with a little more grace than CVS and RCS. Is that really the case?

Well, let’s be fair, CVS and RCS have quite good integration with various tools, ranging from Emacs to Eclipse. That said, it hasn’t been easy to get to that point. In many cases tools that want to integrate with CVS have to jump through hoops to call out to the command line client and parse the resulting output, which can be fragile. In cases where that isn’t possible many projects have reimplemented CVS so that it could be more easily integrated.

In Subversion many of these problems are alleviated by the fact that the core functionality is implemented as a collection of software libraries. If you want to make use of Subversion in your own code all you need to do is link against the Subversion libraries and you can provide exactly the same functionality as the official Subversion client. If you’re not working in C or C++ there are probably bindings for the Subversion libraries written in your language of choice, so you can even do this without having to learn a lot about the C level libraries.

Additionally, Subversion’s ability to integrate with Apache has provided a number of capabilities, ranging from WebDAV integration to the ability to use an SQL or LDAP database for storing usernames and passwords, that otherwise would have been incredibly difficult to implement. By working within the Apache framework we get all of that for free.

Subversion includes autoversioning support for DAV volumes, could you explain how you could use that to your advantage?

DAV autoversioning is most useful when you need to allow non-technical users, who would be uncomfortable making use of normal Subversion clients, to work with your versioned resources. This could mean source code, but most commonly it involves graphics files or Word documents and other things like that. Your users simply use a DAV client (which is often built into their operating system) to access the files, and the version control happens transparently, without them even knowing about it. When they save their changes to the file it is automatically committed to the repository. This is a very powerful tool, and can give you some of the advantages of version control without costly training for your users.
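Autoversioning is switched on in the Apache configuration for mod_dav_svn; a minimal hedged sketch, with the location and repository path invented for illustration:

    # httpd.conf fragment (assumes mod_dav and mod_dav_svn are loaded)
    <Location /repos>
        DAV svn
        SVNPath /var/svn/documents
        # Each save from a plain DAV client becomes an automatic commit
        SVNAutoversioning on
    </Location>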

Most people use version control for their development projects, but I’ve also found it useful for recording configuration file changes. Is that something you would advocate?

Absolutely! I personally keep much of my home directory under Subversion’s control, allowing me to version my editor’s config files, .bashrc, et cetera. All the same benefits you can get from using version control with software development are just as applicable for configuration files.
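A hedged sketch of what that can look like in practice, with the repository and file names invented for illustration:

    # Create a local repository and import a few configuration files
    svnadmin create /home/user/svnrepo
    mkdir ~/config
    cp ~/.bashrc ~/.emacs ~/config/
    svn import ~/config file:///home/user/svnrepo/config -m "Initial import of dotfiles"

    # Check the directory back out as a working copy and edit/commit from there
    svn checkout file:///home/user/svnrepo/config ~/config-wc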

How important do you think it is for high quality tools like Subversion to be Open Source?

I’m a big fan of using open source licensing and development models for infrastructure level software (operating systems, compilers, development tools, etc). Well, honestly I’m a big fan of using open source licenses and development models for most kinds of software, but I think it’s particularly appropriate for software at the infrastructure level, where the primary use case is building larger systems.

It’s difficult to imagine a company being able to make money developing a new operating system in this day and age, or a version control system, or a C runtime library. These are largely commoditized parts of the software ecosystem, and as a result I think it makes sense for the various people who benefit from having them available to share the cost for producing them, and the best way we currently have to do that is via open source.

Additionally, it’s difficult to overestimate the value of having high quality systems out there for people to learn from. I’ve learned a great deal by reading open source code, and even more by participating in open source projects.

Finally though, I just like the fact that if I find a problem in a piece of open source software like Subversion I can actually do something about it. I’ve worked with closed source third party software in the past, and I’ve found that I tend to spend a lot of time digging through inadequate documentation and beating my head against the wall while trying to work around bugs. With an open source product you can at least make an attempt to figure out what the actual problem is.

Contributing to Subversion in your spare time doesn’t seem like a very relaxing way to spend your free time. Is there something less computer-based that you like to do?

I don’t know, there’s something fun about working on open source projects. It’s awfully nice to have the freedom to do things the “right” way, as opposed to the “get it done right now” way, which happens far too often in the commercial software world.

That said, I do try to get off of the computer from time to time. I see a lot of movies, read a lot, and lately I’ve been picking up photography. I also just moved to Silicon Valley, so I’m making a concerted effort to explore the area.

Anything else in the pipeline you’d like to tell us about?

Well, Subversion 1.2 is on its way out the door any day now, and that’ll bring with it some great new features, primarily support for locking of files, something that many users had been requesting.

As for my other projects, I’m going to be giving a talk at O’Reilly’s OSCON again in August. This year I’ll be speaking about the issues regarding backwards compatibility in open source software. I’ve also been spending a lot of time on the Lucene4c project (http://incubator.apache.org/lucene4c/), trying to provide a C level API to access the Apache Lucene search engine.

Garrett Rooney Bio

Garrett Rooney works for Ask Jeeves, in Los Gatos CA, on Bloglines.com. Rooney attended Rensselaer Polytechnic Institute, where he managed to complete 3 years of a mechanical engineering degree before coming to his senses and realizing he wanted to get a job where someone would pay him to play with computers. Since then, Rooney completed a computer science degree at RPI and has spent far too much time working on a wide variety of open source projects, most notably Subversion.

Computerworld Blogs

Computerworld have set up a dedicated blogging area on their site at Computerworld blogs.

There are a few of us there; all dedicated to blogging on different news stories in a range of different areas and topics. You can read my blog at the dedicated Martin MC Brown Computerworld blog.

Alternatively, you can subscribe to my dedicated RSS feed.

You can see that we’ve been populating it over the last week or so; there are already blog posts from me, and others, on a variety of topics.

Please feel free to read and either comment there, or here and let me know how I’m getting on.