
LOSUG Presentation Slides Now Available

My presentation at LOSUG on Tuesday went down like a house on fire – I think it would be safe to say that the phrase for the evening was ‘It’s a cache!’.

For that to make sense, you need to look at the slides, which are now available here.

Attendance was great, but it seems the last minute change of day meant that some people missed the session. We had 151 people register, and about 80 turned up on the night.

LOSUG January: MySQL/DTrace and Memcached

Next Tuesday (27th Jan), I’ll be speaking at the London OpenSolaris User Group again. For those that follow the LOSUG meetings, we normally meet on the third Thursday of the month, but due to the overwhelming popularity of the event this month (more than 100 registrations so far) we have had to push the event back to the last Tuesday of the month.

This month, I’ll be talking about the DTrace probes that we have added into MySQL and demonstrating their use and functionality. Along the way I’ll also cover some of the internals of MySQL and how it works (and how they relate to the DTrace probes we’ve added), how to use the probes to analyze and diagnose your queries, and how I’ve already used the DTrace probes to provide information up to the Query Analysis functionality within Enterprise Monitor.

After we’ve looked at performance monitoring and optimization with DTrace, I’ll then demonstrate how to get a little more performance out of your application using MySQL by taking advantage of the memory cache offered by Memcached.

As before, I’ll provide a link to the finished presentation once I’ve demonstrated everything to the LOSUG folks on Tuesday. If you happen to be in London, and don’t already have plans of some sort, please feel free to come along – food and drink will be provided after the session and I’ll be around for as long as I can for any questions.

I love my Moleskine

Not remotely related to computing at all, but I’ve just been updating my diaries, and I use Moleskine. I go everywhere with a Moleskine of some description (they’ve recently released some really tiny notebooks, which make it easier to carry your life around with you).

Despite having computers, organization tools, and email, there is something decidedly comforting about writing, by hand, into a physical notebook.

Of course, you write in pencil so that the contents don’t smudge, and somehow that just feels even better.

Multiple VCS Updates and Cleanups

I spend a lot of time updating a variety of repositories of different types and denominations, and I hate having to do that all by hand – I’d rather just go to a top-level directory, say update-all, and let a script work out what to do, no matter what different repos are there.

I do it with a function defined within my bash profile/rc scripts, and it covers git, bzr, svn, bk, and cvs. The trick is to identify what type of directory we are updating. I do this, lazily, for each type individually, rather than for each directory, but I’ve found this method to be more reliable.

update-all ()
{
    # Subversion checkouts (identified by a .svn directory)
    for file in `ls -d */.svn 2>/dev/null`; do
        realdir=`echo $file|cut -d/ -f1`
        echo Updating in $realdir
        ( cd $realdir; svn update )
    done
    # Bazaar branches
    for file in `ls -d */.bzr 2>/dev/null`; do
        realdir=`echo $file|cut -d/ -f1`
        echo Updating in $realdir
        ( cd $realdir; bzr pull )
    done
    # Git repositories
    for file in `ls -d */.git 2>/dev/null`; do
        realdir=`echo $file|cut -d/ -f1`
        echo Updating in $realdir
        ( cd $realdir; git pull )
    done
    # CVS checkouts
    for file in `ls -d */CVS 2>/dev/null`; do
        realdir=`echo $file|cut -d/ -f1`
        echo Updating in $realdir
        ( cd $realdir; cvs up )
    done
    # BitKeeper repositories
    for file in `ls -d */BitKeeper 2>/dev/null`; do
        realdir=`echo $file|cut -d/ -f1`
        echo Updating in $realdir
        ( cd $realdir; bk pull )
    done
    unset realdir
}

That’s it – a quick way to update any directory of repos.

Feeding Query Analyzer from DTrace

One of the new features in the new release of MySQL Enterprise Monitor is Query Analyzer. As the name suggests, the Query Analyzer provides information about the queries that are running on your server, their response times, and row and byte statistics. The information provided is great, and it doesn’t take very long to see from the query data supplied that there are places where you could improve the query, or even reduce the number of queries that you submit.

The system works by using the functionality of the MySQL Proxy to monitor the queries being executed and then provide that information up to the MySQL Enterprise Service Manager so that the information can be displayed within the Query Analyzer page. To get the queries monitored, you have to send the queries through the agent which both monitors their execution and sends the information on up to the Manager, along with all the other data being monitored.

The team, though, have been a bit clever and opened up the system to allow information to be sent to the Manager using a REST interface. This means that any system capable of providing information that you want to monitor can be sent up to the Manager. Of course, you can’t just send anything, the Manager needs to know how to handle it, but it shows the flexibility of the design and the potential for the future.

So how does this help us?

Well, one of the new features in MySQL 6.0 that I’ve been working on (with Mikael Ronstrom and Alexey Kopytov) is DTrace probes. We’ve added a bunch of static DTrace probes into MySQL 6.0 (the full set will appear in MySQL 6.0.8, I think) designed to let you monitor the execution of queries within the server. The probes let you see the top-level information, such as overall execution time, but also go deeper, so that you can get information about individual row operations, whether the query used the query cache, and whether it used a filesort operation.

I haven’t finished the DTrace probes documentation yet, but I have been demonstrating the probes at conferences and talks (including my MySQL on OpenSolaris university session this week). Trust me, you’ll be pleased. I’ve got a separate blog post detailing some of the specifics in the works at the moment.

There’s an obvious synergy here: why don’t we extract the data using DTrace and feed it up to the Enterprise Manager?

To do this, there are two parts to the process: the DTrace probes, and the script that passes that information up to the manager in a suitable format.

The D script is quite straightforward: we initialize the structures, populate the core information that we need (query string, bytes, rows, and the time), and then use the remainder of the probes to finalize that information. Let’s have a look at the script and then go through the detail:

#!/usr/sbin/dtrace -s

#pragma D option quiet

mysql*:::query-start
{
   self->query = copyinstr(arg0);
   self->db    = copyinstr(arg2);
   self->rows  = 0;
   self->querystart = timestamp;
   self->bytes = 0;
}

mysql*:::select-done
{
        self->rows = arg1;
}

mysql*:::insert-done
{
        self->rows = arg1;
}

mysql*:::update-done
{
        self->rows = arg2;
}

mysql*:::multi-delete-done
{
        self->rows = arg1;
}

mysql*:::delete-done
{
        self->rows = arg1;
}

mysql*:::multi-update-done
{
        self->rows = arg2;
}

mysql*:::net-write-start
{
        self->bytes = self->bytes + arg0;
}

mysql*:::query-done
/self->query != NULL/
{
        printf("%s:%s:%d:%d:%d\n",
        self->query,
        self->db,
        ((timestamp - self->querystart)/1000),
        self->rows,self->bytes);
}

First, we set a pragma to quieten down the output so that the DTrace script only reports what we explicitly write out:

#pragma D option quiet

In DTrace, the individual execution points are called probes, and probes are triggered each time that point in the code is reached. To specify the probes we want to watch for, you use a special format, provider:module:function:name that identifies the probe by the name of the provider (the application), the module, the function, and the probe, each separated by a colon. We can just specify the provider and probe name, like mysql*:::query-start.

It should also be noted that probes are often provided in pairs at the start and end of an operation, so you can identify the start and end of a query by looking for the query-start and query-done probes.
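If you want to check which probes your mysqld binary actually exposes before writing a script, DTrace’s list mode is a quick way to do it; for example (assuming a probe-enabled build and sufficient privileges):

$ pfexec dtrace -l -n 'mysql*:::'
$ pfexec dtrace -l -n 'mysql*:::query-start' -n 'mysql*:::query-done'

The first command lists every probe published by the mysql provider; the second narrows the list to the query start/done pair we use below.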

The DTrace probes in the server are set up in a sort of nested structure, going deeper into the query process as needed. Although not at the very top of the execution cycle, the start of the main query processing is identified by the query-start probe. Each time a query is submitted to MySQL, this probe will be triggered, so for us it is the start of the process. The probe has a number of arguments, but for our purposes we only need the first (arg0), which contains the full query string, and the third (arg2), which contains the name of the database that the query was executed against.

We also initialize the row and byte counts, and the time when the query was executed using the built-in timestamp value. All of this information is placed into the special self structure, which is a persistent structure used to share information between the individual probes that get fired during execution.

mysql*:::query-start
{
   self->query = copyinstr(arg0);
   self->db    = copyinstr(arg2);
   self->rows  = 0;
   self->querystart = timestamp;
   self->bytes = 0;
}

To get the counts of the number of rows, we can’t get the information from the query-done probe. This is because different operations actually provide different levels of information. For example, the select-done and insert-done probes just provide a count of the rows. But the update-done probe provides information about both the number of rows that matched the original WHERE clause and the count of the number of rows actually modified.

To record the number of the rows modified by the query, we therefore need to pull out each piece of information individually:

mysql*:::select-done
{
        self->rows = arg1;
}
mysql*:::insert-done
{
        self->rows = arg1;
}

mysql*:::update-done
{
        self->rows = arg2;
}

mysql*:::multi-delete-done
{
        self->rows = arg1;
}

mysql*:::delete-done
{
        self->rows = arg1;
}

mysql*:::multi-update-done
{
        self->rows = arg2;
}

For the bytes retrieved by each query, the information is a bit more difficult to identify. I’m going to cheat a bit and use the bytes sent by mysqld during a net write to the client. There is a limitation here that I’ve glossed over: we could end up counting data sent to any client, since I haven’t bothered to track connection IDs. I could do this, but it would make the script a little more complicated. Since net-write-start might be called multiple times for a long query, we calculate a cumulative byte count.

mysql*:::net-write-start
{
        self->bytes = self->bytes + arg0;
}

That’s all of the information collection; now we just need to print out the information when the query completes. We do this by writing out a colon-separated list of the information that we’ve collected. One additional point here is that to calculate the duration of the query, you subtract the timestamp recorded when query-start fired from the current timestamp.

Timestamp information is recorded in nanoseconds (yes, you read that right, nanoseconds), so we divide it by a thousand to get it in microseconds, which is what the Enterprise Manager expects.

mysql*:::query-done
/self->query != NULL/
{
        printf("%s:%s:%d:%d:%d\n",
        self->query,
        self->db,
        ((timestamp - self->querystart)/1000),
        self->rows,self->bytes);
}

If you run this script on its own (against a MySQL server running on Solaris/OpenSolaris, with the probes built in, of course), then you’ll get output like this:

SELECT DATABASE()::391:1:44
show databases:test:947:2:84
show tables:test:2018:3:74
select * from t limit 5:test:595:5:51

To provide the information up to the Enterprise Manager we can’t use the D script alone. Instead, a wrapper around the D script will read the raw information produced and then pass it up to the Enterprise Manager.

Before we look at that process, it is worth looking at the REST API that has been built into v2 of the Enterprise Monitor. The interface is available through the standard URL for the Enterprise service, typically your hostname and port 18080 if you’ve used the default settings. Therefore we can access the interface using the URL http://nautilus:18080/v2/rest/, assuming our host is nautilus.

From the base URL, you can start to get information, or put information, about the different entries in the repository, using the path in the URL to signify what it is we are looking for. Information about instances lives under instance, with the provider as mysql and the MySQL server as server; or, better put, the base URL would be http://nautilus:18080/v2/rest/instance/mysql/server/.

The last fragment of information we need is the UUID. All objects within the repository have a unique ID, and these are split at different levels. For example, an agent has a UUID, and so does the server it is monitoring. In our example, we want the UUID of the MySQL server, which is conveniently stored within the server itself in the mysql.inventory table.
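If you want to pick that UUID up from the command line, something like the following should work; I’m assuming here that the agent has created the usual mysql.inventory table with name/value columns, so treat the exact query as illustrative:

$ mysql --batch --skip-column-names \
    -e "SELECT value FROM mysql.inventory WHERE name = 'uuid'"
2b86b277-fb2b-492d-b946-3a2acaec0869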

Finally, we need the username and password of the agent user. The REST API uses basic HTTP authentication, which keeps the process easy.

Putting all of this together, we can get the core information about an instance using wget:

$ wget -qO mysql.server --http-user=agent --http-password=password \
    'http://nautilus:18080/v2/rest/instance/mysql/server/2b86b277-fb2b-492d-b946-3a2acaec0869'

If we now look at the output file, mysql.server:

{
    "name": "2b86b277-fb2b-492d-b946-3a2acaec0869",
    "parent": "/instance/os/Host/ssh:{88:e1:fc:6d:99:69:e4:5f:b4:0a:ec:5a:09:c0:6a:24}",
    "values":     {
        "blackout": "false",
        "displayname": null,
        "registration-complete": "true",
        "repl.groupName": null,
        "server.connected": 1,
        "server.last_error": null,
        "server.reachable": 1,
        "transport": "a3113263-4993-4890-8235-cadef9617c4b",
        "visible.displayname": "bear:3306"
    }
}

I won’t go into detail about what is here; most of it should be self-explanatory. However, there are a few things of note. First, the information is in JSON format, which makes it easy to read and, more importantly, to create.

Second, note the notation. The item is identified by its name, and also by its parent. This is an important construct because it ties the different elements to each other. In this case, the MySQL server is associated with a physical host (/instance/os/Host), and the individual host is identified by an SSH key, which is one of the alternative UUID formats supported by the Enterprise Server to identify individual entities.

When submitting information, we need to flip the process around. We don’t use a GET request to obtain information; instead, we use a PUT to send up a JSON packet containing the information we want to record. The URL for sending the information depends on what we are uploading. The main element for the statements used by the Query Analyzer is the statementsummary.

The URL for this is http://nautilus:18080/v2/rest/instance/mysql/statementsummary/. For the identifier at the end of the URL, you use a period-separated list that includes the UUID of the MySQL server, the name of the MySQL database the SQL statement relates to, and an MD5 hash of the SQL statement text.
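To get a feel for how that identifier is assembled, here’s a rough shell sketch; the digest utility and the normalized query text are assumptions, but the UUID and database match the examples used elsewhere in this post:

UUID=2b86b277-fb2b-492d-b946-3a2acaec0869
DB=test
QUERY='select * from t limit ?'

# MD5 hash of the normalized statement text (using the Solaris digest utility)
MD5=`printf '%s' "$QUERY" | digest -a md5`

echo "http://nautilus:18080/v2/rest/instance/mysql/statementsummary/$UUID.$DB.$MD5"

As long as the statement text is identical (and has no trailing newline), the hash should match the one the Perl script below computes with md5_hex.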

For the actual packet, we use the following format, taken here from the Perl script:

{
    "name": "$server_uuid.$quanbase->{dbname}.$md5",
    "parent": "/instance/mysql/server/$server_uuid",
    "values" : {
	"count": "$quanbase->{count}",
	"text": "$quanbase->{query}",
	"query_type": "$quanbase->{qtype}",
	"text_hash": "$md5",
	"max_exec_time": "$quanbase->{max_exec_time}",
	"min_exec_time": "$quanbase->{min_exec_time}",
	"exec_time": "$quanbase->{exec_time}",
	"rows": "$quanbase->{rows}",
	"max_rows": "$quanbase->{max_rows}",
	"min_rows": "$quanbase->{min_rows}",
	"database": "$quanbase->{dbname}",
	"bytes": "$quanbase->{bytes}",
	"max_bytes": "$quanbase->{max_bytes}",
	"min_bytes": "$quanbase->{min_bytes}",
    }
}

Most of this should be self-explanatory. Remember that this is a statement summary, which means that we can send up information about multiple invocations of the same statement in one packet. Thus, within the statementsummary packet we have the count of invocations of the statement; the execution time, row, and byte totals, along with the maximum/minimum of each; and then the core information like the actual query text, database name, and query type (SELECT, INSERT, etc.).

Once again, note the name and parent. Here the name is the same tuple as used in the URL: the UUID of the MySQL server, the database, and the hash of the query. This is used as the identifier for this query within the repository and allows us to uniquely identify the query, and the query execution, on this server. The parent is the location, and UUID, of the MySQL server.

Now, the Perl script that collates the information from our D script has to do two things: first, read the raw output that we create with the D script; and second, send it up as a PUT request to the Enterprise Server.

Dealing with the latter part first, I’ve used Perl and the LWP (libwww-perl) module to construct a suitable request object with the HTTP authorization attached:

my $header = HTTP::Headers->new;
$header->content_type('text/text');
$header->authorization_basic('agent','password');
my $res = LWP::UserAgent->new();

Once we’ve constructed a packet, sending it is a case of specifying the URL, the header, and the content:

$header->content_length(length $bio);
my $req = HTTP::Request->new(PUT => $url, $header, $bio);

$res->request($req);

The bulk of the rest of the script is devoted to reading the information from the D script output, and assembling the packet and min/max values per query.

Within the Query Analyzer, the SQL statements are normalized, or canonicalized, so that variables are replaced with a question mark. This ensures that we are tracking the query, not the individual values. The significance is that we want to compare the raw SQL statement, of which there may only be a few hundred in a typical application, not each individual query with its WHERE and other clauses.

Hence, the statement:

SELECT photoid,title from media_photos where photoid > 23785 limit 15

Would be normalized to:

SELECT photoid,title from media_photos where photoid > ? limit ?

For the Perl script, I do just one type of normalization, removing the value from a LIMIT clause.
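If you wanted to go a little further than the LIMIT-only rewrite, a heavier-handed sketch (replacing quoted strings and bare numbers with placeholders) could be as simple as the one-liner below; it’s only an approximation of the canonicalization that the Proxy-based Query Analyzer performs:

$ echo "SELECT photoid,title from media_photos where photoid > 23785 limit 15" | \
    perl -pe "s/'[^']*'/?/g; s/\b\d+\b/?/g"
SELECT photoid,title from media_photos where photoid > ? limit ?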

#!/usr/bin/perl
use Data::Dumper;
use LWP;
use HTTP::Request;
use Digest::MD5 qw/md5_hex/;

my $server_uuid = '2b86b277-fb2b-492d-b946-3a2acaec0869';

my $header = HTTP::Headers->new;
$header->content_type('text/text');
$header->authorization_basic('agent','password');
my $res = LWP::UserAgent->new();

my $interval = shift || 20;

print "Sending queries every $interval statement(s)\n";

open(DTRACE,"./merlin.d|") or die "Couldn't open DTRACE\n";

my $counter = 1;
my $querybase = {};

while(<DTRACE>)
{
    chomp;
    my ($origquery,$dbname,$time,$rows,$bytes) = split m{:};

    my $query = $origquery;
    $query =~ s/limit \d+/limit ?/g;

    # Note whether we have seen this (normalized) query before, so that
    # the min/max values below get initialized the first time around
    my $seen = exists($querybase->{$query});

    $querybase->{$query}->{dbname} = $dbname;
    $querybase->{$query}->{query} = $query;
    $querybase->{$query}->{count}++;
    $querybase->{$query}->{rows} += $rows;
    $querybase->{$query}->{bytes} += $bytes;
    $querybase->{$query}->{exec_time} += $time;

    # Derive the query type (SELECT, INSERT, etc.) from the first word of the statement
    $querybase->{$query}->{qtype} = uc(($query =~ m/^\s*(\w+)/)[0]);

    if ($seen)
    {
        $querybase->{$query}->{max_rows} = $rows if ($rows > $querybase->{$query}->{max_rows});
        $querybase->{$query}->{min_rows} = $rows if ($rows < $querybase->{$query}->{min_rows});
        $querybase->{$query}->{max_bytes} = $bytes if ($bytes > $querybase->{$query}->{max_bytes});
        $querybase->{$query}->{min_bytes} = $bytes if ($bytes < $querybase->{$query}->{min_bytes});
        $querybase->{$query}->{max_exec_time} = $time if ($time > $querybase->{$query}->{max_exec_time});
        $querybase->{$query}->{min_exec_time} = $time if ($time < $querybase->{$query}->{min_exec_time});
    }
    else
    {
        $querybase->{$query}->{max_rows} = $rows;
        $querybase->{$query}->{min_rows} = $rows;
        $querybase->{$query}->{max_bytes} = $bytes;
        $querybase->{$query}->{min_bytes} = $bytes;
        $querybase->{$query}->{max_exec_time} = $time;
        $querybase->{$query}->{min_exec_time} = $time;
    }

    if (($counter % $interval) == 0)
    {
	print STDERR "Writing quan packets ($counter queries sent)\n";
        foreach my $query (keys %{$querybase})
        {
            send_quandata($querybase->{$query});
	    delete($querybase->{$query});
        }
    }
    $counter++;
}

sub send_quandata
{
    my ($quanbase) = @_;

    my $urlbase = 'http://nautilus:18080/v2/rest/instance/mysql/statementsummary/%s.%s.%s';

    my $md5 = md5_hex($quanbase->{query});
    my $url = sprintf($urlbase,$server_uuid,$quanbase->{dbname},$md5);

    my $bio = <<EOF;
{
    "name": "$server_uuid.$quanbase->{dbname}.$md5",
    "parent": "/instance/mysql/server/$server_uuid",
    "values" : {
        "count": "$quanbase->{count}",
        "text": "$quanbase->{query}",
        "query_type": "$quanbase->{qtype}",
        "text_hash": "$md5",
        "max_exec_time": "$quanbase->{max_exec_time}",
        "min_exec_time": "$quanbase->{min_exec_time}",
        "exec_time": "$quanbase->{exec_time}",
        "rows": "$quanbase->{rows}",
        "max_rows": "$quanbase->{max_rows}",
        "min_rows": "$quanbase->{min_rows}",
        "database": "$quanbase->{dbname}",
        "bytes": "$quanbase->{bytes}",
        "max_bytes": "$quanbase->{max_bytes}",
        "min_bytes": "$quanbase->{min_bytes}",
    }
}
EOF

    $header->content_length(length $bio);
    my $req = HTTP::Request->new(PUT => $url, $header, $bio);

    $res->request($req);
}

The basic structure is:

  1. Open the DTrace script
  2. Read a line
  3. Add that to the temporary list of queries I know about, adding stats
  4. When I’ve read N queries, send up the stats about each query as a JSON packet to the Enterprise Manager
  5. Repeat

Depending on how busy your server is, you may want to adjust the interval at which the stats data is uploaded. The default is every 20 queries, but when running on a really busy server, or when running benchmarks, you might want to increase that to prevent the script from spending too much time sending fairly small packets of stats.

If you run the script, it should just work in the background:

$ ./dtrace_merlin.pl
Sending queries every 20 statement(s)
Writing quan packets (20 queries sent)
Writing quan packets (40 queries sent)
Writing quan packets (60 queries sent)
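On a busier server you can pass a larger interval as the first argument (the script defaults to 20 statements per upload):

$ ./dtrace_merlin.pl 100
Sending queries every 100 statement(s)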

That’s it!

I set this up and then sent some random queries to the server. The following graphic shows the query data sourced solely from the DTrace information.

There are some limitations to the current script. I don’t do full normalization, for example, and I don’t send the detailed information about individual statements up at the moment. There is also an EXPLAIN packet that you can send, containing the output from an EXPLAIN on a long-running query. I could do that by opening a connection to the server and picking out the information.

But what I’d really like to do is use the DTrace-based output to show the detail of each part of the query process and the EXPLAIN output. I’m sure I can work on that with the Enterprise team.

MySQL on OpenSolaris Presentation/Transcript Now Available

As I mentioned earlier this week, I did a presentation on MySQL in OpenSolaris today.

The presentation (audio and slides) is now viewable online (and downloadable), and you can also get hold of the transcript of the questions: here (or download). The original presentation is here.

One minor difference from the presentation is that we have upgraded MySQL to 5.0.67 in 2008.11. I had forgotten we’d agreed to do this after the 5.1 pushback. Thanks to Matt Lord for the heads up.

And thanks to everybody for attending. Up next week, memcached!

MySQL University: MySQL and OpenSolaris

On Thursday, November 13, 2008 (14:00 UTC / 14:00 GMT / 15:00 CET), I’ll be presenting a MySQL University session on MySQL and OpenSolaris.

The presentation will be similar to the one I did at the London OpenSolaris User Group in July; you can see that presentation by visiting the LOSUG: July 2008 page.

The presentation on Thursday will be slightly different – I’ll be providing a bit more hands-on information about how to install MySQL, how to configure and change the configuration, and some more detail on solutions like the Webstack and Coolstack distributions.

I’ll also cover our plans for the inclusion of MySQL 5.1 in OpenSolaris, which will happen next year, and provide some examples on the new DTrace probes that we have been adding to MySQL generally.

Of course, if there’s anything specific you want me to talk about, comment here and I’ll see if I can squeeze it into the presentation before Thursday.

Compiling MySQL Workbench on Gentoo

The Workbench team have just announced the release of Workbench for Linux, including binary packages and source packages with instructions on how to build.

I’m a Gentoo Linux user, so I prefer building from source. You’ll need to emerge the following packages (and note the USE flag requirement) as part of the source build process:

# USE="svg" emerge libzip libxml2 libsigc++ \
    libglade libgtksourceviewmm media-libs/glut mysql lua \
    ossp-uuid libpcre libgnome gtk+ pango cairo

Depending on your config and platform, you may need to bypass some package masking by adding the packages to your /etc/portage/package.keywords file.
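For example, unmasking one of the libraries on an x86 machine might look like the line below; the exact package atom and keyword depend on your Portage tree and architecture, so treat it purely as an illustration:

# echo "dev-cpp/libgtksourceviewmm ~x86" >> /etc/portage/package.keywords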

Then download and install the ctemplate library from its Google Code page. The current Gentoo version is 0.90, but you really should install the 0.91 version.

With the required packages and libraries in place, download the Workbench sources and then build:

# cd mysql-workbench-5.1.4alpha
# ./autogen.sh
# make
# make install

That should build and install MySQL Workbench for you.

Just to confirm, here’s a screenshot of the built Workbench running on Gentoo Linux and displaying to my Mac OS X-based desktop.

ZFS Replication for MySQL data

At the European Customer Conference a couple of weeks back, one of the topics was the use of DRBD. DRBD is a kernel-based block device that replicates the data blocks of a device from one machine to another. The documentation I developed for that and MySQL is available here.

Fundamentally, with DRBD, you set up a physical device, configure DRBD on top of that, and write to the DRBD device. In the background, on the primary, the DRBD device writes the data to the physical disk and replicates the changed blocks to the secondary, which in turn writes the data to its physical device. The result is a block-level copy of the source data, which in an HA solution means that you can switch over from your primary host to your secondary host in the event of a system failure and be pretty certain that the data on the primary and secondary are the same.

In short, DRBD simplifies one of the more complex aspects of the typical HA solution by copying the data needed during the switch. Because DRBD is a Linux Kernel module you can’t use it on other platforms, like Mac OS X or Solaris. But there is another solution: ZFS.

ZFS supports filesystem snapshots. You can create a snapshot at any time, and you can create as many snapshots as you like.

Let’s take a look at a typical example. Below I have a simple OpenSolaris system running with two pools, the root pool and another pool mounted at /opt:

Filesystem             size   used  avail capacity  Mounted on
rpool/ROOT/opensolaris-1
                       7.3G   3.6G   508M    88%    /
/devices                 0K     0K     0K     0%    /devices
/dev                     0K     0K     0K     0%    /dev
ctfs                     0K     0K     0K     0%    /system/contract
proc                     0K     0K     0K     0%    /proc
mnttab                   0K     0K     0K     0%    /etc/mnttab
swap                   465M   312K   465M     1%    /etc/svc/volatile
objfs                    0K     0K     0K     0%    /system/object
sharefs                  0K     0K     0K     0%    /etc/dfs/sharetab
/usr/lib/libc/libc_hwcap1.so.1
                       4.1G   3.6G   508M    88%    /lib/libc.so.1
fd                       0K     0K     0K     0%    /dev/fd
swap                   466M   744K   465M     1%    /tmp
swap                   465M    40K   465M     1%    /var/run
rpool/export           7.3G    19K   508M     1%    /export
rpool/export/home      7.3G   1.5G   508M    75%    /export/home
rpool                  7.3G    60K   508M     1%    /rpool
rpool/ROOT             7.3G    18K   508M     1%    /rpool/ROOT
opt                    7.8G   1.0G   6.8G    14%    /opt

I’ll store my data in a directory on /opt. To help demonstrate some of the basic replication stuff, I have other things stored in /opt as well:

total 17
drwxr-xr-x  31 root     bin           50 Jul 21 07:32 DTT/
drwxr-xr-x   4 root     bin            5 Jul 21 07:32 SUNWmlib/
drwxr-xr-x  14 root     sys           16 Nov  5 09:56 SUNWspro/
drwxrwxrwx  19 1000     1000          40 Nov  6 19:16 emacs-22.1/
lrwxrwxrwx   1 root     root          48 Nov  5 09:56 uninstall_Sun_Studio_12.class -> SUNWspro/installer/uninstall_Sun_Studio_12.class

To create a snapshot of the filesystem, you use zfs snapshot, and then specify the pool and the snapshot name:

# zfs snapshot opt@snap1

To get a list of snapshots you’ve already taken:

# zfs list -t snapshot
NAME                                         USED  AVAIL  REFER  MOUNTPOINT
opt@snap1                                       0      -  1.03G  -
rpool@install                               19.5K      -    55K  -
rpool/ROOT@install                            15K      -    18K  -
rpool/ROOT/opensolaris-1@install            59.8M      -  2.22G  -
rpool/ROOT/opensolaris-1@opensolaris-1       100M      -  2.29G  -
rpool/ROOT/opensolaris-1/opt@install            0      -  3.61M  -
rpool/ROOT/opensolaris-1/opt@opensolaris-1      0      -  3.61M  -
rpool/export@install                          15K      -    19K  -
rpool/export/home@install                     20K      -    21K  -

The snapshots themselves are stored within the filesystem metadata, and the space required to keep them will vary over time because of the way the snapshots are created. The initial creation of a snapshot is really quick, because instead of taking an entire copy of the data and metadata required to hold the entire snapshot, ZFS merely records the point in time and the metadata of when the snapshot was created.

As you make more changes to the original filesystem, the size of the snapshot increases because more space is required to keep the record of the old blocks. Furthermore, if you create lots of snapshots, say one per day, and then delete the snapshots from earlier in the week, the size of the newer snapshots may also increase, as the changes that make up the newer state have to be included in the more recent snapshots, rather than being spread over the seven snapshots that make up the week.

The result is that creating snapshots is generally very fast, and storing snapshots is very efficient. As an example, creating a snapshot of a 40GB filesystem takes less than 20ms on my machine.

The only issue, from a backup perspective, is that snapshots exist within the confines of the original filesystem. To get a snapshot out into a format that you can copy to another filesystem, tape, etc., you use the zfs send command to create a stream version of the snapshot.

For example, to write out the snapshot to a file:

# zfs send opt@snap1 >/backup/opt-snap1

Or tape, if you are still using it:

# zfs send opt@snap1 >/dev/rmt/0

You can also write out the incremental changes between two snapshots using zfs send:

# zfs send -i opt@snap1 opt@snap2 >/backup/opt-changes

To recover a snapshot, you use zfs recv, which applies the snapshot information either to a new filesystem or to an existing one. I’ll skip the demo of this for the moment, because it will make more sense in the context of what we’ll do next.

Both zfs send and zfs recv work on streams of the snapshot information, in the same way as cat or sed do. We’ve already seen some examples of that when we used standard redirection to write the information out to a file.

Because they are stream based, you can use them to replicate information from one system to another by combining zfs send, ssh, and zfs recv.

For example, let’s say I’ve created a snapshot of my opt filesystem and want to copy that data to a new system into a pool called slavepool:

# zfs send opt@snap1 |ssh mc@slave pfexec zfs recv -F slavepool

The first part, zfs send opt@snap1, streams the snapshot; the second, ssh mc@slave, passes the stream to the slave machine; and the third, pfexec zfs recv -F slavepool, receives the streamed snapshot data and writes it to slavepool. In this instance, I’ve specified the -F option, which forces the snapshot data to be applied and is therefore destructive. This is fine, as I’m creating the first version of my replicated filesystem.

On the slave machine, if I look at the replicated filesystem:

# ls -al /slavepool/
total 23
drwxr-xr-x   6 root     root           7 Nov  8 09:13 ./
drwxr-xr-x  29 root     root          34 Nov  9 07:06 ../
drwxr-xr-x  31 root     bin           50 Jul 21 07:32 DTT/
drwxr-xr-x   4 root     bin            5 Jul 21 07:32 SUNWmlib/
drwxr-xr-x  14 root     sys           16 Nov  5 09:56 SUNWspro/
drwxrwxrwx  19 1000     1000          40 Nov  6 19:16 emacs-22.1/
lrwxrwxrwx   1 root     root          48 Nov  5 09:56 uninstall_Sun_Studio_12.class -> SUNWspro/installer/uninstall_Sun_Studio_12.class

Wow - that looks familiar!

Once you’ve taken that first snapshot, synchronizing the filesystem again is just a matter of creating a new snapshot and then using the incremental feature of zfs send to send the changes over to the slave machine:

# zfs send -i opt@snap1 opt@snap2 |ssh mc@192.168.0.93 pfexec zfs recv slavepool

Actually, this operation will fail. The reason is that the filesystem on the slave machine can currently be modified, and you can’t apply incremental changes to a destination filesystem that has changed. What’s changed? The metadata about the filesystem, such as the last time it was accessed – in this case, it will have been our ls that caused the problem.

To fix that, set the filesystem on the slave to be read-only:

# zfs set readonly=on slavepool

Setting readonly means that we can’t change the filesystem on the slave by normal means – that is, we can’t change the files or metadata (modification times and so on). It also means that operations that would normally update the metadata (like our ls) silently perform their function without attempting to update the filesystem state.
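You can confirm that the property has taken effect with zfs get, which should report something like:

# zfs get readonly slavepool
NAME       PROPERTY  VALUE   SOURCE
slavepool  readonly  on      local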

In essence, our slave filesystem is nothing but a static copy of our original filesystem. However, even when set to read-only, a filesystem can have snapshots applied to it. Now that it’s read-only, re-run the initial copy:

# zfs send opt@snap1 |ssh mc@slave pfexec zfs recv -F slavepool

Now we can make changes to the original and replicate them over. Since we’re dealing with MySQL, let’s initialize a database on the original pool. I’ve updated the configuration file to use /opt/mysql-data as the data directory, and now I can initialize the tables:

# mysql_install_db --defaults-file=/etc/mysql/5.0/my.cnf --user=mysql

Now, we can synchronize the information to our slave machine and filesystem by creating another snapshot and then doing an incremental zfs send:

# zfs snapshot opt@snap2

Just to demonstrate the efficiency of the snapshots, the size of the data created during initialization is 39K:

# du -sh /opt/mysql-data/
  39K	/opt/mysql-data

If I check the size used by the snapshots:

# zfs list -t snapshot
NAME                                         USED  AVAIL  REFER  MOUNTPOINT
opt@snap1                                     47K      -  1.03G  -
opt@snap2                                       0      -  1.05G  -

The size of the snapshot is 47K. Note, by the way, that the 47K is recorded against snap1, because snap2 should currently be more or less identical to our current filesystem state.

Now, let’s synchronize this over:

# zfs send -i opt@snap1 opt@snap2|ssh mc@192.168.0.93 pfexec zfs recv slavepool

Note we don’t have to force the operation this time - we’re synchronizing the incremental changes from what are identical filesystems, just on different systems.

And double check that the slave has it:

# ls -al /slavepool/mysql-data/

Now we can start up MySQL, create some data, and then synchronize the information over again, replicating the changes. To do that, you have to create a new snapshot, then do the send/recv to the slave to synchronize the changes.

The rate at which you do it is entirely up to you, but keep in mind that if you have a lot of changes, then doing it as frequently as once a minute may lead to the slave falling behind because of the time taken to transfer the filesystem changes over the network. Taking the snapshot itself, even with MySQL running in the background, still takes comparatively little time.

To demonstrate that, here’s the time taken to create a snapshot mid-way through a 4 million row insert into an InnoDB table:

# time zfs snapshot opt@snap3

real    0m0.142s
user    0m0.006s
sys     0m0.027s

I told you it was quick :)

However, the send/recv operation took a few minutes to complete, with about 212MB of data transferred over a very slow network connection, and the machine was busy writing those additional records.

Ideally you want to set up a simple script that will handle that sort of snapshot/replication for you and run it via cron. You might also want to try ready-made tools like Tim Foster’s ZFS replication tool, which you can find out about here. Tim’s system works through SMF to handle the replication and is very configurable. It even handles automatic deletion of old, synchronized snapshots.
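If you want to roll your own before reaching for Tim’s tool, a minimal (and deliberately naive) sketch of a snapshot-and-send script, driven from cron, might look like this; the snapshot naming, state file, slave address, and complete lack of error handling are all assumptions you would want to harden:

#!/bin/sh
# replicate-opt.sh - naive incremental ZFS replication sketch
POOL=opt
SLAVE=mc@192.168.0.93
STATE=/var/tmp/last-opt-snap

PREV=`cat $STATE 2>/dev/null`
NEW=snap-`date '+%Y%m%d%H%M%S'`

# Taking the snapshot is fast, even with MySQL busy in the background
zfs snapshot $POOL@$NEW

if [ -n "$PREV" ]; then
    # Send only the changes since the previous snapshot
    zfs send -i $POOL@$PREV $POOL@$NEW | ssh $SLAVE pfexec zfs recv slavepool
else
    # First run: send the whole snapshot
    zfs send $POOL@$NEW | ssh $SLAVE pfexec zfs recv -F slavepool
fi

echo $NEW > $STATE

A crontab entry to run it every 15 minutes might then be:

0,15,30,45 * * * * /usr/local/bin/replicate-opt.sh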

Of course, all of this is useless unless, once it has been replicated from one machine to another, we can actually use the databases. Let’s assume that there was a failure and we needed to fail over to the slave machine. To do so:

  1. Stop the script on the master, if it’s still up and running.
  2. Set the slave filesystem to be read/write:
    # zfs set readonly=off slavepool 
  3. Start up mysqld on the slave. If you are using InnoDB, Falcon or Maria you should get auto-recovery, if it’s needed, to make sure the table data is correct, as shown here when I started up from our mid-INSERT snapshot:
    InnoDB: The log sequence number in ibdata files does not match
    InnoDB: the log sequence number in the ib_logfiles!
    081109 15:59:59  InnoDB: Database was not shut down normally!
    InnoDB: Starting crash recovery.
    InnoDB: Reading tablespace information from the .ibd files...
    InnoDB: Restoring possible half-written data pages from the doublewrite
    InnoDB: buffer...
    081109 16:00:03  InnoDB: Started; log sequence number 0 1142807951
    081109 16:00:03 [Note] /slavepool/mysql-5.0.67-solaris10-i386/bin/mysqld: ready for connections.
    Version: '5.0.67'  socket: '/tmp/mysql.sock'  port: 3306  MySQL Community Server (GPL)
    

Yay - we’re back up and running. With MyISAM or other table types, you may need to run REPAIR TABLE, and you might even have lost some information, but it should be minor.

The point is, a mid-INSERT ZFS snapshot, combined with replication, could be a good way of supporting a hot-backup of your system on Mac OS X or Solaris/OpenSolaris.

Probably the most critical part is finding the sweet spot between the snapshot replication time and how up to date you want to be in a failure situation. It’s also worth pointing out that you can replicate to as many different hosts as you like, so if you wanted to replicate your ZFS data to two or three hosts, you could.
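A rough sketch of fanning an incremental snapshot out to several slaves (the host names here are placeholders) is just a loop around the same send/recv pipeline:

for host in slave1 slave2 slave3; do
    zfs send -i opt@snap1 opt@snap2 | ssh mc@$host pfexec zfs recv slavepool
done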

MySQL on Solaris Best Practices Presentation

A couple of weeks ago I was at the MySQL European Customer Conference in London, where I presented my talk on best practices for deploying MySQL on Solaris. You can download a copy of the presentation here: MySQL on Solaris Best Practices.

I cover choosing the best release version and using tricks like mtmalloc (the threaded malloc library), before moving on to UFS and ZFS tricks, using DTrace, and MySQL Cluster and Sun Cluster.