Category Archives: continuent

Cross your Fingers for Tech14, see you at OSCON

So I’ve submitted my talks for the Tech14 UK Oracle User Group conference which is in Liverpool this year. I’m not going to give away the topics, but you can imagine they are going to be about data translation and movement and how to get your various databases talking together.

I can also say, after having seen other submissions for talks this year (as I’m helping to judge), that the conference is shaping up to be very interesting. There’s a good spread of different topics this year, but I know from having talked to the organisers that they are looking for more submissions in the areas of Operating Systems, Engineered Systems and Development (mobile and cloud).

If you’ve got a paper, presentation, or idea for one that you think would be useful, please go ahead and submit your idea.

I’m also pleased to say that I’ll be at OSCON in Oregon in July, handling a Birds of a Feather (BOF) session on the topic of exchanging data between MySQL, Oracle and Hadoop. I’ll be there with my good friend Eric Herman from Booking.com, and we’ll be providing advice, guidance and experiences, and hoping to exchange more ideas, wishes and requirements for heterogeneous environments.

It’d be great to meet you if you want to come along to either conference.

 

 

Revisiting ZFS and MySQL

While at Percona Live this year I was reminded about ZFS and running MySQL on top of a ZFS-based storage platform.

Now I’m a big fan of ZFS (although sadly I don’t get to use it as much as I used to after I shut down my home server farm), and I did a lot of different testing back while at MySQL to ensure that MySQL, InnoDB and ZFS worked correctly together.

Of course today we have a completely new range of ZFS-compatible environments, not least of which are FreeBSD and ZFS on Linux, so I think it’s time to revisit some of my original advice on using this combination.

Unfortunately the presentations and MySQL University sessions back then have all been taken down. But that doesn’t mean the advice is any less valid.

Some of the core advice for using InnoDB on ZFS:

  • Configure a single InnoDB tablespace, rather than configuring multiple tablespaces across different disks, and then let ZFS manage the underlying disk using stripes or mirrors or whatever configuration you want. This avoids you having to restart or reconfigure your tablespaces as your data grows, and moves that out to ZFS which can do it much more easily and while the filesystem and database remain online. That means we can do:
innodb_data_file_path = /zpool/data/ibdatafile:10G:autoextend
  • While we’re talking about the InnoDB data files, the best optimisation you can do is to set the ZFS record size to match the InnoDB page size (16K by default). You should do this *before* you start writing data. That means creating the filesystem and then setting the record size:
zfs set recordsize=16K zpool/data
  • What you can also do is configure a separate filesystem for the InnoDB logs that has a ZFS record size of 128K. That’s less relevant in later versions of ZFS, but it does no harm.
  • Switch on compression. Within ZFS this reduces the amount of data physically read from and written to disk, and therefore improves overall I/O times. The compression is efficient and transparent enough to handle the load while still reducing the overall time.
  • Disable the double-write buffer. The transactional nature of ZFS helps to ensure the validity of data written down to disk, so we don’t need two copies of the data to be written to ensure valid recovery from the failures that are normally caused by partial writes of the record data. The performance gain is small, but worth it.
  • Using direct I/O (innodb_flush_method = O_DIRECT in your my.cnf) also improves performance for similar reasons. We can be sure with direct writes in ZFS that the information is written down to the right place. EDIT: Thanks to Yves for pointing out that this is not currently supported on Linux/ZFS.
  • Limit the Adaptive Replacement Cache (ARC); without doing this you can end up with ZFS using a lot of cache memory that would be better used at the database level for caching record information. We don’t need the block data cache as well.
  • Configure a separate ZFS Intent Log (ZIL), really a Separate Intent Log (SLOG) – if you are not using SSD throughout, this is a great place to use SSD to speed up your overall disk I/O performance. Using SLOG stores immediate writes out to SSD, enabling ZFS to manage the more effective block writes of information out to slower spinning disks. The real difference is that this lowers disk writes, lowers latency, and lowers overall spinning disk activity, meaning they will probably last longer, not to mention making your system quieter in the process. For the sake of $200 of SSD, you could double your performance and get an extra year or so out of the disks.
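
To pull the list above together, here is a rough Python sketch that just prints the ZFS commands and my.cnf lines the advice implies. It is an illustration under assumptions, not a tested deployment script: the pool and filesystem names, the default mountpoints (/zpool/data and /zpool/logs) and the SLOG device path are placeholders, and page_size_kb defaults to 16 because that is InnoDB's default page size, so pass whatever your server actually uses. Limiting the ARC is left out because the setting is platform specific.

```python
def zfs_innodb_plan(pool="zpool", data_fs="data", log_fs="logs",
                    page_size_kb=16, slog_device=None, use_o_direct=False):
    """Return (shell_commands, my_cnf_lines) reflecting the advice above.

    All names and paths are placeholders; nothing is executed here.
    """
    data = f"{pool}/{data_fs}"
    logs = f"{pool}/{log_fs}"
    commands = [
        f"zfs create {data}",
        # Set the record size before any data is written to the filesystem.
        f"zfs set recordsize={page_size_kb}K {data}",
        f"zfs set compression=on {data}",
        # Separate filesystem for the InnoDB logs with a 128K record size.
        f"zfs create {logs}",
        f"zfs set recordsize=128K {logs}",
    ]
    if slog_device:
        # Dedicated log device (SLOG), ideally on SSD.
        commands.append(f"zpool add {pool} log {slog_device}")

    my_cnf = [
        "[mysqld]",
        # An empty home dir lets the data file path below be absolute.
        "innodb_data_home_dir =",
        f"innodb_data_file_path = /{data}/ibdatafile:10G:autoextend",
        f"innodb_log_group_home_dir = /{logs}",
        "innodb_doublewrite = 0",
    ]
    if use_o_direct:
        # Off by default: see the note above about Linux/ZFS support.
        my_cnf.append("innodb_flush_method = O_DIRECT")
    return commands, my_cnf


if __name__ == "__main__":
    cmds, cnf = zfs_innodb_plan(slog_device="/dev/sdX")  # placeholder device
    print("\n".join(cmds))
    print()
    print("\n".join(cnf))
```

Printing the plan rather than running it keeps the sketch safe to try on any machine; review the output against your own platform before using any of it.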

Surprisingly, not much has changed in these key rules; perhaps the biggest difference is the change in price of SSD between when I wrote these original rules and today. SSD is cheap(er) today, so many people can afford SSD as their main disk, rather than their storage format, especially if you are building serious machines.


Tungsten Replicator 3.0 is Cloudera Enterprise 5 Certified

One of the key platforms I’ve been testing on for the MySQL to Hadoop replication has been Cloudera, largely driven by customer requirements, but it’s also one of the easiest ways to get started with Hadoop.

What I’m even more pleased about is that we can announce that Tungsten Replicator 3.0 is certified for use on the new Cloudera Enterprise 5 platform. That means we’re confident that you can replicate your data from MySQL to Cloudera 5 and have it work without causing problems or difficulties in the Hadoop loading and materialisation.

Cloudera is a great product, and we’re very happy to be working so effectively with the new Cloudera Enterprise 5. Cloudera certainly makes the core operation of managing and monitoring your Hadoop cluster so much easier, while still providing core functionality from the Hadoop family like Hive, HBase and Impala.

What I’m really interested in is the support for Spark, which will allow much easier live-querying and access to data.  That should make some data processing and live data views much easier to build and query further down the line.


Continuent Replication to Hadoop – Now in Stereo!

Hopefully by now you have already seen that we are working on Hadoop replication. I’m happy to say that it is going really well. I’ve managed to push a few terabytes of data and different data sets through into Hadoop on Cloudera, HortonWorks, and Amazon’s Elastic MapReduce (EMR). For those who have been following my long association with the IBM InfoSphere BigInsights Hadoop product, I’m pleased to say that it’s working there too. I’ve had to adapt Robert’s original script to work with the different versions of the underlying Hadoop tools and systems to make it compatible. The actual performance and process is unchanged; you just use a different JS-based batchloader script to work with different tools.

Robert has also been simplifying some of the core functionality, such as configuring some fixed pre-determined formats, so you no longer have to explicitly set the field and record separators.

I’ve also been testing the key feature of being able to integrate the provisioning of information using Sqoop, merging that original Sqooped data into Hadoop, and then following up with the change data that the replicator is transferring over. The system works exactly as I’ve just described – start the replicator, Sqoop the data, materialise the view within Hadoop. It’s that easy; in fact, if you want a deeper demonstration of all of these features, we’ve got a video from my recent webinar session:

Real Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0
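
If it helps to picture the "Sqoop the data, then materialise the view" step, here is a small Python sketch of the general idea: take the provisioned snapshot and apply the change records on top, in sequence order, so the latest version of each row wins. This is purely illustrative and is not how the replicator or the materialisation scripts are implemented; the record layout (seqno, op, row) and the op codes I/U/D are assumptions made up for the example.

```python
def materialise(base_rows, changes, key="id"):
    """Apply ordered change records to a provisioned snapshot.

    base_rows: list of dicts (the originally provisioned data).
    changes:   list of dicts with hypothetical fields:
               "seqno" (ordering), "op" ("I", "U" or "D"), "row" (dict).
    Returns the current view of the table, ordered by primary key.
    """
    # Start from the snapshot, indexed by primary key.
    view = {row[key]: dict(row) for row in base_rows}

    # Replay changes in sequence order so later changes override earlier ones.
    for change in sorted(changes, key=lambda c: c["seqno"]):
        row_key = change["row"][key]
        if change["op"] == "D":
            view.pop(row_key, None)
        else:
            # Insert or update: the newest version of the row wins.
            view[row_key] = dict(change["row"])

    return [view[k] for k in sorted(view)]


if __name__ == "__main__":
    snapshot = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
    change_data = [
        {"seqno": 3, "op": "D", "row": {"id": 2}},
        {"seqno": 2, "op": "U", "row": {"id": 1, "name": "z"}},
        {"seqno": 4, "op": "I", "row": {"id": 3, "name": "c"}},
    ]
    for row in materialise(snapshot, change_data):
        print(row)
```

In a real Hadoop setup this kind of merge would be expressed as a job over files rather than an in-memory dictionary, but the ordering rule is the same.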

If you can’t spare the time, but still want to know about our Hadoop applier, try our short 5-minute video:

Real-time data loading into Hadoop with Tungsten Replicator

While you’re there, check out the Clustering video I did at the same time:

Continuent Tungsten Clustering

And of course, don’t forget that you can see the product and demos live by attending Percona Live in Santa Clara this week (1st-4th April).


Real-Time Data Loading from MySQL to Hadoop using Tungsten Replicator 3.0 Webinar

To follow up and describe some of the methods and techniques behind replicating into Hadoop from MySQL in real-time, and how this can be combined into your data workflow, Continuent are running a webinar, presented by me, that will go over the details and provide a demo of the data replication process.

Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0

Hadoop is an increasingly popular means of analyzing transaction data from MySQL. Up until now mechanisms for moving data between MySQL and Hadoop have been rather limited. The new Continuent Tungsten Replicator 3.0 provides enterprise-quality replication from MySQL to Hadoop. Tungsten Replicator 3.0 is 100% open source, released under a GPL V2 license, and available for download at https://code.google.com/p/tungsten-replicator/. Continuent Tungsten handles MySQL transaction types including INSERT/UPDATE/DELETE operations and can materialize binlogs as well as mirror-image data copies in Hadoop. Continuent Tungsten also has the high performance necessary to load data from busy source MySQL systems into Hadoop clusters with minimal load on source systems as well as Hadoop itself.

This webinar covers the following topics:

- How Hadoop works and why it’s useful for processing transaction data from MySQL
- Setting up Continuent Tungsten replication from MySQL to Hadoop
- Transforming MySQL data within Hadoop to enable efficient analytics
- Tuning replication to maximize performance.

You do not need to be an expert in Hadoop or MySQL to benefit from this webinar. By the end listeners will have enough background knowledge to start setting up replication between MySQL and Hadoop using Continuent Tungsten.

You can join the webinar on 27th March (Thursday), 10am PDT, 1pm EDT, or 5pm GMT by registering here: https://www1.gotomeeting.com/register/225780945

 

 


Parallel Extractor for Provisioning

Coming up as a new feature in Tungsten Replicator (and written by our replicator expert Stephane Giron) is the ability to provision a new database by using data from an existing database. This new feature comes in the form of a tool called the Parallel Extractor.

The principles are very simple. On the master side:

  • Start the master replicator offline.
  • Switch the replicator to the online provision state.
  • The master replicator pulls the data out of the existing database and writes that information into the Transaction History Log (THL). At this point, the normal replicator thread is not extracting events from the source database.
  • Once the parallel extraction has completed, the replicator switches over to normal extraction mode, and starts writing change data into the THL.

On the slave side, the THL events are read as usual from the master and applied to the slave, but because the provisioned data is inserted into the start of the THL before the main THL thread, the slave reads the provisioned data first, then the data changes that occurred since the provisioning started.

In fact, it’s best to think of it like the diagram below:

Parallel Extractor Blog THL

The parallel extraction happens in a very specific fashion:

A chunking thread identifies all the tables, and also identifies the keys and chunks that can be extracted from each table. It then coordinates the multiple threads:

  • Multiple chunks from the source tables are extracted in parallel.
  • Multiple tables are extracted in parallel.

Because both of these operations happen at the same time, the parallel extractor can pull from multiple tables and multiple chunks, meaning that the actual extraction of the data happens very quickly. In fact, tests are running at a rate of about 80 million rows/15 mins. That was from a single table.

Parallel Extractor Blog Figure

Obviously the number of parallel threads can be controlled, and in fact, the chunking is controlled further by use of a configuration file to determine the chunking configuration.
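
To make the chunking idea a bit more concrete, here is a minimal Python sketch of the general technique: split each table's key range into chunks, then hand all the chunks from all the tables to a pool of worker threads. It is not the Parallel Extractor's code or configuration format. The function names, the integer key ranges and the fetch_rows callable (standing in for a query against the source database) are all assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def make_chunks(table, min_key, max_key, chunk_size):
    """Split an inclusive integer key range into (table, low, high) chunks."""
    return [
        (table, low, min(low + chunk_size - 1, max_key))
        for low in range(min_key, max_key + 1, chunk_size)
    ]


def extract_chunk(chunk, fetch_rows):
    """Extract one chunk; fetch_rows(table, low, high) is a stand-in query."""
    table, low, high = chunk
    return chunk, fetch_rows(table, low, high)


def parallel_extract(tables, fetch_rows, chunk_size=1000, threads=4):
    """Extract every chunk of every table using a fixed number of threads.

    tables maps a table name to its (min_key, max_key) range.
    Results come back in chunk order, although the work runs concurrently.
    """
    chunks = [
        chunk
        for name, (low, high) in tables.items()
        for chunk in make_chunks(name, low, high, chunk_size)
    ]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(lambda c: extract_chunk(c, fetch_rows), chunks))


if __name__ == "__main__":
    # In-memory stand-in for the source database.
    source = {"orders": list(range(1, 11)), "users": list(range(1, 6))}

    def fetch(table, low, high):
        return [key for key in source[table] if low <= key <= high]

    ranges = {"orders": (1, 10), "users": (1, 5)}
    for chunk, rows in parallel_extract(ranges, fetch, chunk_size=4, threads=3):
        print(chunk, len(rows))
```

The threads and chunk_size arguments play the role of the controls mentioned above: more threads means more chunks in flight at once, and smaller chunks spread a single large table across more workers.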

Currently, the parallel extractor is designed to work for Oracle to MySQL provisioning with Tungsten Replicator, but the same principles can be applied to MySQL-to-MySQL setups. Using the parallel extractor is deceptively simple, and you can check out the current, Oracle-related, instructions here.

What this provides is a very simple way to take an entire existing database full of data and seed your target database with that information by using the replicator. This means the Parallel Extractor could be used to provision new slaves when expanding an existing cluster, to convert a single-machine installation to use replication by seeding the slave with the existing data without needing a backup, or, as currently designed, to seed a heterogeneous replication installation with new data without having to use a complex dump, massage and reload process.


Using the Continuent Docs

As you have hopefully noticed, the Continuent documentation is achieving a pretty good critical mass. The content of the documentation is always the most important consideration. Secondary is making sure that the information in the documentation can be found, and that when reading, you can hover and click to get relevant information so that you can understand the content even better.

We’ve got a few different solutions and tips that I think are worth highlighting so that people can use the documentation more effectively.

Searching

When you want to look for something in the documentation, use the search bar right up at the top. The search is available both on the Documentation Library page and within individual documents.

Screen Shot 2014-03-12 at 07.13.22

When used on the Documentation Library page, search shows you potential matches across all the documentation for the word or item you are searching for. For example, here I’ve searched for FAQ; entries are ranked by manual according to release:

Screen Shot 2014-03-12 at 07.17.32

When searching within a document, you get shown the items within this document first, followed by matches within other documents:

Screen Shot 2014-03-12 at 07.22.39

The search content itself is heavily indexed and designed so that you should go to the right item as the first one in the list.

It works both on wide terms, for example, Filters, and on commands, command-line arguments and options within a typical command. For example, type ‘trepctl status’ and you will get not only the key command, but all its derivatives. But type in an option, like ‘-at-event’, and you’ll get the explicit entry for that item.

Screen Shot 2014-03-12 at 07.28.44

Note that the search is very deliberately not a free-text search. This is to ensure that you get to exactly the right page, rather than all the pages that might mention ‘trepctl status’.

Hover Highlights

When reading the documentation you might come across some terms or information that you are not familiar with. In this case, hover over the item and you’ll get a definition.

Screen Shot 2014-03-12 at 07.40.13

Click the highlighted item, and you’ll get taken to the reference page for that specific item.

Deep Linking

I mentioned the mechanics of this process recently, but the use-case within the documentation is that virtually everything of significance is automatically linked to the right, canonical, page for the information.

For example, in the image below, there are links to the various ONLINE and OFFLINE states that can be clicked on, and the same is true for nearly all filenames, options, commands, and all combinations thereof.

Screen Shot 2014-03-12 at 10.12.11

Related Pages

In certain sections, the sidebar lists links to other pages that might be useful to the current discussion, but which we do not link to directly in reference to another item.

This is supported for related pages:

Screen Shot 2014-03-12 at 10.25.58

FAQ entries:

Screen Shot 2014-03-12 at 10.52.26

We don’t have entries yet, but release note and Error/Cause/Solution (troubleshooting) links are supported too. Note that these links only appear on pages that have the related items.

Table of Contents Navigation

Immediately above the related pages is the basic navigation section. This is divided into:

  • Parent Sections – these are sections at the same level as the current page that you might want to jump to. For example, you can easily jump from Fan-In to Star deployments.
  • Navigate Up – goes up to the parent.
  • Chapters – A list of all the chapters and appendices in this manual.

Other Manuals

For each page in each manual we also provide a link to the same page in other manuals. There are two reasons for this. The first is so that you can compare or jump to differences in other versions of the same manual. The second is to jump between the Tungsten Replicator and Continuent Tungsten manuals if you find yourself on the right page, but in the wrong product manual.

Screen Shot 2014-03-12 at 10.42.58

So as you can see, there’s a lot more to the docs than just the content (critical though it is), and hopefully this has helped to explain how usable the documentation is and, more importantly, how easy it should be to find the information you need.


Intelligent Linking and Indexing in DocBook

One of the issues I have with DocBook XML is that the links are a little forced and manual. 

By that, I mean that if I have a command, like trepctl, and I used it in a sentence or description, if I want to link trepctl back to the corresponding trepctl page, I have to manually add it like this:

<link linkend="cmdline-tools-trepctl"><command>trepctl</command></link>

Not only is that a mouthful to say, it’s a lot of keys to type. 

So I’ve fixed that. 

What I do instead is add a <remark> block into the documentation for that command-line page:

<section id="cmdline-tools-trepctl">
<title>The trepctl Command</title>
<remark role="index:canonical" condition="command:trepctl"/>
...
</section>

The ‘role’ attribute marks this as a canonical index entry, i.e., this is the canonical page for <command>trepctl</command>.

The item the entry applies to is what is defined in the ‘condition’ attribute.

This means that when I put <command>trepctl</command>, during post-processing, the command is automatically linked to that page without me having to manually do that. 

You can see the effect of this at the top of this page.

It works for anything, and it works for longer fragments too, so I can do ‘Use the <command>trepctl status</command> command’, and the post-processing will automatically link to the canonical page for the trepctl status command. On that same page, you can see links to the field names in the output. 

This uses an extension to the original index reference format: 

<remark role="index:canonical" condition="parameter:appliedLastSeqno:thl"/>

That third argument to the condition attribute gives a ‘hint’ as to what it might apply to. This means that we can link using a commonly used DocBook element, such as parameter, and tag it to link to this canonical page, just by adding a condition attribute to the parameter element, like this:

<parameter condition="thl">appliedLastSeqno</parameter>

OK, so it’s still long, but it’s less complex than writing out <link> or <xref> elements, and it means that I don’t have to know the ID where the information is held; that’s entirely driven from marking up the content with the canonical index entry.
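
For anyone curious how this kind of post-processing might look, here is a small Python sketch using the standard library's ElementTree. It is my own illustration of the idea rather than the actual toolchain: it collects the index:canonical remarks into a lookup table, then wraps matching <command> and <parameter> elements in a <link>. The function names and the choice of which elements to process are assumptions.

```python
import xml.etree.ElementTree as ET


def build_index(root):
    """Map 'element:text[:hint]' keys to the id of their canonical section."""
    index = {}
    for section in root.iter("section"):
        # Only remarks that are direct children belong to this section.
        for remark in section.findall("remark"):
            if remark.get("role") == "index:canonical":
                index[remark.get("condition")] = section.get("id")
    return index


def autolink(root, index, tags=("command", "parameter")):
    """Wrap elements that have a canonical page in <link linkend="...">."""
    parents = {child: parent for parent in root.iter() for child in parent}
    for elem in list(root.iter()):
        parent = parents.get(elem)
        if elem.tag not in tags or parent is None or parent.tag == "link":
            continue
        # The key mirrors the condition format: element, text, optional hint.
        key = f"{elem.tag}:{(elem.text or '').strip()}"
        hint = elem.get("condition")
        if hint:
            key = f"{key}:{hint}"
        target = index.get(key)
        if target is None:
            continue
        link = ET.Element("link", {"linkend": target})
        # Text following the element must now follow the link instead.
        link.tail, elem.tail = elem.tail, None
        position = list(parent).index(elem)
        parent.remove(elem)
        link.append(elem)
        parent.insert(position, link)
    return root


if __name__ == "__main__":
    doc = ET.fromstring(
        '<book>'
        '<section id="cmdline-tools-trepctl">'
        '<title>The trepctl Command</title>'
        '<remark role="index:canonical" condition="command:trepctl"/>'
        '</section>'
        '<section id="example">'
        '<para>Use <command>trepctl</command> to check the status.</para>'
        '</section>'
        '</book>'
    )
    autolink(doc, build_index(doc))
    print(ET.tostring(doc, encoding="unicode"))
```

Because the lookup key is built from the element name, its text and the optional condition hint, the author only ever writes the short form and never needs to know the target ID.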

Finally, and perhaps the most important point, is that you can go to any of the Continuent documentation pages and type either the partial or full command, and it will take you to the canonical page for that command, option, etc. This means that not only is the content heavily linked (making it more useful), but searching also gets you to the right place more easily.
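
As a rough idea of how a search over the canonical index could behave, here is a tiny Python sketch that returns exact matches first, followed by entries that start with what was typed. It is only an illustration of the approach, not the search used on the documentation site; apart from cmdline-tools-trepctl, the page IDs in the sample index are made up for the example.

```python
def lookup(index, query):
    """Return canonical page ids: exact matches first, then prefix matches.

    index maps 'element:text[:hint]' keys to page ids.
    """
    query = query.strip().lower()
    exact, partial = [], []
    for key, page in index.items():
        # The searchable term is the second field of the key.
        term = key.split(":")[1].lower()
        if term == query:
            exact.append(page)
        elif term.startswith(query):
            partial.append(page)
    return exact + partial


if __name__ == "__main__":
    sample_index = {
        "command:trepctl": "cmdline-tools-trepctl",
        # The ids below are hypothetical, for illustration only.
        "command:trepctl status": "example-trepctl-status",
        "parameter:appliedLastSeqno:thl": "example-applied-last-seqno",
    }
    print(lookup(sample_index, "trepctl"))
    print(lookup(sample_index, "applied"))
```

Matching only against indexed terms, rather than the full text, is what keeps the first result pointed at the canonical page.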