All posts by Martin MC Brown

a.k.a.: Martin Brown, mcslp, Martin C Brown, MC

Real-Time Data Movement: The Key to Enabling Live Analytics With Hadoop

An article about moving data into Hadoop in real-time has just been published over at DBTA, written by me and my CEO Robert Hodges.

In the article I talk about one of the major issues for anyone deploying databases in the modern heterogeneous world: how do we move and migrate data effectively between entirely different database systems in a way that is efficient and usable? How do you get the data you need into the database where you need it? If your source is a transactional database, how does that data get moved into Hadoop in a way that leaves it usable and queryable through Hive, Impala or HBase?

You can read the full article here: Real-Time Data Movement: The Key to Enabling Live Analytics With Hadoop


Cross your Fingers for Tech14, see you at OSCON

So I’ve submitted my talks for the Tech14 UK Oracle User Group conference, which is in Liverpool this year. I’m not going to give away the topics, but you can imagine they are going to be about data translation and movement, and how to get your various databases talking to each other.

I can also say, after having seen other submissions for talks this year (as I’m helping to judge), that the conference is shaping up to be very interesting. There’s a good spread of different topics this year, but I know from having talked to the organisers that they are looking for more submissions in the areas of Operating Systems, Engineered Systems and Development (mobile and cloud).

If you’ve got a paper, presentation, or idea for one that you think would be useful, please go ahead and submit your idea.

I’m also pleased to say that I’ll be at OSCON in Oregon in July, handling a Birds of a Feather (BOF) session on the topic of exchanging data between MySQL, Oracle and Hadoop. I’ll be there with my good friend Eric Herman from Booking.com, where we’ll be providing advice, guidance and experiences, and hoping to exchange more ideas, wishes and requirements for heterogeneous environments.

It’d be great to meet you if you want to come along to either conference.


Passion for Newspaper Comics? Watch Stripped

I’m a big fan of comics – and although I am a fan of Spider-Man, Superman, and my personal favourite, 2000AD – what I’m really talking about is the newspaper comics featuring stars like Garfield, Dilbert, and Calvin and Hobbes.

Unfortunately, being in the UK before the internet existed in its current form, finding these comics, particularly the US ones, was difficult. We don’t have many US comics in UK newspapers, and to be honest, very few papers in the UK have a good variety of any comics. That made feeding the habit difficult, as I would literally trawl around the humour sections of bookstores to find the books I needed.

Garfield was my first foray into the market, and I bought one of the first books not long after it came out. Then, as I started looking around a little more, I came across others, like Luann and For Better or For Worse, before finding the absolute joy that was Calvin and Hobbes, and ultimately getting hold of Foxtrot, Sherman’s Lagoon and many, many more.

Of course, the Internet has made these hugely accessible, and indeed not only do I read many comics every day, but I very deliberately subscribe (and by that, I mean pay money) to both Comics Kingdom (43 daily comics in my subscription) and GoComics.com (72 daily comics). I also continue to buy the books. Why?

Because at the end of a day spent looking at screens and taxing the brain, what I really want to do is chill and read something that is still intelligent, but not mentally taxing, and that means reading my comic books. They give me a break and a giggle, and I find that a nice way to go to sleep.

The more important reason, though, is that I enjoy these comics and believe these people should be rewarded for their efforts. Honestly, these guys work their laughter muscles harder than most people I know, creating new jokes that make me laugh. They don’t just do this regularly, or even frequently. They do it *every day*.

As a writer I know how hard it is to create new content every day, and keep it interesting. I cannot imagine how hard it is to keep doing it, and making it funny and enjoyable for people to read.

Over the years, I’ve also bought a variety of interesting things, including the massive Dilbert, Calvin & Hobbes and Far Side collectibles. I own complete collections of all the books by my favourite authors, and I’ve even contacted the authors directly when I haven’t been able to get them from the mighty Amazon. To people like Hilary B Price (Rhymes with Orange), Tony Carillo (F-Minus), Scott Hilburn (The Argyle Sweater), Leigh Rubin (Rubes) and Dave Blazek (Loose Parts), I thank you for your help in feeding my addiction. To Mark Leiknes (of the now defunct Cow & Boy), I thank you for the drawings from your drawing board and notebook, and I’m sorry it didn’t work out.

But to Dave Kellett & Fred Schroeder I owe a debt of special gratitude. Dave Kellett, of course, writes the excellent Sheldon, and not only do I have the full set, but Dave signed them all first. I’ve also got one of the limited-edition Arthur’s…

But together, they produced the wonderful Stripped! which I funded through Kickstarter along with so many others (you can even see my name in the credits!). If you have any interest in how comics are drawn, where the ideas come from, and how difficult the whole process is, you should watch it. Even more, you should watch it if you want to know what these people look like.

Comic artists are people whose names we often don’t even know, and even when we do know the name, we very rarely get to see what they look like. Yet these people are superstars. Really. Think about it: they write the screenplay, direct it, produce it, provide all the special effects, act all the parts, and do all the voices. And despite wearing all of these different hats, every day, they can still be funny and, like all good comedy, thought-provoking.

For me there is one poignant moment in the film too: understanding how, in a world where newspapers and comic syndication are dwindling fast, these people expect to make a living. The Internet is a great way for comic artists to get exposure to an ever-growing army of fans, but I think there’s going to be an interesting crossover period for those comics that started out in the papers.

The film itself is great. Not only do you get to see these comic artist gods, but you get to understand their passion and interest, and why they do what they do. That goes a long way towards helping you connect their passion, making the comics, with your own passion, reading them.

If you like comics, find a way of giving some money back to these people, whether it’s a subscription, buying their books or buying merchandise.


Revisiting ZFS and MySQL

While at Percona Live this year I was reminded about ZFS and running MySQL on top of a ZFS-based storage platform.

Now I’m a big fan of ZFS (although sadly I don’t get to use it as much as I used to after I shut down my home server farm), and I did a lot of different testing back when I was at MySQL to ensure that MySQL, InnoDB and ZFS worked correctly together.

Of course, today we have a completely new range of ZFS-compatible environments, not least of which are FreeBSD and ZFS on Linux, so I think it’s time to revisit some of my original advice on using this combination.

Unfortunately the presentations and MySQL University sessions back then have all been taken down. But that doesn’t mean the advice is any less valid.

Some of the core advice for using InnoDB on ZFS (a consolidated example of these settings follows the list):

  • Configure a single InnoDB tablespace, rather than configuring multiple tablespaces across different disks, and then let ZFS manage the underlying disks using stripes or mirrors or whatever configuration you want. This avoids you having to restart or reconfigure your tablespaces as your data grows, and moves that job out to ZFS, which can do it much more easily and while the filesystem and database remain online. That means we can do:
innodb_data_file_path = /zpool/data/ibdatafile:10G:autoextend
  • While we’re talking about the InnoDB data files, the best optimisation you can make is to set the ZFS record size to match the InnoDB page size. You should do this *before* you start writing data. That means creating the filesystem and then setting the record size:
zfs set recordsize=8K zpool/data
  • You can also configure a separate filesystem for the InnoDB logs with a record size of 128K. That’s less relevant in later versions of ZFS, but it does no harm.
  • Switch on compression. Within ZFS this improves I/O times, because less data is physically read from and written to disk, and the compression is lightweight enough to handle the load while still reducing the overall I/O time.
  • Disable the InnoDB doublewrite buffer. The transactional, copy-on-write nature of ZFS ensures the validity of the data written to disk, so we don’t need two copies of the data to guarantee recovery from the partial page writes that the doublewrite buffer normally protects against. The performance gain is small, but worth it.
  • Using direct I/O (innodb_flush_method=O_DIRECT in your my.cnf) also improves performance for similar reasons: with direct writes we can be sure the information is written straight to the right place. EDIT: Thanks to Yves for pointing out that this is not currently supported on ZFS on Linux.
  • Limit the Adaptive Replacement Cache (ARC); without doing this, ZFS can end up using a lot of memory for its cache that would be better used at the database level for caching record data. We don’t need the block-level cache as well.
  • Configure a separate ZFS Intent Log (ZIL), really a Separate Intent Log (SLOG) device. If you are not using SSD throughout, this is a great place to use SSD to speed up your overall disk I/O performance. A SLOG stores the immediate (synchronous) writes on SSD, enabling ZFS to schedule more efficient block writes out to the slower spinning disks. The real difference is that this lowers disk writes, lowers latency, and lowers overall spinning-disk activity, meaning the disks will probably last longer, not to mention making your system quieter in the process. For the sake of $200 of SSD, you could double your performance and get an extra year or so out of the disks.
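Pulling those rules together, here is a rough sketch of what the ZFS and my.cnf sides might look like. It is only illustrative: the pool layout, device names and sizes are placeholders, and the 16K record size assumes the default InnoDB page size of 16KB, so match it to whatever innodb_page_size you are actually running.

# Create the pool and separate filesystems for the data and the logs (devices are examples)
zpool create zpool mirror /dev/sdb /dev/sdc
zfs create zpool/data
zfs create zpool/logs

# Match the data record size to the InnoDB page size (16K by default), 128K for the logs
zfs set recordsize=16K zpool/data
zfs set recordsize=128K zpool/logs

# Switch on compression (lz4 on recent ZFS releases, otherwise compression=on)
zfs set compression=lz4 zpool/data

# Add an SSD as the separate log (SLOG) device
zpool add zpool log /dev/nvme0n1

# Limit the ARC; on ZFS on Linux this is the zfs_arc_max module parameter (here ~4GB)
echo "options zfs zfs_arc_max=4294967296" >> /etc/modprobe.d/zfs.conf

The matching my.cnf fragment then looks something like this:

[mysqld]
innodb_data_file_path     = /zpool/data/ibdatafile:10G:autoextend
innodb_log_group_home_dir = /zpool/logs
# Disable the doublewrite buffer; ZFS already protects against partial page writes
innodb_doublewrite        = 0
# Direct I/O; leave this out on ZFS on Linux, where O_DIRECT is not supported (see above)
innodb_flush_method       = O_DIRECT
# Give the memory saved from the ARC to the InnoDB buffer pool instead
innodb_buffer_pool_size   = 8G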

Surprisingly, not much has changed in these key rules; perhaps the biggest difference is the change in the price of SSD between when I wrote the original rules and today. SSD is cheap(er) now, cheap enough that many people can afford to use SSDs as their main disks rather than spinning disks as their storage, especially if they are building serious machines.


Harvest machine data using Hadoop and Hive

A new article has been published on IBM developerWorks, looking at the basics of processing machine data using Hadoop: extracting the core data, storing it, and then determining the baselines and trigger points required to identify worrying trends. From the intro:

Machine data can come in many different formats and quantities. Weather sensors, fitness trackers, and even air-conditioning units produce massive amounts of data, which begs for a big data solution. But how do you decide what data is important, and how do you determine what proportion of that information is valid, worth including in reports, or valuable in detecting alert situations? This article covers some of the challenges and solutions for supporting the consumption of massive machine data sets that use big data technology and Hadoop.

Harvest machine data using Hadoop and Hive.


Tungsten Replicator 3.0 is Cloudera Enterprise 5 Certified

One of the key platforms I’ve been testing on for MySQL to Hadoop replication has been Cloudera, largely driven by customer requirements, but it’s also one of the easiest ways to get started with Hadoop.

What I’m even more pleased about is that we can now announce that Tungsten Replicator 3.0 is certified for use on the new Cloudera Enterprise 5 platform. That means we’re sure that replicating your data from MySQL into Cloudera 5 will work without causing problems or difficulties with the Hadoop loading and materialisation.

Cloudera is a great product, and we’re very happy to be working so effectively with the new Cloudera Enterprise 5. Cloudera certainly makes the core operation of managing and monitoring your Hadoop cluster so much easier, while still providing core functionality from the Hadoop family like Hive, HBase and Impala.

What I’m really interested in is the support for Spark, which will allow much easier live-querying and access to data.  That should make some data processing and live data views much easier to build and query further down the line.


Continuent Replication to Hadoop – Now in Stereo!

Hopefully by now you have already seen that we are working on Hadoop replication, and I’m happy to say that it is going really well. I’ve managed to push a few terabytes of data and different data sets through into Hadoop on Cloudera, Hortonworks, and Amazon’s Elastic MapReduce (EMR). For those who have been following my long association with the IBM InfoSphere BigInsights Hadoop product, I’m pleased to say that it’s working there too. I’ve had to adapt Robert’s original script to work with the different versions of the underlying Hadoop tools and systems, but the actual performance and process is unchanged; you just use a different JS-based batchloader script for the different tools.

Robert has also been simplifying some of the core functionality, such as configuring some fixed pre-determined formats, so you no longer have to explicitly set the field and record separators.

I’ve also been testing the key feature of being able to integrate the provisioning of information using Sqoop, merging that original Sqooped data into Hadoop, and then following up with the change data that the replicator is transferring over. The system works exactly as I’ve just described: start the replicator, Sqoop the data, materialise the view within Hadoop. It’s that easy; in fact, if you want a deeper demonstration of all of these features, we’ve got a video from my recent webinar session:

Real Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0
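As a rough illustration of the Sqoop provisioning step described above, the initial load can be a single import into a staging directory, which is then merged with the change data coming in through the replicator. This is only a sketch; the hostname, credentials, database, table and paths here are placeholders, not part of any real setup:

# Example only: one-off bulk load of a source table into HDFS before merging changes
sqoop import \
  --connect jdbc:mysql://db-host/sales \
  --username tungsten \
  --password-file /user/tungsten/.mysql-pass \
  --table orders \
  --target-dir /user/tungsten/staging/sales/orders

Once the replicator is running and the bulk data is in place, the materialisation step merges the two so that queries see a single, up-to-date view of the table.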

If you can’t spare the time, but still want to know about our Hadoop applier, try our short 5-minute video:

Real-time data loading into Hadoop with Tungsten Replicator

While you’re there, check out the Clustering video I did at the same time:

Continuent Tungsten Clustering

And of course, don’t forget that you can see the product and demos live by attending Percona Live in Santa Clara this week (1st-4th April).


Real-Time Data Loading from MySQL to Hadoop using Tungsten Replicator 3.0 Webinar

To follow up and describe some of the methods and techniques behind replicating into Hadoop from MySQL in real time, and how this can be combined into your data workflow, Continuent are running a webinar, with me presenting, that will go over the details and provide a demo of the data replication process.

Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0

Hadoop is an increasingly popular means of analyzing transaction data from MySQL. Up until now mechanisms for moving data between MySQL and Hadoop have been rather limited. The new Continuent Tungsten Replicator 3.0 provides enterprise-quality replication from MySQL to Hadoop. Tungsten Replicator 3.0 is 100% open source, released under a GPL V2 license, and available for download at https://code.google.com/p/tungsten-replicator/. Continuent Tungsten handles MySQL transaction types including INSERT/UPDATE/DELETE operations and can materialize binlogs as well as mirror-image data copies in Hadoop. Continuent Tungsten also has the high performance necessary to load data from busy source MySQL systems into Hadoop clusters with minimal load on source systems as well as Hadoop itself.

This webinar covers the following topics:

- How Hadoop works and why it’s useful for processing transaction data from MySQL
- Setting up Continuent Tungsten replication from MySQL to Hadoop
- Transforming MySQL data within Hadoop to enable efficient analytics
- Tuning replication to maximize performance.

You do not need to be an expert in Hadoop or MySQL to benefit from this webinar. By the end listeners will have enough background knowledge to start setting up replication between MySQL and Hadoop using Continuent Tungsten.

You can join the webinar on 27th March (Thursday), 10am PDT, 1pm EDT, or 5pm GMT by registering here: https://www1.gotomeeting.com/register/225780945