Tag Archives: MySQL

analytics, big data, data migration, hadoop, oracle

Making Real-Time Analytics a Reality — TDWI -The Data Warehousing Institute

July 15, 2014 Martin MC Brown

My article on how to make the real-time processing of information from traditional transactional stores into Hadoop a reality has been published over at TDWI:

Making Real-Time Analytics a Reality — TDWI -The Data Warehousing Institute.

continuent, oracle, replication

Replicating Oracle Webinar Question Follow-up

June 16, 2014 Martin MC Brown

We had really great webinar on Replicating to/from Oracle earliest this month, and you can view the recording of that Webinar here.

A good sign of how great a Webinar was is the questions that come afterwards, and we didn’t get through them all. so here are all the questions and answers for the entire webinar.

Q: What is the overhead of Replicator on source database with asynchronous CDC?

A: With asynchronous operation there is no substantial CPU overhead (as with synchronous), but the amount of generated redo logs becomes bigger requiring more disk space and better log management to ensure that the space is used effectively.

Q: Do you support migration from Solaris/Oracle to Linux/Oracle?

A: The replication is not certified for use on Solaris, however, it is possible to configure a replicator to operate remotely and extract from a remote Oracle instance. This is achieved by installing Tungsten Replicator on Linux and then extracting from the remote Oracle instance.

Q: Are there issues in supporting tables without Primary Keys on Oracle to Oracle replication?

A: Non-primary key tables will work, but it is not recommended for production as it implies significant overhead when applying to a target database.

Q: On Oracle->Oracle replication, if there are triggers on source tables, how is this handled?

A: Tungsten Replicator does not automatically disable triggers. The best solution is to remove triggers on slaves, or rewrite triggers to identify whether a trigger is being executed on the master or slave and skip it accordingly, although this requires rewriting the triggers in question.

Q: How is your offering different/better than Oracle Streams replication?

A: We like to think of ourselves as GoldenGate without the price tag. The main difference is the way we extract the information from Oracle, otherwise, the products offer similar functionality. For Tungsten Replicator in particular, one advantage is the open and flexible nature, since Tungsten Replicator is open source, released under a GPL V2 license, and available at https://code.google.com/p/tungsten-replicator/.

Q: How is the integrity of the replica maintained/verified?

A: Replicator has built-in real-time consistency checks: if an UPDATE or DELETE doesn’t update any rows, Replicator will go OFFLINE:ERROR, as this indicates an inconsistent dataset.

Q: Can configuration file based passwords be specified using some form of encrypted value for security purposes to keep them out of the clear?

A: We support an INI file format so that you do not have to use the command-line installation process. There is currently no supported option for an encrypted version of these values, but the INI file can be secured so it is only readable by the Tungsten user.

Q: Our source DB is Oracle RAC with ~10 instances. Is coherency maintained in the replication from activity in the various instances?

A: We do not monitor the information that has been replicated; but CDC replicates row-based data, not statements, so typical sequence insertion issues that might occur with statement based replication should not apply.

Q: Is there any maintenance of Oracle sequence values between Oracle and replicas?

A: Sequence values are recorded into the row data as extracted by Tungsten Replicator. Because the inserted values, not the sequence itself, is replicated, there is no need to maintain sequences between hosts.

Q: How timely is the replication? Particularly for hot source tables receiving millions of rows per day?

A: CDC is based on extracting the data at an interval, but the interval can be configured. In practice, assuming there are regular inserts and updates on the Oracle side, the data is replicated in real-time. See https://docs.continuent.com/tungsten-replicator-3.0/deployment-oracle-cdctuning.html for more information on how this figure can be tuned.

Q: Can parallel extractor instances be spread across servers rather than through threads on the same server (which would be constrained by network or HBA)?

A: Yes. We can install multiple replicators and tune the extraction of the parallel extractor accordingly. However, that selection would need to be manual, but certainly that is possible.

Q: Do you need the CSV file (to select individual tables with the setupCDC.sh configuration) on the master setup if you want all tables?

A: No.

Q: If you lose your slave down the road, do you need to re-provision from the initial SCN number or is there a way to start from a later point?

A: This is the reason for the THL Sequence Number introduced in the extractor. If you lose your slave, you can install a new slave and have it start at the transaction number where the failed slave stopped if you know it, since the information will be in the THL. If not, you can usually determine this by examining the THL directly. There should be no need to re-provision – just to restart from the transaction in the THL on the master.

Q: Regarding a failed slave, what if it failed such that we don’t have a backup or wanted to provision a second slave such that it had no initial data.

A: If you had no backups or data, yes, you would need to re-provision with the parallel extractor in order to seed the target database.

Q: Would you do that with the original SCN? If it had been a month or two, is there a way to start at a more recent SCN (e.g. you have to re-run the setupCDC process)?

A: The best case is to have two MySQL slaves and when one fails, you re-provision it from the healthy one. This avoids setupCDC stage.

However, the replication can always be started from a specific event (SCN) provided that SCN is available in the Oracle undo log space.

Q: How does Tungsten handle Oracle’s CLOB and BLOB data types

A: Providing you are using asynchronous CDC these types are supported; for synchronous CDC these types are not supported by Oracle.

Q: Can different schemas in Oracle be replicated at different times?

A: Each schema is extracted by a separate service in Replicator, so they are independent.

Q: What is the size limit for BLOB or CLOB column data types?

A: This depends on the CDC capabilities in Oracle, and is not limited within Tungsten Replicator. You may want to refer to the Oracle Docs for more information on CDC: http://docs.oracle.com/cd/B28359_01/server.111/b28313/cdc.htm

Q: With different versions of Oracle e.g. enterprise edition and standard edition one be considered heterogeneous environments?

A: Essentially yes, although the nomenclature is really only a categorization, it does not affect the operation, deployment or functionality of the replicator. All these features are part of the open source product.

Q: Can a 10g database (master) send the data to a 11g database (slave) for use in an upgrade?

A: Yes.

Q: Does the Oracle replicator require the Oracle database to be in archive mode?

A: Yes. This is a requirement for Oracle’s CDC implementation.

Q: How will be able to revisit this recorded webinar?

A: Slides and a recording from today’s webinar will be available at http://www.slideshare.net/Continuent_Tungsten

continuent

A New Home for Tungsten in the UK

June 9, 2014 Martin MC Brown

I was suitably heartened to hear about the new mine opening up in the Devon here in the UK to mine the element Tungsten.

I comment on this to my associates at Continuent, where comments were made by Csaba as to the appropriate quotes in the article:

“Tungsten is an extraordinary metal.”

“It’s almost as hard as a diamond and has one of the highest melting points of any mineral.”

“Adding a small amount to steel makes it far harder, far more resistant to stress and heat. The benefits to industry are obvious.”

Leading to him to suggest Adding a small amount of Tungsten to MySQL makes it far harder, far more resistant to stress and failures. The benefits to industry are obvious.

I couldn’t possibly agree more!

big data, continuent, hadoop, oracle, Presentations and Conferences

Continuent at Hadoop Summit

May 30, 2014 Martin MC Brown

I’m pleased to say that Continuent will be at the Hadoop Summit in San Jose next week (3-5 June). Sadly I will not be attending as I’m taking an exam next week, but my colleagues Robert Hodges, Eero Teerikorpi and Petri Versunen will be there to answer any questions you have about Continuent products, and, of course, Hadoop replication support built into Tungsten Replicator 3.0.

If you are at the conference, please go along and say hi to the team. And, as always, if there are any questions please let them or me know.

big data, hadoop, oracle

Real-Time Data Movement: The Key to Enabling Live Analytics With Hadoop

May 22, 2014 Martin MC Brown

An article about moving data into Hadoop in real-time has just been published over at DBTA, written by me and my CEO Robert Hodges.

In the article I talk about one of the major issues for all people deploying databases in the modern heterogenous world – how do we move and migrate data effectively between entirely different database systems in a way that is efficient and usable. How do you get the data you need to the database you need it in. If your source is a transactional database, how does that data get moved into Hadoop in a way that makes the data usable to be queried by Hive, Impala or HBase?

You can read the full article here: Real-Time Data Movement: The Key to Enabling Live Analytics With Hadoop

big data, conferences, continuent, hadoop, oracle, Presentations and Conferences, ukoug

Cross your Fingers for Tech14, see you at OSCON

May 15, 2014 Martin MC Brown

So I’ve submitted my talks for the Tech14 UK Oracle User Group conference which is in Liverpool this year. I’m not going to give away the topics, but you can imagine they are going to be about data translation and movement and how to get your various databases talking together.

I can also say, after having seen other submissions for talks this year (as I’m helping to judge), that the conference is shaping up to be very interesting. There’s a good spread of different topics this year, but I know from having talked to the organisers that they are looking for more submissions in the areas of Operating Systems, Engineered Systems and Development (mobile and cloud).

If you’ve got a paper, presentation, or idea for one that you think would be useful, please go ahead and submit your idea.

I’m also pleased to say that I’ll be at OSCON in Oregon in July, handling a Birds of a Feather (BOF) session on the topic of exchanging data between MySQL, Oracle and Hadoop. I’ll be there with my good friend Eric Herman from Booking.com where we’ll be providing advice, guidance, experiences, and hoping to exchange more ideas, wishes and requirements for heterogeneous environments.

It’d be great to meet you if you want to come along to either conference.

Coalface, continuent, linux, zfs

Revisiting ZFS and MySQL

May 1, 2014 Martin MC Brown

While at Percona Live this year I was reminded about ZFS and running MySQL on top of a ZFS-based storage platform.

Now I’m a big fan of ZFS (although sadly I don’t get to use it as much as I used to after I shutdown my home server farm), and I did a lot of different testing back while at MySQL to ensure that MySQL, InnoDB and ZFS worked correctly together.

Of course today we have a completely new range of ZFS compatible environments, not least of which are FreeBSD and ZFS on Linux, I think it’s time to revisit some of my original advice on using this combination.

Unfortunately the presentations and MySQL University sessions back then have all been taken down. But that doesn’t mean the advice is any less valid.

Some of the core advice for using InnoDB on ZFS:

Configure a single InnoDB tablespace, rather than configuring multiple tablespaces across different disks, and then let ZFS manage the underlying disk using stripes or mirrors or whatever configuration you want. This avoids you having to restart or reconfigure your tablespaces as your data grows, and moves that out to ZFS which can do it much more easily and while the filesystem and database remain online. That means we can do:

innodb_data_file_path = /zpool/data/ibdatafile:10G:autoextend

While we’re taking about the InnoDB data files, the best optimisation you can do is to set the ZFS block size to match the InnoDB block size. You should do this *before* you start writing data. That means creating the filesystem and then setting the block size:

zfs set recordsize=8K zpool/data

What you can also do is configure a separate filesystem for the InnoDB logs that has a ZPool record size of 128K. That’s less relevant in later versions of ZFS, but actually it does no harm.
Switch on I/O compression. Within ZFS this improves I/O time (because less data is read/written physically from/to disk), and therefore improves overall I/O times. The compression is good enough and passive to be able to handle the load while still reducing the overall time.
Disable the double-write buffer. The transactional nature of ZFS helps to ensure the validity of data written down to disk, so we don’t need two copies of the data to be written to ensure valid recovery in the case of failure that are normally caused by partial writes of the record data. The performance gain is small, but worth it.
Using direct IO (O_DIRECT in your my.cnf) also improves performance for similar reasons. We can be sure with direct writes in ZFS that the information is written down to the right place. EDIT: Thanks to Yves, this is not currently supported on Linux/ZFS right now.
Limit the Adjustable Replacement Cache (ARC); without doing this you can end up with ZFS using a lot of cache memory that will be better used at the database level for caching record information. We don’t need the block data cache as well.
Configure a separate ZFS Intent Log (ZIL), really a Separate Intent Log (SLOG) – if you are not using SSD throughout, this is a great place to use SSD to speed up your overall disk I/O performance. Using SLOG stores immediate writes out to SSD, enabling ZFS to manage the more effective block writes of information out to slower spinning disks. The real difference is that this lowers disk writes, lowers latency, and lowers overall spinning disk activity, meaning they will probably last longer, not to mention making your system quieter in the process. For the sake of $200 of SSD, you could double your performance and get an extra year or so out the disks.

Surprisingly not much has changed in these key rules, perhaps the biggest different is the change in price of SSD between when I wrote these original rules and today. SSD is cheap(er) today so that many people can afford SSD as their main disk, rather than their storage format, especially if you are building serious machines.

big data, cloudera, Cloudera Enterprise, Coalface, continuent, hadoop, replication, tungsten-replicator

Tungsten Replicator 3.0 is Cloudera Enterprise 5 Certified

April 2, 2014 Martin MC Brown

One of the key platforms I’ve been testing on for the MySQL to Hadoop replication has been Cloudera, largely driven by customer requirements, but it’s also one of the easiest way to get started with Hadoop.

What I’m even more pleased about is the fact that we are proud to announce that Tungsten Replicator 3.0 is certified for use on the new Cloudera Enterprise 5 platform. That means that we’re sure that replicating your data from MySQL to Cloudera 5 and have it work without causing problems or difficulties on the Hadoop loading and materialisation.

Cloudera is a great product, and we’re very happy to be working so effectively with the new Cloudera Enterprise 5. Cloudera certainly makes the core operation of managing and monitoring your Hadoop cluster so much easier, while still providing core functionality from the Hadoop family like Hive, HBase and Impala.

What I’m really interested in is the support for Spark, which will allow much easier live-querying and access to data. That should make some data processing and live data views much easier to build and query further down the line.

big data, cloudera, Cloudera Enterprise, Coalface, continuent, hadoop, replication, tungsten-replicator

Tungsten Replicator 3.0 is Cloudera Enterprise 5 Certified

April 2, 2014 Martin MC Brown

big data, Coalface, continuent, hadoop, replication, tungsten-replicator

Continuent Replication to Hadoop – Now in Stereo!

March 31, 2014 Martin MC Brown

Hopefully by now you have already seen that we are working on Hadoop replication. I’m happy to say that it is going really well. I’ve managed to push a few terabytes of data and different data sets through into Hadoop on Cloudera, HortonWorks, and Amazon’s Elastic MapReduce (EMR). For those who have been following my long association with the IBM InfoSphere BigInsights Hadoop product, and I’m pleased to say that it’s working there too. I’ve had to adapt Robert’s original script to work with the different versions of the underlying Hadoop tools and systems to make it compatible. The actual performance and process is unchanged; you just use a different JS-based batchloader script to work with different tools.

Robert has also been simplifying some of the core functionality, such as configuring some fixed pre-determined formats, so you no longer have to explicitly set the field and record separators.

I’ve also been testing the key feature of being able to integrate the provisiong of information using Sqoop and merging that original Sqooped data into Hadoop, and then following up with the change data that the replicator is effectively transferring over. The system works exactly as I’ve just described – start the replicator, Sqoop the data, materialise the view within Hadoop. It’s that easy; in fact, if you want a deeper demonstration of all of these features, we’ve got a video from my recent webinar session:

Real Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0

If you can’t spare the time, but still want to know about our Hadoop applier, try our short 5-minute video:

Real-time data loading into Hadoop with Tungsten Replicator

While you’re there, check out the Clustering video I did at the same time:

Continuent Tungsten Clustering

And of course, don’t forget that you can see the product and demos live by attending Percona Live in Santa Clara this week (1st-4th April).

Planet MCB Guru

Tag Archives: MySQL

Making Real-Time Analytics a Reality — TDWI -The Data Warehousing Institute

Replicating Oracle Webinar Question Follow-up

A New Home for Tungsten in the UK

Continuent at Hadoop Summit

Real-Time Data Movement: The Key to Enabling Live Analytics With Hadoop

Cross your Fingers for Tech14, see you at OSCON

Revisiting ZFS and MySQL

Tungsten Replicator 3.0 is Cloudera Enterprise 5 Certified

Tungsten Replicator 3.0 is Cloudera Enterprise 5 Certified

Continuent Replication to Hadoop – Now in Stereo!

All the MCB Guru blogs that are fit to print