Category Archives: big data

Real-Time Replication from MySQL to Cassandra

Earlier this month I blogged about our new Hadoop applier, I published the docs for that this week (http://docs.continuent.com/tungsten-replicator-3.0/deployment-hadoop.html) as part of the Tungsten Replicator 3.0 documentation (http://docs.continuent.com/tungsten-replicator-3.0/index.html). It contains some additional interesting nuggets that will appear in future blog posts.

The main part of that functionality that performs the actual applier for Hadoop is based around a JavaScript applier engine – there will eventually be docs for that as part of the Batch Applier content (http://docs.continuent.com/tungsten-replicator-3.0/deployment-batchloading.html). The core of this system is that it    takes the information from the data stream of the THL and the CSV file that was written by the batch applier system, and runs the commands necessary to load it into Hadoop and perform any necessary merges.

I wanted to see how easy it would be to use the same system to use that same flexible system and bend it to another database system, in my case, I chose Cassandra.

For the record, it took me a couple of hours to have this working, and I’m guessing another hour will file down some of the rough edges.

Cassandra is interesting as a database because it mixes a big distributed key/value store with a close-enough to SQL like interface in the form of CQL. And that means we can make use of the CQL to help us perform the merging into the final tables in a manner not dissimilar to the method we use for loading into Vertica.

Back to the Javascript batch loader, the applier provides five different implementable functions (all are technically optional) that you can use at different stages of the applier process. These are:

  • prepare() – called once when the applier goes online and can be used to create temporary directories or spaces
  • begin() – called at the start of each transaction
  • apply() – called at the end of the transaction once the data file has been written, but before the commit
  • commit() – called after each transaction commit has taken place; this where we can consolidate info.
  • release() – called when the applier goes offline

We can actually align these functions with a typical transaction – prepare() happens before the statements even start, begin() is the same as BEGIN, apply() happens immediately before COMMIT and commit() happens just after. release() can be used to do any clean up afterwards.

So let’s put this into practice and use it for Cassandra.

The basic process for loading is as follows:

  1. Write a CSV file to load into Cassandra
  2. Load the CSV file into a staging table within Cassandra; this is easy through CQL using the ‘COPY tablename FROM filename’ CQL statement.
  3. Merge the staging table data with a live table to create a carbon copy of our MySQL table content.

For the loading portion, what we’ll do is load the CSV into a staging table, and then we’ll merge the staging table and live table data together during the commit stage of our batch applier. We’ll return to this in more detail.

For the merging, we’ll take the information from the staging table, which includes the sequence number and operation type, and then write the ‘latest’ version of that row and put it into the live table. That gives us a structure like this:

Cassandra Loader

Tungsten Replicator is going to manage this entire process for us – all we need to do ins install the replicators, plug in these custom bits, and let it run.

As with the Hadoop applier, what we’re going to do is use the batch applier to generate only insert and delete rows; UPDATE statements will be converted into a delete of the original version and insert of the new version. So:

INSERT INTO sample VALUES (1,’Message’)

Is an insert…

DELETE sample WHERE id  = 1

Is a delete, and:

UPDATE sample SET message = ’Now you see me’ WHERE id = 1

is actually:

DELETE sample WHERE id  = 1
 INSERT INTO sample VALUES (1,’Now you see me’)

This gets round the problem of doing updates (which in big data stores are expensive, particularly Hadoop which doesn’t support updating existing data), into a more efficient delete and insert.

In the CSV data itself, this is represented by prefix every row with three fields:

optype, sequence number, unique id

Optype is ‘D’ for a delete and ‘I’ for an insert and is used to identify what needs to be done. The sequence number is the unique transaction ID from the replicator THL. This number increases by one for every transaction, and this means we can always identify the ‘latest’ version of a row, which is important to us when processing the transaction into Cassandra. the unique ID is the primary key (or compound key) from the source data. We need this to ensure we update the right row. To replicate data in this way, we must have a primary key on the data. If you don’t have primary keys, you are probably in a world of hurt anyway, so it shouldn’t be a stretch.

One difficulty here is that we need to cope with an idiosyncracy of Cassandra, which is that by default, Cassandra orders fields in the ‘tables’ (really collections of key/values) so that integers and numbers appear first in the table, and text appears last. This is an optimisation that Cassandra makes that complicates things for us, but only in a very small way. For the moment, we’ll handle it by assuming that we are loading only one table with a known format into Cassandra. We could handle multiple tables by using a simple IF statement in the JS and using different formats for that, or we could actually extract the info from the incoming data; I’m going to skip that because it keeps us away from the cool element of actually getting the data in.

Within Cassandra then we have two tables, the table we are loading data into, and the staging table that we load the CSV data into. For our sample, the live schema is ‘sample', the live table is ‘sample’ and the staging table is ‘staging_sample’.

The definitions for these in Cassandra are for the sample live table:

 CREATE TABLE sample (
 id int,
 message text,
 PRIMARY KEY (id)
 ) WITH
 bloom_filter_fp_chance=0.010000 AND
 caching='KEYS_ONLY' AND
 comment='' AND
 dclocal_read_repair_chance=0.000000 AND
 gc_grace_seconds=864000 AND
 index_interval=128 AND
 read_repair_chance=0.100000 AND
 replicate_on_write='true' AND
 populate_io_cache_on_flush='false' AND
 default_time_to_live=0 AND
 speculative_retry='99.0PERCENTILE' AND
 memtable_flush_period_in_ms=0 AND
 compaction={'class': 'SizeTieredCompactionStrategy'} AND
 compression={'sstable_compression': 'LZ4Compressor'};

And for the staging_sample table:

CREATE TABLE staging_sample (
 optype text,
 seqno int,
 fragno int,
 id int,
 message text,
 PRIMARY KEY (optype, seqno, fragno, id)
 ) WITH
 bloom_filter_fp_chance=0.010000 AND
 caching='KEYS_ONLY' AND
 comment='' AND
 dclocal_read_repair_chance=0.000000 AND
 gc_grace_seconds=864000 AND
 index_interval=128 AND
 read_repair_chance=0.100000 AND
 replicate_on_write='true' AND
 populate_io_cache_on_flush='false' AND
 default_time_to_live=0 AND
 speculative_retry='99.0PERCENTILE' AND
 memtable_flush_period_in_ms=0 AND
 compaction={'class': 'SizeTieredCompactionStrategy'} AND
 compression={'sstable_compression': 'LZ4Compressor'};

I’ve put both tables into a ‘sample’ collection.

Remember that that idiosyncrasy I mentioned? Here it is, a bare table loading from CSV will actually order the data as:

seqno,uniqno,id,optype,message

This is Cassandra’s way of optimising integers over text to speed up lookups, but for us is a minor niggle. Right now, I’m going to handle it by assuming we are replicating only one schema/table and we we not what the structure of that looks like. Longer term, I want to pull it out of the metadata, but that’s a refinement.

So let’s start by having a look at the basic JS loader script, it’s really the component that is going to handle the core element of the work, managing the CSV files that come in from the batch engine and applying them into Cassandra. Remember, there are five functions that we can define, but for the purposes of this demonstration we’re going to use only two of them, apply(), which will load the CSV file into Cassandra, and the commit() function, which will perform the steps to merge the stage data.

The apply() function does two things, it identifies the table and schema, and then runs the command to load this data into Cassandra through the cqlsh command-line tool. We actually can’t run CQL directly from this command line, but I wrote a quick shell script that pipes CQL from the command-line into a running cqlsh.

The commit() function on the other hand is simpler, although it does a much more complicated job using another external script, this time written in Ruby.

So this gives us a cassandra.js script for the batch applier that looks like this:

function apply(csvinfo)
{
   sqlParams = csvinfo.getSqlParameters();
   csv_file = sqlParams.get("%%CSV_FILE%%");
   schema = csvinfo.schema;
   table = csvinfo.table;
  runtime.exec("/opt/continuent/share/applycqlsh.sh " + schema + ' "copy staging_' + table + " (optype,seqno,uniqno,id,message) from '" + csv_file + "';\"");
}

function commit()
{
  runtime.exec("/opt/continuent/share/merge.rb " + schema);
}

So, the apply() function is called for each event as written into the THL from the MySQL binary log, and the content of the CSV file generated at that point contains the contents of the THL event; if it’s one row, it’s a one-row CSV file; if it’s a statement or transaction that created 2000 rows, it’s a 2000 row CSV file.

The csvinfo object that is provided contains information about the batch file that is written, including, as you can see here, the schema and table names, and the sequence number. Note that we could, at this point, pull out table info, but we’re going to concentrate on pulling a single table here just for demo purposes.

The CQL for loading the CSV data is:

COPY staging_tablename (optype,seqno,uniqno,id,message) from ‘FILENAME’;

This says, copy the the specific columns in this order from the file into the specified table.  As I mentioned, currently this is hard coded into the applier JS, but would be easy to handle for more complex schemas and structures.

The commit() function is even simpler, because it just calls a script that will do the merging for us - we’ll get to that in a minute.

So here’s the script that applies an arbitrary CQL statement into Cassandra:

 #!/bin/bash
SCHEMA=$1;shift
echo "$*" |cqlsh -k $SCHEMA tr-cassandra2

Really simple, but gets round a simple issue.

The script that does the merge work is more complex; in other environments we might be able to do this all within SQL, but CQL is fairly limited with no sub-queries. So we do it long-hand using Ruby. The basic sequence is quite simple, and is in two phases:

  1. Delete every row mentioned in the staging table with an optype of D with a matching unique key
  2. Insert the *last* version of an insert for each unique ID - the last version will be the latest one in the output. We can pick this out by just iterating over every insert and picking the one with the highest Sequence number as generated by the THL transaction ID.
  3. Delete the content from the staging table because we’ve finished with it. That empties the staging table ready for the next set of transactions.

That file looks like this:

#!/usr/bin/ruby

require 'cql'

client = Cql::Client.connect(hosts: ['192.168.1.51'])
client.use('sample')

rows = client.execute("SELECT id FROM staging_sample where optype = 'D'")

deleteids = Array.new()

rows.each do |row|
puts "Found ID #{row['id']} has to be deleted"
deleteids.push(row['id'])
end

deleteidlist = deleteids.join(",")

client.execute("delete from sample where id in (#{deleteidlist})");
puts("delete from sample where id in (#{deleteidlist})");
rows = client.execute("SELECT * FROM staging_sample where optype = 'I'");

updateids = Hash.new()
updatedata = Hash.new()

rows.each do |row|
id = row['id']
puts "Found ID #{id} seq #{row['seqno']} has to be inserted"
if updateids[id]
if updateids[id] < row['seqno']
updateids[id] = row['seqno']
row.delete('seqno')
row.delete('fragno')
row.delete('optype')
updatedata[id] = row
end
else
updateids[id] = row['seqno']
row.delete('seqno')
row.delete('fragno')
row.delete('optype')
updatedata[id] = row
end
end

updatedata.each do |rowid,rowdata|
puts "Should update #{rowdata['id']} with #{rowdata['message']}"
collist = rowdata.keys.join(',')
colcount = rowdata.keys.length
substbase = Array.new()
#  (1..colcount).each {substbase.push('?')}
rowdata.values.each do |value|
if value.is_a?(String)
substbase.push("'" + value.to_s + "'")
else
substbase.push(value)
end
end

substlist = substbase.join(',')

puts('Column list: ',collist)
puts('Subst list: ',substlist)
cqlinsert = "insert into sample ("+collist+") values ("+substlist+")"
puts("Statement: " + cqlinsert)
client.execute(cqlinsert)
end

client.execute("delete from staging_sample where optype in ('D','I')")

Again, currently, this is hard coded, but I could easily of got the schema/table name from the JS batch applier - the actual code is table agnostic and will work with any table.

So, I’ve setup two replicators - one uses the cassandra.js rather than hadoop.js but works the same way, and copied the applycqlsh.sh and merge.rb into /opt/continuent/share.

And we’re ready to run. Let’s try it:

mysql> insert into sample values (0,'First Message’);
Query OK, 1 row affected (0.01 sec)

We’ve inserted one row. Let’s go check Cassandra:

cqlsh:sample> select * from sample;

id  | message
-----+---------------
489 | First Message

Woohoo - data from MySQL straight into Cassandra.

Now let’s try updating it:

mysql> update sample set message = 'Updated Message' where id = 489;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1  Changed: 1  Warnings: 0

And in Cassandra:

cqlsh:sample> select * from sample;

id  | message
-----+-----------------
489 | Updated Message

Bigger woohoo. Not only am I loading data directly into Cassandra, but I can update it as well. Now I can have a stream of update and information within MySQL replicated over to Cassandra for whatever analysis or information that I need without any issues.

Cool huh? I certainly think so (OK, but I’m biased).

Now I haven’t tested it, but this should just as easily work from Oracle; I’ll be testing that and let you know.

Any other database destinations people would like to see for replicating into? If so, let me know and I’ll see what I can do.


Getting Data into Hadoop in real-time

Moving data between databases is hard. Without ever intending it, I seem to have spent a lifetime working on solutions for getting data into and out of databases, but more frequently between. In fact, my first job out of university was migrating data from BRS/Text, a free-text database (probably what we would call a NoSQL) into a more structured Oracle.

Today I spend some of my time working in Big Data, more often than not, migrating information from existing data stores into Big Data so that they can be analysed, something I covered in more detail here:

http://www.ibm.com/developerworks/library/bd-sqltohadoop1/index.html
http://www.ibm.com/developerworks/library/bd-sqltohadoop2/index.html
http://www.ibm.com/developerworks/library/bd-sqltohadoop3/

The problem with the current techniques, Sqoop included, is that they rely on a relatively manual, even basic, transfer process. Dump your data out, reload it back again into Hadoop.

Even with Sqoop, although it automates much of the process, it is not entirely reliable, especially if you want to do more than simple dump and load. Serial loading, or incrementally transferring data from MySQL or Oracle, is fraught with problems, not least of which is that it requires adding a timestamp to your data structure to get the best results out of it.

Perhaps worse though is that Sqoop is an intermittent, even periodic transfer system. Incremental loading works by copying all the changed records since a specific point in time. Running it too frequently is counter productive, which means you end up using a 15-minute or every-couple-of-hour period, depending on your database activity.

Most databases have some kind of stream of changes that enables you to see everything that has happened on the database. With MySQL, that’s the binary log. And with the Open Source Tungsten Replicator tool we take advantage of that so that we can replicate into MySQL, and indeed into Oracle, MongoDB and Vertica, among others.

51c6a7b5abcca6c30f7d79ea8eba17f0

Reading the data out from MySQL is lightweight since the master just reads the contents of the binary log; especially compared to Sqoop, which uses read locks and SELECT * with and without LIMIT clauses.

Right now we’re working on an applier that writes that data into Hadoop in real time from MySQL. Unlike Sqoop, we provide a continuous stream of changes from MySQL into the immutable store of Hadoop.

But the loading and nature of Hadoop presents some interesting issues, not least of which (if you’ve been following my other articles) is the fact that data written into Hadoop is immutable. For data that is constantly changing, an immutable store is not the obvious destination.

We get round that by using the batch loading system to create CSV files that contain the data, changes and sequence numbers, and then loading that information into Hadoop. In fact, Robert has updated the batch loader to use a new JavaScript based system (of which more in a future blog post) that simplifies the entire process, without requiring a direct connection or interface to Hadoop (although we can write directly into HDFS).

For example, the MySQL row:


| 3 | #1 Single | 2006 | Cats and Dogs (#1.4) |

Is represented within the staging files generated as:


I^A1318^A3^A3^A#1 Single^A2006^ACats and Dogs (#1.4)

That’s directly accessible by Hive. In fact, using our ddlscan tool, we can even create the Hive table definitions for you:


ddlscan -user tungsten -url 'jdbc:mysql://host1:13306/test' -pass password \
-template ddl-mysql-hive-0.10.vm -db test

Then we can use that record of changes to create a live version of the data, using a straightforward query within Hive. In fact, Hive provides the final crucial stage of the loading process by giving us that live view of the change data, and we simplify that element by providing the core data, and ensuring that the CSV data is in the right format for Hive to use the files without changes.

The process is quite remarkable; speed-wise for direct dumps, Tungsten Replicator is comparable to Sqoop, but when it comes to change data, the difference is that we have the information in real time. You don’t have to wait for the next Sqoop load, or for the incremental loading and row selection of Sqoop, instead, we just apply the changes written into the binary log.

Of course, we can fine tune the intervals of the writes of the CSV change data into Hadoop using the block commit properties (see http://docs.continuent.com/tungsten-replicator-2.2/performance-block.html). For example, this means by default we commit into Hadoop every 10s or 10,000 rows, but we can change it to commit every 5s or 1,000 rows if your data is critical and busy.

We’re still optimising and improving the system, but I can tell you that in my own tests we can handle GB of change data and information in a live fashion, both across single-table and multi-table/multi-schema datasets. What’s particularly cool is that if you are using Hadoop as a concentrator for all of your MySQL data for analysis, we can transfer from multiple MySQL servers into Hadoop simultaneously and take advantage of the multi-node Hadoop environment to cope with the load.


Process complex text for information mining

My latest article on data mining text information is now available:

Text — an everyday component of nearly all social interaction, social networks, and social sites — is difficult to process. Even the basic task of picking out specific words, phrases, or ideas is challenging. String searches and regex tools don\’t suffice. But the Annotation Query Language (AQL) within IBM InfoSphere® BigInsights™ enables you to make simple and straightforward declarative statements about text and convert that into easily manageable data chunks. Learn how AQL and InfoSphere BigInsights can process text into meaningful data and find out how to convert that information into something usable within the BigSheets environment to get statistical and visualized data from the raw material.

Read Process complex text for information mining.


Building flexible apps from big data sources

My article on how to build flexible apps on top of the BigInsights platform has been published. This demonstrates a cool way to combine some client-end JavaScript and existing technologies to build a Big Data query interface without developing a specialised application for the purpose.

It’s no secret that a significant proportion of the needs for big data have come from the explosion in Internet technologies. Up until 10-20 years ago, the idea of a public-facing application having more than a few million users was unheard of. Today, even a modest website can have millions of users, and if it’s active, can generate millions of data items every day. The irony is that the very infrastructure and systems that create big data can also work in reverse, and provide some of the better ways to integrate and work with that data. Usefully, InfoSphere® BigInsights™ comes with support for managing and executing data jobs through a simple REST API. And through the Jaql interface, we can run queries and get information directly from a Hadoop cluster. This article looks at how these systems work together to give you a rich basis for capturing data and provide an interface to get the information back out again.

Building flexible apps from big data sources.


Process big data with Big SQL in InfoSphere BigInsights

The ability to write an SQL statement against your Big Data stored in Hadoop provides some much needed flexibility. Sure, using Hive or HBase you can perform some of those operations, but there are other alternatives that may suit your needs better, such as the Big SQL utility. My latest article on this tool is provided here:

SQL is a practical querying language, but is has limitations. Big SQL enables you to run complex queries on non-tabular data and query it with an SQL-like language. The difference with Big SQL is that you are accessing data that may be non-tabular, and may in fact not be based upon a typical SQL database structure. Using Big SQL, you can import and process large volume data sets, including by taking the processed output of other processing jobs within InfoSphere BigInsights™ to turn that information into easily query-able data. In this article, we look at how you can replace your existing infrastructure and queries with Big SQL, and how to take more complex queries and convert them to make use of your Big SQL environment.

Process big data with Big SQL in InfoSphere BigInsights.


SQL to Hadoop and back again, Part 3: Direct transfer and live data exchange

The third, and final article in my series on migrating data to and from Hadoop and SQL databases is now available:

Big data is a term that has been used regularly now for almost a decade, and it — along with technologies like NoSQL — are seen as the replacements for the long-successful RDBMS solutions that use SQL. Today, DB2®, Oracle, Microsoft® SQL Server MySQL, and PostgreSQL dominate the SQL space and still make up a considerable proportion of the overall market. In this final article of the series, we will look at more automated solutions for migrating data to and from Hadoop. In the previous articles, we concentrated on methods that take exports or otherwise formatted and extracted data from your SQL source, load that into Hadoop in some way, then process or parse it. But if you want to analyze big data, you probably don’t want to wait while exporting the data. Here, we’re going to look at some methods and tools that enable a live transfer of data between your SQL and Hadoop environments.

SQL to Hadoop and back again, Part 3: Direct transfer and live data exchange.


SQL to Hadoop and back again, Part 2: Leveraging HBase and Hive

The second article in a series covering Big Data and SQL interaction is available now:

“Big data” is a term that has been used regularly now for almost a decade, and it — along with technologies like NoSQL — are seen as the replacements for the long-successful RDBMS solutions that use SQL. Today, DB2®, Oracle, Microsoft® SQL Server MySQL, and PostgreSQL dominate the SQL space and still make up a considerable proportion of the overall market. Here in Part 2, we will concentrate on how to use HBase and Hive for exchanging data with your SQL data stores. From the outside, the two systems seem to be largely similar, but the systems have very different goals and aims. Let\’s start by looking at how the two systems differ and how we can take advantage of that in our big data requirements.

SQL to Hadoop and back again, Part 2: Leveraging HBase and Hive.