Comparing MySQL to Vertica Replication under MemCloud, AWS and Bare Metal

Back in December, I did a detailed analysis for getting data into Vertica from MySQL using Tungsten Replicator, all within the Kodiak MemCloud.

I got some good numbers towards the end – 1.9 million rows/minute into Vertica. I did this using a standard replicator deployment, plus some tweaks to the Vertica environment. In particular:

  • An integer hash partition for both the staging and base tables
  • Some tweaks to the queries to ensure that we used the partitions in the most efficient manner
  • Optimized the batching within the applier to hit the right numbers for the transaction counts

That last one is a bit of a cheat, because in a real-world situation it’s much harder to identify those transaction sizes and row counts, but for testing we’re trying to get the best performance!

Next, I wanted to set up some bare metal and AWS servers of an equivalent configuration, repeat the tests, and see what comparable performance I could get.

How I Load Masses of Data

Before I dip into that, however, I thought it would be worth seeing how I generate the information in the first place. With big data testing (mainly when trying to simulate the data that ultimately gets written into your analytics target) the primary concern is one of reproducing the quantity as well as the variety of the data.

It’s application dependent, but for some analytics tasks the inserts are quite high and the updates/deletes relatively low. So I’ve written a test script that generates up to a million rows of data, split to be around 65% inserts, 25% updates and 10% deletes.

I can tweak that of course, but I’ve found it gives a good spread of data. I can also configure whether that happens in one transaction or each row is a transaction of its own. That all gets dumped into an SQL file. A separate wrapper script and tool then load that information into MySQL, either using redirection within the MySQL command line tool or through a different lightweight C++ client I wrote.

The data itself is light, two columns, an auto-incrementing integer ID and a random string. I’m checking for row inserts here, not data sizes.
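
To make that concrete, a fragment of one generated SQL file might look like the sketch below. This is purely illustrative: the msg table matches the two-column layout described above, but the second column name (data) and the values are assumptions, not the generator's actual output.

-- Hypothetical fragment of a generated load file: roughly 65% inserts,
-- 25% updates and 10% deletes, optionally wrapped in a single transaction.
START TRANSACTION;
INSERT INTO msg VALUES (0,'RWSAjXaQEtCf8nf5xhQq');
INSERT INTO msg VALUES (0,'4kSmbikgaeJfoZ6gLnkN');
INSERT INTO msg VALUES (0,'YSG4yeG1RI6oDW0ohG6x');
UPDATE msg SET data = 'pL0qUzXbN1c9hTwAe2Yd' WHERE id = 101;
DELETE FROM msg WHERE id = 57;
COMMIT;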

So, to summarise:

  • Up to 1 million rows (although this is configurable)
  • Single or multiple transactions
  • Single schema/table or numerous schemas/tables
  • Concurrent, multi-threaded inserts

The fundamental result is that I can predict the number of transactions and rows, which is really important when measuring rows per time period as a replication benchmark, because I can also start and stop replication on the transaction count boundaries to get precise figures.

For the main testing that I use for the performance results, what I do is run a multi-threaded, simultaneous insert into 20 schemas/tables and repeat it 40 times with a transaction/row count size of 10,000. That results in 8,000,000 rows of data, first being inserted/updated/deleted into MySQL, then extracted, replicated, and applied to (in this case) Vertica.

For the testing, I use the start/stop sequence number controls in the replicator and monitor the time between those points.

This gives me stats within about 2 seconds of the true result, but over a period of 15-20 minutes that’s tolerable.

It also means I can do testing in two ways:

  • Start the load test into MySQL and test for completion into Vertica

or

  • Load the data into THL, and test just the target applier (from network transfer to target DB)

For the real-world performance figures I use the full end-to-end (MySQL insert and target apply) testing.

Test Environments

I tested three separate environments, the original MemCloud hosted servers, some bare metal hosts and AWS EC2 hosts:

             MemCloud    Bare Metal    AWS
Cores        4           12            16
Threads      4           12            16
RAM (GB)     64          192           122
Disk         SSD         SSD           SSD
Networking   10Gbit      10Gbit        25Gbit

It’s always difficult to perfectly match the environments across virtual and bare metal, particularly in AWS, but I did my best.

Results

I could go into all sorts of detailed results here, but I think it’s better to simply look at the final numbers because that is what really matters:

             Rows Per Minute
MemCloud     1900000
Bare Metal   678222
AWS          492893

What’s interesting here is that MemCloud is significantly faster, even though it has fewer CPUs and less RAM. It’s perhaps even more surprising to note that MemCloud is almost 4x faster than AWS, even on I/O-optimized hosts (I/O probably being the limiting factor in the Vertica applies).

[Graph: rows per minute for MemCloud, Bare Metal and AWS]

Even against fairly hefty bare metal hosts, MemCloud is almost 3x faster!

I’ve checked in with the engineers about the Bare Metal numbers, which seem striking, especially considering these are really beefy hosts, but it may simply be that the SSD interface and I/O become the limiting factor. When writing data into Vertica with the replicator, a few things are happening: we write THL to disk, write CSV to disk, read the CSV from disk into a staging table, then merge the base and staging tables, which involves shuffling a lot of blocks around in memory (and ultimately on disk). It may simply be that the high-memory environment of MemCloud allows for much faster performance all round.

I also looked at the performance as I started to increase the number of MySQL sources feeding into the systems; this is into separate schemas, rather than the single, unified schema/table within Vertica.

Sources            1         1         2         3         4         5
Target Schemas     20        40        40        60        80        100
Rows Written       8000000   8000000   16000000  24000000  32000000  40000000
MemCloud           1900057   1972000   3617042   5531460   7353982   9056410
Bare Metal         678222    635753    1051790   1874454   2309055   3168275
AWS                492893    402047    615856    -         -         -

What is significant here is that with MemCloud I saw a much more linear ramp-up in performance, which I didn’t see to the same degree with Bare Metal or AWS. In fact, with AWS I couldn’t even remotely achieve the same levels, and by the time I got to three simultaneous sources I got such wildly random results between executions that I gave up trying to test. From experience, I suspect this is due to the networking and IOPS environment, even on a storage-optimized host.

The graph version shows the differences more clearly:

[Graph: rows per minute vs. number of sources for MemCloud, Bare Metal and AWS]

Bottom line, MemCloud seems really quick, and the statement I made in the original testing still seems to be valid:

The whole thing went so quick I thought it hadn’t executed at all!

Analytical Replication Performance from MySQL to Vertica on MemCloud

I’ve recently been trying to improve the performance of the Vertica replicator, particularly in the form of the new single schema replication. We’ve done a lot in the new Tungsten Replicator 5.3.0 release to improve (and ultimately support) the new single schema model.

As part of that, I’ve also been personally looking at Kodiak MemCloud as a deployment platform. The people at Kodiak have been really helpful (disclaimer: I’ve worked with some of them in the past). MemCloud is a high-performance cloud platform based on hardware with high-speed (and high-volume) RAM, SSD and fast Ethernet connections. This means that even without any adjustment and tuning you’ve got a fast platform to work on.

However, if you are willing to put in some extra time, you can tune things further. Once you have a super quick environment, you find you can tweak and update the settings a little more because you have more options available.  Ultimately you can then make use of that faster environment to stretch things a little bit further. And that’s exactly what I did when trying to determine how quickly I could get data into Vertica from MySQL.

In fact, the first time I ran my high-load test suite on MemCloud infrastructure, replicating data from MySQL into Vertica, I made this comment to my friend at Kodiak:

The whole thing went so quick I thought it hadn’t executed at all!

In general, there are two things you want to test when using replication to move data from a transactional environment into an analytical one:

  • Latency of moving the data
  • Apply rate for moving the data

The two are subtly different. The first measures how long it takes to get the data from the source to the target. The second measures how much data you can move in a set period of time.

Depending on your deployment and application, either, or both, can be critical. For example, if you are using analytics to perform real-time analysis and charging on your data, the first is the most important, because you want the information up to date as quickly as possible. If you are performing log analysis or looking at longer-term trends, the second is probably more important; you may not worry about being a few seconds behind, but you want many thousands of transactions transferred. I concentrated on the latter rather than the former, because latency in the batch applier is something you can control by setting the batch interval.

So what did I test?

At a basic level, I was replicating data from MySQL directly into Vertica; that is, extracting data from the MySQL binary log and writing it into a table within a 3-node HPE Vertica cluster. Each node runs in MemCloud with 64GB of RAM and 2TB of SSD disk space. I deliberately made no configuration changes to Vertica at this point.

The first thing I did was set up a basic replication pipeline between the two. Replication into Vertica works by batch-loading data through CSV files into Vertica staging tables and then ‘materialising’ the changes into the carbon-copy base tables. Because it’s done in batches, the latency is effectively governed by the batch apply settings, which were configured for 10,000 rows or 5 seconds.

To generate the load, I’ve written a script that generates 100,000 rows of random data, then updates about 70% of those rows and deletes the other 30%. So each schema generates about 200,000 rows of changes for each load, which is designed to test the specific batch replication scenario. Ultimately it does this across multiple schemas (all with the same structure). I use this specifically because I can match it against the replication statistics (rather than transaction statistics); I need to be able to effectively monitor the apply rate into Vertica from MySQL.

The first time I ran the process of just generating the data and inserting it into MySQL, the command returned almost immediately. I seriously thought it had failed, because I couldn’t believe I’d just inserted 200,000 rows into MySQL that quickly. Over on the Vertica side, I was monitoring the apply through the trepctl perf command, which provides live output of the process, and for a second I saw the blip as the data was replicated and then applied. It was so quick that I thought a single row (or even a failed transaction) had caused the blip.

The first time I ran the tests, I got some good results with 20 simultaneous schemas:

  • 460,000 rows/minute from a single MySQL source into Vertica. 

Then I doubled up the source MySQL servers, so two servers, 40 simultaneous schemas, and ultimately writing in about 8 million rows:

  • 986,000 rows/minute into Vertica across 40 schemas from 2 sources

In both cases, the latency was between 3-7 seconds for each batch write (remember we are handling 10,000 rows or 5s per batch). We are also doing this across *different* schemas at this stage. These are not bad figures.

I did some further tweaks, this time, reconfiguring the batch writes to do larger blocks and larger intervals. This increases the potential latency (because there will be bigger gaps between writes into Vertica), but increases the overall row-apply performance. Now we are handling 100,000 rows or 10s intervals. The result? A small bump for a single source server:

  • 710,000 rows/minute into Vertica across 20 schemas from 1 source

Latency has increased though, with us topping out at around 11.5s for the write when performing the very big blocks. Remember this is single-source, and so I know that the potential is there to basically double that with a second MySQL source since the scaling seems almost linear.

Now I wanted to move on to test a specific scenario I added into the applier, which is the ability to replicate from multiple source schemas into a single target schema. Each source is identical, and to ensure that the ‘materialise’ step works correctly, and that we can still analyse the data, a filter is inserted into the replication that adds the source schema to each row.

The sample data inserts look like this:

insert into msg values (0,"RWSAjXaQEtCf8nf5xhQqbeta");
insert into msg values (0,"4kSmbikgaeJfoZ6gLnkNbeta");
insert into msg values (0,"YSG4yeG1RI6oDW0ohG6xbeta");

With the filter, what gets inserted is:

0,"RWSAjXaQEtCf8nf5xhQqbeta","sales1"
0,"4kSmbikgaeJfoZ6gLnkNbeta","sales1"

and so on, where ‘sales1’ is the source schema, added as an extra column to each row.

This introduces two things we need to handle on the Vertica side:

  1. We now have to merge taking into account the source schema (since the ID column of the data could be the same across multiple schemas). For example, whereas before we did DELETE WHERE ID IN (xxxx), now we have to do DELETE WHERE ID IN (xxxx) AND dbname = 'sales1' (see the sketch after this list).
  2. It increases the contention ratio on the Vertica table because we now effectively write into only one table. This increases the locks and the extents processed by Vertica.
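
As a rough sketch of what that single-target merge ends up looking like, the statements below follow the DELETE-then-INSERT pattern used by the batch applier; the table and column names (base_msg, stage_msg, data) are illustrative rather than the replicator's actual generated SQL.

-- Hypothetical batch merge into a single base table, with the source schema
-- carried as an extra dbname column on every row.
DELETE FROM base_msg
 WHERE dbname = 'sales1'
   AND id IN (SELECT id FROM stage_msg WHERE dbname = 'sales1');

INSERT INTO base_msg (id, data, dbname)
SELECT id, data, dbname
  FROM stage_msg
 WHERE dbname = 'sales1';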

The effect of this change is that the overall apply rate slows down slightly due to the increased contention on a single table. Same tests, 20 schemas from one MySQL source database and we get the following:

  • 760,000 rows/minute into Vertica with a single target table

This is actually not as bad as I was expecting when you consider that we are modifying every row of incoming data, and are no longer able to multi-thread the apply.

I then tried increasing that using the two sources and 40 schemas into the same single table. Initially, the performance was no longer linear, and I failed to get any improvement beyond about 10% above that 760K/min figure.

Now it was time to tune other things. First of all, I changed some of the properties on the Vertica side in terms of the queries I was running, tweaking the selection and query parameters for the DELETE operations. For batch loading, what we do is DELETE and then INSERT or, in some cases, DELETE and UPDATE if you’ve configured it that way. Tweaking the subquery that is used increased the performance a little.

Changing the projections used also increased the performance of the single schema apply. But the biggest gains, perhaps unsurprisingly, came from changing the way the tables were defined in the first place and using partitions in the table definition. To do this, I modified the original filter that was adding the schema name, and instead had it add the schema hash (the Java hash code of the string). Then I created the staging and base tables in Vertica using the integer hash as the partition key, and modified the queries to ensure that the partitions would be used effectively.
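
A minimal sketch of the resulting table definitions is shown below; the names and column sizes are assumptions, but the shape follows the description above, with the integer schema hash stored on every row and used as the partition key for both the staging and base tables so that the merge queries can prune partitions.

-- Hypothetical Vertica staging and base tables partitioned on the schema hash.
CREATE TABLE stage_msg (
    id          INT NOT NULL,
    data        VARCHAR(255),
    dbname_hash INT NOT NULL
) PARTITION BY dbname_hash;

CREATE TABLE base_msg (
    id          INT NOT NULL,
    data        VARCHAR(255),
    dbname_hash INT NOT NULL
) PARTITION BY dbname_hash;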

The result was that the 760k/min rate was now scalable. I couldn’t get any faster when writing into a single schema, but the rate remains relatively constant whether I am replicating five schemas into a single one, or 40 schemas from two or three sources into the same. In fact, it does ultimately start to dip slightly as you add more source schemas.

Even better, the changes I’d made to the queries and the overall Vertica batch applier also improved the speed of the standard (i.e. multi-schema) applier. I also added partitions on the ID field to improve the general apply rate. After testing for a couple of days, the average rate was:

  • 1,900,000 rows/minute from a single MySQL source into Vertica.

This was also scalable. Five MySQL sources elicited a rate of 8.8 million rows/minute into Vertica, making the applier rate close to linear, with roughly a 1% penalty for each additional MySQL source. The latency stayed the same, hovering around the previous level of roughly 11s most of the time. Occasionally you’d get a spike because Vertica was having trouble keeping up.

Essentially, we are replicating data from MySQL into Vertica for analytics at a rate I simply called ‘outstandingly staggering’. I still do.

The new single schema applier, database name filter (rowadddbname) and performance improvements are all incorporated into the new Tungsten Replicator.

Data Migration: Handling Data Types

Handling the structure, and deciding which fields, data types, or data points relate to which fields or target elements within the destination database, is one thing. Normally you can make an intelligent decision about the information that is being transferred and how that structural information should be handled.

The actual data can be a completely different problem. There are so many potential problems here with simply parsing and understanding the data that we need to look at some of the more common issues and how they should be addressed.

In most cases, handling the content and structure of the individual types is about understanding two dimensions of the problem:

  • Supported types – that is, whether the target database understands, or even identifies, the underlying type. For example, within most RDBMS environments data types are very strictly enforced, but in NoSQL or data warehouse environments the corresponding engines may either not care, or only care in specific situations.
  • Parseable information – the information may be readable or storable, but not to the same precision or definition. For example, with time values, some databases store hours, minutes and seconds, some store only hours and minutes, and a whole variety store different lengths of fractional seconds, to one or even 15 digits of precision. At the other end of the spectrum, some dates are represented by seconds, others by a string in various formats.

Let’s look at the finer details of understanding these different types and how they can be stored, exchanged and ultimately parsed.

Basic Types

There are four basic types that make up all data and that, on the whole, get represented within a database system. Identifying these four types will help us to set the groundwork for how we handle, identify and treat the rest of the data:

  • Numeric – basically any kind of number or numeric value, from integers through to floating point and even ‘Big’ numbers.
  • Strings – Strings seem like a very simple list of characters, but they aren’t as straightforward as you might think.
  • Dates – Dates are a special type all of their own. Although they look like the worst parts of numbers and strings combined, ensuring that they are identifiable as a date in the target database relies on understanding the different formats, separators and structures.
  • Binary – any type of data that does not fall into the previous three groups is binary data. Here the issue is one of identity and the ability to interpret the content once it reaches the destination database.

Remember that in all of these situations, we are not only talking about one-time formatting of the information. We could be dealing with frequent and regular exchanges of this information between the databases, and may even need to perform these conversions regularly if the data is integrated across multiple database environments.

When combining, for example, MongoDB data with Oracle information for the processes of reporting, you need to do more than change the format once. It needs to be in a common representable format for both databases throughout the life of the data, while simultaneously ensuring that the information is best stored within each database to get the performance you need.

Strict and Relaxed Translation

Whenever you are moving data between different databases you need to determine whether the type and structure of information is important enough, or critical enough, that it must be represented in its native format.

That may sound like a completely nonsensical approach to data – surely the quality of the data is absolutely critical and should be represented as such everywhere? In theory it should, but different database storage environments treat and handle the data in different ways according to the basic type in question.

To understand why this is important, we need to look back, both historically and technically, at why information was stored in the strict formats we described in the last section.

In any older, and particularly RDBMS-based, database solution, data was stored in fixed types and with fixed lengths so that the record could be manipulated as efficiently as possible. We saw some examples of this in Chapter 1. For numerical values, it is much more efficient to store a 32-bit integer as just 4 bytes of data than it is to store the string 2147483647 (which would take 10 bytes).

Similarly, with string types, the primary consideration has always been to minimize the amount of storage reserved for the string because handling bigger strings, or bigger potential blocks for strings, was more expensive in computing time, memory space, and disk space. Back when databases ran on machines with 512KB of RAM, devoting massive blocks of memory to non-usable space allocated but not used to store data just wasn’t an option. This is why 8 character filenames and two or three letter codes for a variety of databases and storage methods became common.

In modern systems of course, we actually have almost the opposite problem. Data sizes are now regularly so large that we need to be prepared to handle massive blocks of information whereas before that might have been impossible. This is fine when we are moving data from a traditional RDBMS to say Hadoop, because we move from a strict environment to a very relaxed one. But when moving in the opposite direction this is not true.

To make matters worse, in many Big Data environments, including most of the Hadoop database layers like Hive, the datatype is only significant at the time the data is queried. Within Hive you can load a CSV file that contains a variety of different types, but unless you explicitly tell Hive that the fifth column is a string or a number, Hive doesn’t really care, and lets you export and query the data as normal.

For example, here’s a table with a mixture of different column types, here using the strict datatypes to define the actual content for each column:

hive> select * from stricttyping;
OK
1      Hello World   2014-10-04 12:00:00   439857.34
1      Hello World   2014-10-04 12:00:00   157.3457
1      Hello World   2014-10-04 12:00:00   4.8945796E7

Now we can view the same data, this time loaded into a table where each column has been defined as a string:

hive> select * from relaxedtyping;
OK
1      Hello World   2014-10-04 12:00:00   439857.345
1      Hello World   2014-10-04 12:00:00   157.3457
1      Hello World   2014-10-04 12:00:00   48945797.3459845798475

The primary differences are in the handling of floating-point values – the top, strict table loses precision: the first value was stored as a FLOAT (439857.34 instead of 439857.345), and the last, stored as a DOUBLE, is displayed in scientific notation with a loss of precision digits. In fact, within Hive, the data is not parsed into the corresponding type until it is used or displayed. If you examine the raw CSV file:

$ hadoop fs -cat stricttyping/sample_copy_1.csv
1,Hello World,2014-10-04 12:00:00,439857.345
1,Hello World,2014-10-04 12:00:00,157.3457
1,Hello World,2014-10-04 12:00:00,48945797.3459845798475

In fact, many people deliberately don’t explicitly load the data into fixed-type columns; they define the column types as strings, import the data, and ultimately ignore the real type until they have to parse or understand it for some reason.

Similarly, in NoSQL environments, the data types may really only matter for explicit representation requirements, and have no effect on the ability to store, display, or query the information. Even in a traditional RDBMS, there is no requirement to explicitly store certain values in certain column types, but certain operations may be limited. For example, most RDBMSs will not perform a SUM() operation on a string column.

The bottom line is that you will need to think about whether to explicitly make use of these columns because you need them as specific types in the target database, or whether to ignore them completely.

  • Strict transformations – Use strict datatypes when the information you want to store must be correctly interpreted within the target database, and it provides some form of performance advantage, unless doing so reduces the validity or precision of the information.
  • Relaxed transformations – Use relaxed transformations whenever the target system does not support the required precision, or in those cases where the processing of the information is irrelevant. Most transfers to NoSQL and Big Data environments fit this model automatically.

With these options in mind, let’s look at some of the more specific types available.

Handling Numeric Values

Simple, plain, integers are supported by nearly all databases as explicit and identifiable types. Even document databases such as MongoDB and Couchbase understand the significance of a numeric value over a string representation.

However, if you are transferring big integers, be conscious of the limitations of the target database. Some environments explicitly support very large integers. Hive, for example, supports the BIGDECIMAL datatype, which holds numbers with up to 10 to the power of 308. Others do not.

Floating Point Issues

The biggest problem with floating-point values is one of precision and storage capability. There are large variations in the supported types, how much is stored and how precise it can be. Furthermore, some databases specifically differentiate between decimal and floating-point values and have different rules for how these should be represented and stored, and the two are not necessarily compatible.

For floating-point values, the main issues are:

  • Representation – float values are generally displayed as a decimal value, for example 3247234.32434. There are no specific rules for this, but many values are based on the operating system's own data type support. On modern 32-bit (and 64-bit) systems, floating-point values tend to have 7 digits of precision after the decimal point, due to the structure used to store and define them internally. A double has twice the precision, up to 15 or even 16 digits past the decimal point.
  • Parsing – parsing these values properly is critical if you are storing the data; unfortunately, rounding errors, made both when the data is output and when it is parsed back, are notoriously difficult to control and not always religiously honoured. For this reason, some databases explicitly support a DECIMAL type. Unlike a float, the DECIMAL type works more like two integers either side of the decimal point.

Processing these values reliably, and storing them in a target database, may lead to problems if the target system doesn’t support the datatype size, precision, or structure properly. A simple movement of the data in an export/import environment might parse and store it correctly, or it may lose or truncate the precision entirely.

If you are replicating and regularly exchanging data from one database to the other and back again, these precision errors can build up until a value drifts by a statistically significant amount. If the double type within the database environment does not support the complexity or precision of the values involved, consider using one of the big integer types and a suitable multiplier.

Finally, if the target database does not support the precision and detail that you need, consider moving the data using relaxed methods, for example by importing the data into a string rather than a numerical type, so that it can be reliably stored.
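
The two approaches might look like the sketch below; the table and column names are hypothetical, and the multiplier is simply whatever power of ten preserves the precision you care about.

-- Option 1: scaled integer. Store 123.456789 as 123456789 with a known
-- multiplier of 10^6, avoiding float rounding on either side of the transfer.
CREATE TABLE readings_scaled (
    id            BIGINT NOT NULL,
    reading_micro BIGINT NOT NULL    -- value * 1,000,000
);

-- Option 2: relaxed transfer. Keep the full text representation and parse it
-- only when (and where) the precision is actually needed.
CREATE TABLE readings_relaxed (
    id          BIGINT NOT NULL,
    reading_raw VARCHAR(64) NOT NULL
);

INSERT INTO readings_scaled  VALUES (1, 123456789);      -- 123.456789
INSERT INTO readings_relaxed VALUES (1, '123.456789');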

Base-N Numbers

If you are exchanging numbers in other than base 10, for example, octal, hexadecimal, or others, ensure that the target database supports the required number format. If an explicit number format is not supported, either translate the number to decimal and handle the output and formatting of the data as the corresponding type within the target database and application, or use the relaxed method and keep it as a string.

Strings and Character Encoding

More problems are created by strings than you might think, and the most significant is usually the character set used to store, transfer, and import the data. Character sets used to refer to the difference between the byte-level encoding for things like EBCDIC and ASCII. Today, they span a much wider array of issues as the number of character sets and the use of a wider range of languages, characters, and ideographs increases.

The best way to encode strings when moving data between databases is to use either UTF-8 (which encodes Unicode characters as sequences of 8-bit bytes) or one of the wider encodings if your data requires it. For example, if you are specifically storing foreign-language, katakana, or Chinese characters, using UTF-16 or UTF-32 may be more reliable, if not necessarily more efficient. UTF-8 can be used for a very wide range of different Unicode characters and is rarely a hindrance.

Also be aware that some databases identify character encoding capabilities and data types differently. For example, the VARCHAR2 type within Oracle can be used to store strings with an optional size (byte or character) declaration, while the NVARCHAR2 type is the Unicode (byte-sized) datatype. The definition of the column size can also differ. In Amazon Redshift, for example, the size of a VARCHAR column is defined in bytes, but in MySQL it’s defined in characters, so a VARCHAR(20) in MySQL may have to be a VARCHAR(80) in Redshift. Sneaky.
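
A quick sketch of that sizing difference (the table is hypothetical): the MySQL definition counts characters, so the equivalent Redshift definition has to allow for up to four bytes per UTF-8 character.

-- MySQL: VARCHAR(20) means 20 characters, whatever the encoding.
CREATE TABLE notes (
    note VARCHAR(20)
) CHARACTER SET utf8mb4;

-- Amazon Redshift: VARCHAR(n) is n bytes, so the same 20-character column
-- needs to be declared as VARCHAR(80) to be safe.
CREATE TABLE notes (
    note VARCHAR(80)
);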

A secondary issue is one of storage space. Different database environments support different representations of the different character storage types, and sometimes have wildly different performance characteristics for these different types.

Within a NoSQL or Big Data environments, the length (or size) of the string is rarely a problem, as they don’t have fixed or strict datatypes. However, for most RDBMS environments there are specific lengths and limits. Oracle supports only 4000 bytes in VARCHAR2 for example; MySQL supports 255 bytes in a CHAR, or 65535 bytes in a VARCHAR.

Finally, when transferring the information you may need to pay attention to any delimiters. Using CSV, for example, and using quotes to define the field limits only works when there aren’t quotes in the field content.

Dates and Times

With all of the data types we have covered up to now, there have been problems understanding and parsing values because of differences in type, format, or structure, but all of them were largely contained within some simple limits and structures.

Dates are another problem entirely. In particular:

  • Date precision and format
  • Time precision and format
  • Dates or Epochs?
  • Time Zones

These all go together to make dates one of the most complicated of all the types to transfer, because there are so many ways it can go wrong.

Epochs

Epoch values are those where the date is represented as an integer counting, usually, the seconds from a specific reference point in time, from which the current date can be calculated. For example, Unix-based epoch times are represented as the number of seconds that have elapsed since 1st January 1970 at 00:00:00 (12:00am) GMT. Counting forward from this enables you to represent a date. For example, the value:

1413129365

Is in fact 12th October 2014.

There are two issues that arise from Epoch dates, time drift and date limits.

Time drift occurs if the date has been stored as an epoch that is relative to the current timezone. This can happen more frequently than you realize if dates are reconstituted back into an epoch from a local time-based value. For example, some libraries that parse a date without an explicit timezone will assume that the date is within the current timezone of the system.

This is a particularly thorny problem when you realize that epochs have no timezones of their own. This means that the Epoch value:

1413129365

Is 15:56 BST, but 07:56 PST. If you now transfer a PST-based epoch to GMT and then use it without conversion, all your times will be out by 8 hours. If you ran batch jobs on the imported data at 1am, that time would actually refer to a completely different day.

If you must use epoch values, ensure that you either know what the timezone was, or adjust the value so that it is against GMT and translate to the appropriate timezone when you need to. Also see the section on timezones below.

The secondary problem is date limits. Traditionally, epoch values were stored as 32-bit integers, which limits the date range to between 1970 and 2038. While this is fine for current times (at least for the next 24 years or so), for any future dates this can be an issue.

If you are porting epoch values to a target that only supports 32-bit dates, and the date is beyond 2038, don’t transfer it using the epoch value; translate it into an explicit date that can be parsed and stored in whatever format is required for the target environment. For example, within MySQL you can use the FROM_UNIXTIME() function to translate your epoch date into something more usable.
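
For example, a minimal sketch of doing that conversion on the way out of MySQL (the table and column names are hypothetical; FROM_UNIXTIME() returns the value in the session time zone, so the session time zone is pinned to UTC first to keep the output unambiguous):

-- Make the conversion unambiguous by fixing the session time zone to UTC.
SET time_zone = '+00:00';

SELECT id,
       FROM_UNIXTIME(created_epoch) AS created_utc
  FROM events;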

Date Formats

When transferring dates, use a format that is unambiguous and supported by the target system. Different locations and systems have different ways of representing dates, including different separators and different orders of the components. Even the prefixes used for some numbers differ between regions. Some examples are shown in the table below.

Location   Format
USA        Month.Day.Year
Japan      Year-Month-Day
Europe     Day.Month.Year
UK         Day/Month/Year

Different locations and date formats

The best format to use is usually the ISO format:

YEAR-MONTH-DAY

With a zero prefix added to each value to pad it to the correct number of characters. For example, the 1st of January:

2014-01-01

Or the year 1:

0001-01-01

The ISO format is not only readable on just about every platform, it also has the advantage of being sortable both numerically and by ASCII code, making it a practical way of exporting and loading data in date order without having to explicitly order data by dates.

Time Formats

Time is usually restricted to a fairly obvious format, that of:

Hours:Minutes:Seconds

Or in some regions and standards:

Hours.Minutes.Seconds

Aside from the timezone issue, which we will look at next, the other problem is the level of precision. Some databases do not support any precision beyond seconds. Within Oracle, for example, you can store fractional seconds to up to 9 decimal places; Amazon Redshift supports only 6 digits of precision.

Also be aware that some environments may not support explicit date and time types, but only a unified datetime or timestamp type. In this case, the structure can be even more limited. For example, within Amazon RedShift, the timestamp datatype is actually formatted as follows:

YYYYMMDD HH:MM:SS

With the date in ISO format but without explicit date separators.

Time Zones

Every time represented anywhere is actually a time within a specific timezone, even if that timezone is UTC (Coordinated Universal Time). The problem with timezones is that the timezone must either be explicitly stored, shared, and represented, or it must be stated or understood between the two systems that the time is within a specific known timezone. Problems occur either when the timezone is incorrectly represented, or when assumptions are made.

For example, the following time:

2014-09-03 17:14:53

Looks clear enough. But if this has come from the BST (British Summer Time) timezone and gets imported into a database running in IST (India Standard Time), then the time will be stored incorrectly if the timezone is not explicitly specified.

Another issue is when there are timezone differences when data is transferred, not because of the physical time difference, but because of the effect of daylight savings time. Transferring data from, say, GMT to PST is not a problem if you know the timezone. Transfer the data over during a daylight savings time change, and you can hit a problem. This is especially true for timezones that have different dates for when daylight savings time changes.

Finally, be prepared for databases that simply do not support the notion of timezones at all. To keep these databases in synchronization with other databases with which you might be sharing information, the easiest method is to use GMT.

In general, the easiest solution for all timezone related data is to store, transfer, and exchange the data using the UTC timezone and let the target database handle any translation to a localized timezone. If you need to explicitly record the timezone – perhaps because the data refers to a specific timezone as part of the day – then use the time type that supports it, or store a second field that contains the timezone information.
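
A minimal sketch of that approach in SQL (the names are hypothetical, and CONVERT_TZ() with named zones assumes the MySQL time zone tables have been loaded):

-- Store event times in UTC; keep the local zone as data only where it matters.
CREATE TABLE events (
    id        BIGINT NOT NULL,
    event_utc DATETIME NOT NULL,        -- always stored in UTC
    event_tz  VARCHAR(40) NOT NULL      -- e.g. 'Europe/London', kept for display
);

-- Translate to the local zone only when the data is read:
SELECT id, CONVERT_TZ(event_utc, '+00:00', event_tz) AS event_local
  FROM events;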

Compound Types

We’ve already looked at some of the issues in terms of the structural impact of compound types. Even if you have the ability to represent a value as a compound structure within your target data, you need to understand the limitations and impact of compound types, as not all systems are the same. Some of these limitations and effects are database specific, others are implementation specific.

For example, within MySQL and PostgreSQL, the ENUM type enables you to store a list of fixed string-like values that can be chosen from a fixed list. For example:

ENUM('One','Two','Three','Four')

The benefit of this from a database perspective is that internally each string can be represented by a single number, but only reconstituted into the string during output. For targets that do not support it, therefore, the solution is to translate what was an ENUM column in MySQL into a string in the target database.

MySQL also supports the SET type, which is similar to ENUM, except that the value can refer to multiple options. For example:

SET('Mon','Tue','Wed','Thu','Fri','Sat','Sun')

The SET type enables you to record not only the specific day, but maybe a group of days, for example:

INSERT INTO table VALUES ('Mon,Wed,Fri')

Again, internally this information is represented as a set of bits, and so the actual value is only implied in storage and displayed as a compound type.

When translating to a database that doesn’t support the type, you may either want to create an additional column for each value to store it, or, if you are using a document database, you may want the set field converted to an array or hash of values:

{
    'Mon' => 1,
    'Wed' => 1,
    'Fri' => 1
}

Always consider how the data was used and will be searched on either side of the database transfer. In an SQL RDBMS queries normally take the form:

SELECT * FROM TABLE WHERE FIND_IN_SET('Mon',days)>0;

That is, return all the rows where the field contains the value ‘Mon’. In a database that supports searching or indexing on individual values (MongoDB, Couchbase), a key-based transfer works better: we effectively set each member of a hash to a placeholder value so that we can do key-based lookups. We’ll examine this in more detail when we look at the environments of these databases.

Serialized and Embedded Formats

For a whole variety of reasons, some people store more complex formats in their database so that they can be more easily manipulated and used within the deeper elements of their application.

For example, serializing an internal structure, such as a Perl or Java object, so that it can be stored in a field or BLOB within the database is a good way of making use of complex internal structures while still having the ability to store and manipulate the more complex data within the application environment.

If all you want is to transfer the serialized format from one database to another, then the basics are unlikely to change. You may need to use the binary translation methods in the next section to get the data over into the new database reliably, but otherwise the transfer should be straightforward.

However, there is also the possibility that you want to be able to query or extract data that has been embedded into the serialized object.

In this case, you need to change the way you use and manipulate the information as part of the data migration process; you may want to expand the data to expose the embedded fields as transferable data.

Or, you may more simply want to change the content of the information from its serialized format into a more universal format, such as JSON.

Binary and Native Values

Binary data, that is, data that is explicitly stored and represented in a binary format, is difficult to process when moving between databases.

  • Binary means that single-character delimiters become useless, even control characters: 0x01 is just as likely to come up in binary data as it is when used as a field separator.
  • Pure, native binary data suffers from problems of ‘endianness’, that is, the byte order of the individual bytes. Text and numerical translations don’t tend to suffer from this because systems know how to parse text, but when exchanging raw binary data the endianness of the data matters.

Binary data can also be affected by any translation or migration process that is expecting a string representation of information. For example, it is not uncommon for UTF8 to be used when reading binary data, which leads to interpretation and storage problems.

In general, the best advice for true binary information is to encode the data into one of the many forms of binary-to-text translation formats. This can include solutions such as raw hex conversion, where the data is quite literally expanded to a two-character hex string for each binary byte. For example, we can translate any byte string into hex values with tools like Perl:

$ perl -e "print unpack('H*','Hello World')"
48656c6c6f20576f726c64

Or use uuencode:

begin 666 HelloWorld
,2&5L;&\@5V]R;&0*
`
end

Or use the MIME64 (Base64) standard that is employed in many modern email and Web environments for transferring attachments, as it ensures that even multi-byte characters are transferred effectively.
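
If the data is passing through a database that already has encoding functions, you can often avoid external tools altogether. For example, a reasonably recent MySQL (5.6 or later) provides hex and Base64 conversions directly; a small sketch:

SELECT HEX('Hello World');                       -- 48656C6C6F20576F726C64
SELECT UNHEX('48656C6C6F20576F726C64');          -- back to the original bytes

SELECT TO_BASE64('Hello World');                 -- SGVsbG8gV29ybGQ=
SELECT FROM_BASE64('SGVsbG8gV29ybGQ=');          -- back to the original bytes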

All of these solutions can easily be processed on the other side back into binary format according to the endianness of the host involved.


Data Migration: Mapping the Data

When moving data between different databases, the primary consideration is what the data is going to look like so that it can be used in the target environment. Later chapters will dig deeper into this topic, but let’s fly over some of the key considerations here.

Mapping Columns to Tables

If we were replicating this data from our existing RDBMS into another, the most obvious method is for us to simply move the tables wholesale from one environment to the other. If they both support table structure, then there is no reason not to duplicate this structure on the other side.

 

But always be conscious of how the data is going to be handled on the other side. If your target database does not support joins between tables, as some Hadoop alternatives do not, then you will need to determine whether you are better off merging the table data into a bigger table, either duplicating the information or hiding it.

For example, if you wanted to analyse which recipes make the most use of chicken, you could combine the tables from the transactional side into a pre-joined table that contains the detail. For example, converting our three tables, ingredients, recipe ingredients, and recipes, into something like this:

|     3006 | Tabboule            | <olive oil>         |
|     3006 | Tabboule            | <salt>              |
|     3006 | Tabboule            | <onion>             |
|     3011 | Minted pea puree    | <olive oil>         |
|     3011 | Minted pea puree    | <frozen peas>       |
|     3011 | Minted pea puree    | <seasoning>         |
|     3011 | Minted pea puree    | <butter>            |
|     3012 | Lamb patties        | <ground coriander>  |
|     3012 | Lamb patties        | <ground cumin>      |
|     3012 | Lamb patties        | <ground turmeric>   |

Now we have the information from the ingredients table merged with the recipe ID and title from the source database. Over in our destination store, performing a count or summation operation is now much easier. In fact, with a single table structure like this some operations are quicker and still provide the information we want; for example, finding all the recipes with frozen peas in them is a query against a single (possibly indexed) table:

+----------+--------------------+---------------+
| recipeid | title              | description   |
+----------+--------------------+---------------+
|      984 | Waffle fish pie    | <frozen peas> |
|      633 | Vegetable korma    | <frozen peas> |
|       27 | Spicy tuna stew    | <frozen peas> |
|     1261 | Seafood paella     | <frozen peas> |
|      902 | Choux au gratin    | <frozen peas> |
|      866 | Tomato baked rice  | <frozen peas> |
|     1971 | Spicy risotto star | <frozen peas> |
|     2741 | Cheat's jambalaya  | <frozen peas> |
|     2750 | Spicy sausage rice | <frozen peas> |
|     2778 | Quick jambalaya    | <frozen peas> |
|     3011 | Minted pea puree   | <frozen peas> |
+----------+--------------------+---------------+

In a columnar store this can be orders of magnitude faster than a join across the three source tables, and still provides us with the core of information we want to display – the recipe id and title.
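
Building that pre-joined table is a single statement on the source side; a minimal sketch, with the table and column names assumed from the examples rather than taken from a real schema:

CREATE TABLE recipe_ingredient_flat AS
SELECT r.recipeid,
       r.title,
       i.description
  FROM recipes r
  JOIN recipe_ingredients ri ON ri.recipeid = r.recipeid
  JOIN ingredients i         ON i.ingredientid = ri.ingredientid;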

Mapping Columns to Documents

Moving the data over verbatim as tables is unlikely to work as efficiently as you think. For example, in a NoSQL database, joins are normally either impossible or computationally expensive, so expecting to be able to reconstitute the structure in the destination database is unrealistic. You also want to make sure that you are using the most efficient method for the database system. Simply putting a JSON representation of each row from an existing table or column store is probably not going to work, especially as JOINs are typically unavailable. A better solution is to merge the data from multiple tables into a single document that contains all of the information you need.

Doing this for table data to be inserted into a document store normally requires the composition of a document from the constituent elements; in the case of our recipe database, the recipe, method, ingredients, and nutritional information need to be combined into one big document. There are a variety of ways to do this, but an obvious solution is to logically group ‘objects’ together: that is, an object that might be represented by a collection of tables. Like this:

[Figure: a collection of related tables logically grouped into a single document object]

Within our recipe data, for example, in a document store the use case is for us to extract or remove the entire recipe – base data, ingredients, and methods – as a single document that contains all the information we need. This puts all of the information in one document, and makes it easy to update and format as that entire recipe at a time. We can actually see a sample of this, first by looking at the diagrammatic example, here with some dummy data, but you can see how the tables on the right can be mapped to fragments of the document on the left.

[Figure: Document and Table Mapping]

We can also look at a simple script that performs this operation for me, here collecting the recipe object (which queries the underlying database) and then converting that into a JSON structure for writing into the database:

use JSON;
use lib 'cheffy.com';
use Foodware;
use Foodware::Public;
use Foodware::Recipe;

my $fw = Foodware->new();

my $recipes = $fw->{_dbh}->get_generic_multi('recipe','recipeid',{ active => 1});


foreach my $recipeid (keys %{$recipes})
{
    my $recipe = new Foodware::Recipe($fw,$recipeid,{ measgroup => 'Metric',
                                                 tempgroup => 'C',});

    my $id = $recipe->{title};
    $id =~ s/[ ',()]//g;
    my $record = {
       _id => $id,
       title => $recipe->{title},
       subtitle => $recipe->{subtitle},
       servings => $recipe->{servings},
       cooktime => $recipe->{metadata_bytag}->{totalcooktime},
       preptime => $recipe->{metadata_bytag}->{totalpreptime},
       totaltime => $recipe->{metadata_bytag}->{totaltime},
       keywords => [keys %{$recipe->{keywordbytext}} ],
       method => $recipe->{method},
    };

    foreach my $ingred (@{$recipe->{ingredients}})
    {
       push(@{$record->{ingredients}},
            {
               meastext => $ingred->{'measuretext'},
               ingredient => $ingred->{'ingredonly'},
               ingredtext => $ingred->{'ingredtext'},
            }
           );
    }

    print to_json($record);
}

Finally, let’s look at how this is all translated into a full JSON representation of the same information:

{
   "subtitle" : "A good use for bananas when they're going soft.",
   "keywords" : [
      "diet@vegetarian",
      "diet@corn-free",
      "occasion@entertaining",
      "diet@citrus-free",
      "occasion@kids' parties",
      "diet@demi-veg",
      "special collections@lunchbox",
      "meal type@cakes, biscuits, sweets",
      "special collections@classic recipe",
      "diet@peanut-free",
      "cook method.hob, oven, grill@oven",
      "special collections@store cupboard",
      "diet@yeast-free",
      "special collections@budget",
      "occasion@prepare-ahead entertaining",
      "main ingredient@fruit",
      "occasion@picnic",
      "diet@shellfish-free",
      "special collections@cheffy recommended"
   ],
   "preptime" : "20",
   "servings" : "8",
   "cooktime" : "45",
   "method" : [
      {
         "_sort" : "4",
         "recipeid" : "2035",
         "step" : "4",
         "altgroup" : "0",
         "methodstep" : "Spoon into the loaf tin. Spoon the top. Bake for 45-50 min or until well risen and cooked through. ",
         "text_formatted" : "Spoon into the loaf tin. Spoon the top. Bake for 45-50 min or until well risen and cooked through. "
      },
      {
         "text_formatted" : "Slowly beat in the egg. Add the banana. Fold in the flour and bicarbonate of soda. Add the dried fruit. ",
         "methodstep" : "Slowly beat in the egg. Add the banana. Fold in the flour and bicarbonate of soda. Add the dried fruit. ",
         "_sort" : "3",
         "recipeid" : "2035",
         "altgroup" : "0",
         "step" : "3"
      },
      {
         "recipeid" : "2035",
         "_sort" : "2",
         "step" : "2",
         "altgroup" : "0",
         "text_formatted" : "Cream the butter and sugar together until pale and fluffy. ",
         "methodstep" : "Cream the butter and sugar together until pale and fluffy. "
      },
      {
         "recipeid" : "2035",
         "_sort" : "1",
         "altgroup" : "0",
         "step" : "1",
         "text_formatted" : "Preheat oven to 180&deg;C. Grease a 900 g loaf tin.",
         "methodstep" : "Preheat oven to 180.C. Grease a 900g loaf tin."
      },
      {
         "text_formatted" : "Turn out onto a wire tray. Leave to cool. ",
         "methodstep" : "Turn out onto a wire tray. Leave to cool. ",
         "_sort" : "5",
         "recipeid" : "2035",
         "altgroup" : "0",
         "step" : "5"
      }
   ],
   "totaltime" : "65",
   "_id" : "Bananacake",
   "title" : "Banana cake",
   "ingredients" : [
      {
         "ingredtext" : "butter",
         "meastext" : "75 g",
         "ingredient" : "butter"
      },
      {
         "ingredtext" : "bicarbonate of soda",
         "meastext" : "[sup]1[/sup]/[sub]2[/sub] tsp",
         "ingredient" : "bicarbonate of soda"
      },
      {
         "meastext" : "2",
         "ingredient" : "bananas",
         "ingredtext" : "ripe bananas, peeled and mashed"
      },
      {
         "ingredtext" : "white self-raising flour, sifted",
         "ingredient" : "white self-raising flour",
         "meastext" : "200 g"
      },
      {
         "ingredtext" : "salt",
         "meastext" : "1 pinch",
         "ingredient" : "salt"
      },
      {
         "ingredtext" : "egg, beaten",
         "ingredient" : "egg",
         "meastext" : "1"
      },
      {
         "ingredient" : "dried mixed fruit",
         "meastext" : "100 g",
         "ingredtext" : "dried mixed fruit"
      },
      {
         "ingredtext" : "caster sugar",
         "ingredient" : "caster sugar",
         "meastext" : "100 g"
      }
   ]
}

Looking at the output, you can see how the structure of the document has been merged into something more usable. We have a block each for ingredients, keywords, and method, and you can see how the ‘ingredient’ field in the ingredient block could be used as a searchable element.

The primary questions about the format will come down to how the data will be used. In CouchDB and Couchbase, for example, a map/reduce-like process will be used to create an index on the information. Choosing the right structure is therefore about understanding how it will be used. When I come to build an index on this information so that I can query by ingredient, is this an effective way to format the data? Is processing all of that data to determine the occurrences of turkey the most efficient method?

How about if I realize that all my vegetarian recipes are really vegan, will that change the structure? We’ll return to these questions later in the book.

Handling Compound Assignments

While we’re talking about constructing documents from tables and vice versa, I want to consider those pesky compound values again. Compound values are hugely complex when it comes to data translations, because there are differences within a database type as well as between types that need to be considered. With any compound type, you should start by asking three basic questions:

  • How will it be queried in the target database?
  • Can I have multiple values within the same field?
  • Is the order of those multiple values significant?

One translation you want to avoid is converting this relatively structured format into something that ultimately becomes hard to process, for example by converting the ‘field’ from the source CouchDB environment into a comma-separated list that looks similar to the original:

Indian,Chicken,Okra,Spicy

There are so many dangers with this, it’s hard not to see how this is a bad idea. Some obvious problems are:

  • How do we cope with an ever-increasing number of potential options? We’re showing just four here, what if there were 20? 30? 60?
  • What happens if more data than the width of the column is defined? If it’s truncated, we lose information. Using a wider column only works for so long.
  • Indexes become unmanageable from a database administration perspective.
  • How is integrity enforced? Should the options be sorted alphabetically? What happens when they get updated?
  • What happens if we want to change a value? What happens if ‘red’ becomes ‘scarlet’, do we change every row?

The biggest problem with this approach is usability: once that value lands in a transactional or columnar store, the data becomes difficult and potentially expensive to process. It takes far more computational effort for a database to search a text string for a sequence of letters, even with indexes, than to query a join or a set of boolean columns holding the same information.

You can actually test this quite effectively by creating sample tables that emulate this basic functionality. MySQL is being used here to do the comparisons, but the general effects are quite clear. Here’s a search against a single text column using commas:

SELECT id,title FROM recipes WHERE keywords LIKE '%CHICKEN%';

Now here’s the same query where the keywords are represented as boolean columns:

SELECT id,title FROM recipes WHERE kw_chicken = 1;

A boolean index in the latter example will be much quicker than the LIKE search in the former.

Adventures in recipes

When developing an application, and particularly the database that will support it, there comes a time when you realize you haven’t planned for and identified every problem you could have, and that’s when you notice that something doesn’t quite work.

The use of free text for ingredients in the recipe database was one such moment. Searching for a recipe that contains ‘Chicken’ is fine if you look at the ingredients: you pick up everything from ‘whole chicken’ to ‘chicken breasts’ within the search. Unfortunately, you also pick up ‘chicken stock’. When a user searches for chicken recipes, chicken stock turns up in a surprising number of recipes that otherwise contain no chicken of any variety whatsoever.

When migrating data around, you can see the same principles at work; if you rely on a free-text field to store the value instead of separate, stricter fields, the quality of the data suffers. Keep this in mind when moving data around.

The recommendations for how to work out which is the best method are actually comparatively straightforward:

  • If the list of possible values is small (for example, four or five different values) and there are few of them within the overall ‘table’ they are being moved to, use fields/boolean values.
  • If the list is larger, and likely to grow and expand with new options, then use a lookup table.

When applying the same principles the other way round, you should use the most appropriate format for the corresponding target database in a way that makes the data usable. In document databases, for example, it’s efficient to translate a compound value into an array of sub-values, just as in our original document.

Mapping Documents to Columns

When mapping data from a document-based structure back into columns, the process is significantly harder. You are specifically going from a flexible multi-field format where the number and type of fields could be entirely different into one where the format is rigid and inflexible.

There are two choices available:

  • Transform the data back into a table structure that matches the overall data structure of the underlying information as closely as possible. This is great if the document is relatively flat. But if, as we’ve seen, we have compound data, or variable-length compound information, this method doesn’t work as well.
  • Transform the data back out into a multi-table format that has the capability for joins. This is basically the reverse of the process we just examined for converting the table-based recipe data into documents. You must remember to use a unique identifier for the parent record so that it can be linked back properly when a JOIN is used.

The flatter the document, the easier the conversion.
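As a minimal sketch of that second approach, using hypothetical table and column names, a recipe document with an embedded ingredient list might be decomposed into a parent table plus a child table that carries the parent’s identifier:

CREATE TABLE recipes (
    recipe_id INT PRIMARY KEY,        -- unique identifier for the parent record
    title     VARCHAR(255),
    servings  INT,
    preptime  INT,
    cooktime  INT
);

CREATE TABLE recipe_ingredients (
    recipe_id  INT,                   -- ties each ingredient row back to its parent
    ingredient VARCHAR(255),
    meastext   VARCHAR(64),
    KEY (recipe_id)
);

-- Reassembling something close to the original document is then a JOIN
SELECT r.title, i.ingredient, i.meastext
FROM recipes r
JOIN recipe_ingredients i ON i.recipe_id = r.recipe_id;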


Data Migration: Moving the Actual Data

There are two key elements to the exchange of any information between databases. One is the data structure used for the exchange, and the other is the transformation required to reach those structures.

Some of these are driven by the source database, others by the target database. For example, moving data from an RDBMS to a NoSQL database generally requires constructing documents from what might be tabular, or joined-tabular, data. This may involve join elements on the relational side as well as formatting on the NoSQL side. The eventual aim is to ensure that the data reaches the target database in the right format, without corruption, and in a form that is most appropriate or efficient. That ultimately depends on what you are using the transferred data for.

The other aspect is the difference between source and target data types – that is the format and construction of the individual fields or components of the data. Document databases and Big Data stores tend not to care about the data type, whereas RDBMS cannot live without them.

Some important considerations for how we use this information:

  • The data structure must be retained (i.e., we must be able to identify the fields, columns or other elements of the data).
  • The data format and integrity must be maintained (the data should not be corrupted, shortened or reduced in any way).
  • The data must be able to be efficiently transferred (sending a 1GB file that only contains 15KB of valid information is not efficient).

In this chapter we’ll examine some of the key differences and problems with transferring data that transcend the mechanics of the process, and how to deal with them effectively. Although the three primary aspects (basic formatting, structural comparisons and data type limitations) are handled separately here, there are few occasions when you can truly treat these elements individually. We’ll see some examples of this as we go through.

Basic Interchange Formats

When you start to move data efficiently between the different database types that exist, you will find a number of different possible interchange formats, and the primary issue with all of them is exactly how efficiently, and more importantly how accurately, they enable the information to be exchanged.

First, the data will need to have been encoded with some kind of encapsulation format. This format describes the individual structure and is entirely dependent on the underlying data that is being exchanged. For example, if the data has a very rigid structure, then that will normally dictate the format of the information. Row-based data, for example, can normally be encoded using CSV or even a fixed-width record format.

The second aspect is the encoding and formatting of the information itself. Exchanging data using CSV is fine, providing that you can correctly identify the format of the text itself. Should it be encoded in UTF-8? Or UTF-32? Is plain ASCII better? What if it contains a mix of these characters, should UTF-8 be used as the standard and the actual encoding handled by the database target when the data is imported?

In fact, many of the principles about translating information between different databases also rely on basic best practice for how you design and structure the data in the target system to begin with. Normalisation of the data structure can be applied to any database; even those that have a loose or undefined structure have conventions. It’s unlikely, for example, that you will call a recipe title field ‘title’ in one record and ‘recipename’ in another record of the same database, because your application would be a mess.

Of course, there are times when you may be merging, combining or otherwise consolidating data from a wide variety of different documents, records or blocks of information. There it is up to you to pick a standard format so that the data is useful to you once it’s been moved into the target system.

As a rough guide for the types of operation and translation that might take place, the following table highlights the kind of structural transformation and changes you might need to make when moving between some of the most common database environments.

Table 2-1: Structural Mappings between database environments

Source \ Target                    RDBMS                  Columnar Store         Document Database      Freetext/unstructured data store
RDBMS                              Vendor specific only   Vendor specific only   Field mappings only    Application specific
Columnar Store                     Vendor specific only   Vendor specific only   Field mappings only    Application specific
Document Database                  Field mappings only    Field mappings only    Vendor specific only   Application specific
Freetext/unstructured data store   Application specific   Application specific   Application specific   Application specific

Notes:

  • Vendor specific only changes are those that are directly related to the capabilities of the source or target database. MySQL, for example, supports the ENUM and SET compound field types, whereas Oracle, PostgreSQL and SQL Server do not. Moving from one to the other may require changes.
  • Field mappings only refers to how you map the source fields or columns to the target fields/columns. Depending on the target this may include compound and/or JOIN based translation. For example, when moving from a document database to an RDBMS you might convert a compound field into a single field, or a lookup table. When translating from an RDBMS to a document store, the data might be combined using a JOIN into a single target field.
  • Application specific changes are those that will entirely depend on how you use the information. Translating document data into freetext databases is unlikely to require any changes, but converting freetext information into an RDBMS format is going to require some significant identification and translation.

Let’s dig into some more of the specific challenges.

Row-Based Data

For row-based data, the information can generally be represented and formatted as one of the regularly used and displayed formats, such as Character Separated Values (i.e. CSV) or a fixed-width format. Row-based data (which includes the column-based data used in big data stores) is probably one of the easiest formats of data to exchange: in nearly all cases the list of columns is fixed, and so the format of the data is well known.

Character Separated Values (not Comma-Separated Values) is one of the oldest methods of exchanging fixed-format data like this, and was often the only available method for exchanging information in a reliable fashion. Historically, most tabulated data like this tended to be financial information, and so the content and format of the information was relatively simple. As such, the most common format was to use carriage returns (or the operating system equivalent, which could be newlines or carriage-return and newline characters) to separate the records, while the individual fields in each row of data were separated by a comma (hence Comma-Separated Values as the usual expansion of CSV).

For example:

1085,Creamy egg and leek special,,4,1,0,0,0.0,0.0,0
87,Chakchouka,A traditional Arabian and North African dish and often accompanied with slices of cooked meat                      ,4,1,0,0,0.0,0.0,0
347,Mozzarella and olive pizza,A very simple pizza base made without yeast topped with traditional Italian ingredients. Look out for a low fat type of Mozzarella or Cheddar cheese if a low fat diet is being followed.,4,1,0,0,0.0,0.0,0
720,Savoury pancakes,Basic pancake recipe. Allow 30 min standing time for batter.,8,1,0,0,0.0,0.0,0
477,Seafood filling for pancakes,,8,1,0,0,0.0,0.0,0

The problem with commas and carriage-return characters is that, as computers got more complex, and the data they stored got equally more complex, how do you distinguish between a comma within some text and a comma separating a field? What if you transfer a text string that contains a newline or carriage return? You don’t want that interpreted as the end of the record if it happens to be part of the field. The initial solution is to use some kind of further delimiter, for example, double quotes:

"1085","Creamy egg and leek special","","4","1","0","0","0.0","0.0","0"
"87","Chakchouka","A traditional Arabian and North African dish and often accompanied with slices of cooked meat                      ","4","1","0","0","0.0","0.0","0"
"347","Mozzarella and olive pizza","A very simple pizza base made without yeast topped with traditional Italian ingredients. Look out for a low fat type of Mozzarella or Cheddar cheese if a low fat diet is being followed.","4","1","0","0","0.0","0.0","0"
"720","Savoury pancakes","Basic pancake recipe. Allow 30 min standing time for batter.","8","1","0","0","0.0","0.0","0"
"477","Seafood filling for pancakes","","8","1","0","0","0.0","0.0","0"

This doesn’t fix the problem; it just diverts your attention for long enough to realize that the same question now applies to the delimiting characters themselves: what happens if one of them needs to be used in the text? We can escape it by prefixing it with a backslash:

"700","Garlic mushroom kebabs","The longer you leave these mushrooms to marinate, the better they will taste.\nGood for barbecue.","8","1","0","0","0.0","0.0","0"

But now we’re getting complex, both to read and write the information; the level of complexity is increasing to the point of introducing further possible ways of corrupting the data as it gets transferred.
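Whichever combination of delimiters, enclosures and escapes you settle on, both ends of the exchange have to agree on it explicitly. As a sketch, this is how those choices are typically declared when importing into MySQL with LOAD DATA; the file path and table name here are hypothetical:

-- Declare the field, enclosure, escape and record conventions explicitly
LOAD DATA INFILE '/tmp/recipes.csv'
INTO TABLE recipes
FIELDS TERMINATED BY ','
       ENCLOSED BY '"'
       ESCAPED BY '\\'
LINES TERMINATED BY '\n';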

The alternative is to use a different delimiter that is unlikely to be used within the text in any form. Hadoop, in fact, follows this model: Hive’s default text format uses the character 0x01 to delimit fields and 0x0A (the newline) to delimit records. A control character such as 0x01 is unlikely to appear in normal human-readable text. Of course, once you start transferring binary data, you need to find another method, such as hex- or Base64-encoding the binary data:

"700","Garlic mushroom kebabs",VGhlIGxvbmdlciB5b3UgbGVhdmUgdGhlc2UgbXVzaHJvb21zIHRvIG1hcmluYXRlLCB0aGUgYmV0dGVyIHRoZXkgd2lsbCB0YXN0ZS5cbkdvb2QgZm9yIGJhcmJlY3VlLgo=,"8","1","0","0","0.0","0.0","0"

The other alternative is to use a fixed width format. This has the advantage that providing you know the exact widths of the individual fields, encoding errors are eliminated because the characters are no longer significant in the format of the information.

The downside of the fixed-width format is that the size of the fields, records, and ultimately files, can become prohibitively large if you are exchanging potentially large or unlimited fields. For example, BLOB types in most databases can be MB or GB in size; expressing that in fixed width format is obviously not a practical solution.

Record-Based Data

Record-based data is information that may not necessarily be identifiable or resolvable with an easy-to-use, row-specific format or structure such as that used in CSV exchange. Complex table data, or information that is made up of a combination of fixed fields and BLOB fields, for example, is unlikely to be reliably, or efficiently, transferred this way. The problems of character and structural formats will ultimately make using that information difficult or computationally expensive when actively sharing the information – for example, by making the file sizes too big to be practically exchanged.

A very typical example of record-based information is information from a document-based database, a free-text database, or a case where the majority of the content is in fact an attachment or other non-inline field alongside the database data. Think of an email message: the message, the address, the sender and the subject are all examples of easily identifiable and classifiable database information. But what do you do with an attachment that might have been sent along with the record?

How about documents generally? Metadata about those documents could be represented in a typical RDBMS row, but not the document itself. But the combination of the two – the metadata and the actual document – makes up a ‘record’ that you may want to effectively share all or part of with another database.

When transferring record-based data, think first about what constitutes the record and how that can be represented in the different environments. Then move on to understand how the fields and individual data points can be translated into a format suitable for the target database. With record-based data, it may be that you have a massive volume of data and compound records that, when moved from a document store to a transactional RDBMS, require 20, 30 or 200 rows of data to be represented properly; this is not a problem, providing you have a method for correctly identifying all of the row data that refers to the record.

In general with a record based database the easiest approach is to actually translate the data at the source into something that can be imported directly into the target format. For example, from a record-based environment you can generate four different blocks of CSV import data, one for each table and portion of the source information.

The opposite is actually also true; when converting data from a column or table store into a record based format, it normally makes sense to do this on the basis of translating the key data into the new structure before doing the transfer. As a general rule, either use the native target format if you can, or make use of representable formats such as JSON to do the actual encapsulation of the information. Many record or document-based environments already use JSON, or a format similar to this.

{
    "title": "Fried chilli potatoes",
    "preptime": "5",
    "servings": "4",
    "totaltime": "10",
    "subtitle": "A new way with chips.",
    "cooktime": "5",
    "ingredients": [
        {
            "ingredtext": "chilli powder",
            "ingredient": "chilli powder",
            "meastext": "3-6 tsp"
        },
        {
            "ingredtext": "potatoes, peeled and cut into wedges",
            "ingredient": "potatoes",
            "meastext": "900 g"
        },
        {
            "ingredtext": "vegetable oil for deep frying",
            "ingredient": "vegetable oil for deep frying",
            "meastext": ""
        }
    ]
}

One final consideration is those situations where there is no structure – or the structure is so ephemeral or complex that there is no way to map information. You cannot, indeed should not, either impose a structure just for the sake of one, or inversely rely on sub-standard features in the target database just because it makes your life easier.

Some examples of this include trying to extract key fields or information from unstructured data that are complicated either to identify, or to map back to the original. Extracting a telephone number from a massive text string just because you can does not mean that the telephone number you have extracted is really the one that should be associated with this record in your database. Equally, relying on full-text searching engines within RDBMS environments can be problematic.

Is that a Column or a Field?

Not all fields and columns are created equal, and many of these difficulties come from the complexities or abilities of the database environment being used. Specifically, at which point do you treat a fragment of the data that you are dealing with as a column, or a field, or even just a uniquely identifiable piece of information?

As you move between different environments, the differences become more pronounced or more difficult to identify. True columnar stores, for example, tend to have a significantly reduced set of supported datatypes, and that often limits your ability to store certain values and information. For example, compound values, or specialist types such as XML, GeoData and spatial points, may be storable in one database but not another.

Consider this record, taken from a CouchDB (JSON document) database:

{
   "title" : "Chicken Curry",
   "Keywords" : [
      "Indian",
      "Chicken",
      "Okra",
      "Spicy"
   ],
   "id" : "8023754"
}

Now is the ‘Keywords’ compound object in the document a field, or is it a column? In MySQL we could translate this into a SET datatype, a special compound datatype, providing we knew all of the possible values for that column. In Oracle, a field that has multiple possible values like this would normally either be split into separate columns as a bit or boolean value, or it would be converted to a lookup table.


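As a minimal sketch of that lookup-table structure, using hypothetical table and column names:

-- One row per known keyword
CREATE TABLE keywords (
    keyword_id INT PRIMARY KEY,
    keyword    VARCHAR(64)            -- 'Indian', 'Chicken', 'Okra', 'Spicy', ...
);

-- One row per recipe/keyword pairing
CREATE TABLE recipe_keywords (
    recipe_id  INT,
    keyword_id INT,
    PRIMARY KEY (recipe_id, keyword_id)
);

-- Finding chicken recipes becomes an indexed JOIN rather than a pattern match
SELECT r.id, r.title
FROM recipes r
JOIN recipe_keywords rk ON rk.recipe_id = r.id
JOIN keywords k         ON k.keyword_id = rk.keyword_id
WHERE k.keyword = 'Chicken';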
Depending on your use case, within a strict columnar environment such as Cassandra or HP Vertica you might actually consider going the other way and repeating the information with the keyword data in individual rows, like this:

dbadmin=> select * from recipes_kwbytext;
   id    |        title         |     kw
---------+----------------------+------------
 8023754 | Chicken Curry        | Indian
 8023754 | Chicken Curry        | Chicken
 8023754 | Chicken Curry        | Okra
 8023754 | Chicken Curry        | Spicy
(4 rows)

With a column store this can be more efficient if what you are looking for is patterns in the data, because repeated data like this is easy to extract and identify. In this case, what we’ve done is convert something that is a compound field in a document store into multiple rows with the same ID in a column store. This solution can also be used in environments where there are no JOIN operations, or a JOIN is expensive, but where the information is still required at each level. Good examples here are many of the document stores and structureless environments such as Hadoop.
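For instance, spotting which keywords dominate the data becomes a straightforward aggregate over those repeated rows; a sketch against the recipes_kwbytext layout shown above:

-- Count how often each keyword appears across all recipes
SELECT kw, COUNT(*) AS recipe_count
FROM recipes_kwbytext
GROUP BY kw
ORDER BY recipe_count DESC;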

Can you Bypass Datatypes?

Another temptation when translating data between very different database formats is simply to ignore the formatting, and especially the very strict datatypes, that might normally define the data being stored.

This is particularly a problem within those environments where there may be a very limited set of datatypes to work with, and where handling can be application specific. For example, the Hive database environment within Hadoop is really little more than a thin veneer over a text format used to store the actual data. When you define a table within Hive and then select the rows from the table, Hive parses each row and uses that to display the value in the corresponding format.

This can cause problems for certain data, for example, numbers that are too big, or dates that don’t match the very limited set of date formats supported by the Hive parser. In the long term, this causes corruption of the data that you have transferred.

For this reason, some people choose to create tables within Hive that use a text datatype to hold the information, rather than the true underlying integer or floating-point type, as it ensures the raw value, not the interpreted value, is used.
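A hedged sketch of that approach in HiveQL, with hypothetical table and column names; every column is declared as STRING so the raw text is preserved rather than interpreted:

-- Everything stored as raw, uninterpreted text
CREATE TABLE recipes_raw (
    id       STRING,
    title    STRING,
    serves   STRING,      -- numeric in the source, kept as text here
    added    STRING       -- a date in the source, kept as text here
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;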

The same process can be used when moving data; extract the ‘raw’ value rather than hope the source or target database will interpret, store and display the information in the right format.

If you are only sharing or displaying the information in the new target database then there is probably no reason to worry. If you start processing or actively using the data, this is where corruption can occur if you are not storing the information correctly. For example, if an integer is stored and then incremented, you want 10,000 to become 10,001, not 100001.

The bottom line, you can bypass the datatype, but probably shouldn’t if you hope to use the information in the format in which you’ve decided to store it. If you have a datatype, and can identify it from the source material, then use it if the target environment can handle and interpret it correctly. See the notes later in this chapter on limitations in different environments.

Optimization and Performance

Irrespective of the database environment and the reason for moving the data, the end goal should always be to move data into a target in a format that will be efficient to use at the target end.

In many document or unstructured systems, or those with very flexible storage mechanisms such as Hadoop, the performance will often be predicated not on the structure of the information, but on what information is closest to you, or what can be pre-mapped or organized through a map/reduce or index-generation exercise.

Conversely, RDBMSs require highly structured and organized data structures, both with and without indexing, to provide the best performance. Columnar stores are often much more efficient if you can logically group or sort information together. Some will handle this automatically for you; others are more efficient if you can pre-determine the distribution of the data on which you are most likely to sort and query. That might mean that when you transfer the data, you sort the generated file by that column or columns before loading. In some extreme examples it may be that you load the data in an unordered format and then move it again into a new table with the right column structure.
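As a small sketch of that idea, the sort can be pushed into the export itself so the generated file already arrives in the order the target prefers; MySQL’s SELECT ... INTO OUTFILE is used here, and the file path and table are hypothetical:

-- Export the rows pre-sorted on the column the target will group and query by
SELECT id, title, kw
INTO OUTFILE '/tmp/recipes_by_kw.csv'
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
FROM recipes_kwbytext
ORDER BY kw, id;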

Don’t be afraid of making the wrong decision, because you can often sort this structure out during a secondary or tertiary stage, but equally don’t ignore it. Having to parse or process large bodies of data a second or third time will be impractical if you are sharing or replicating data compared to single, isolated, dumps.

Ensure Two-way Validity

Without good reason, you should always try to avoid making any kind of translation of format that cannot be reversed, undone, or translated back into its original format, even if that might make the process a little more complicated. Remember that data is often a living organism when being actively used and employed. Doing too much to format, combine, extract or otherwise manipulate the information can therefore make it difficult to use again elsewhere.

Note that this is not about normalization. Normalization in typical database parlance means finding the right, fixed, database type for the field data, making it the most efficient choice, and understanding the limits and structure of the data so that you can decide whether a field should be 10 bytes or 12 bytes long. Doing this normally results in identifying the data structure, lookup tables, relations and joins so that you have the right structure. In this context, normalization is really about making the data look like a typical table structure in an RDBMS; normalization for document databases is entirely different. Normalisation for data interoperability is another level still, and we’ve already seen a number of different examples of that.

Instead, think about the quality of the data and how it should be used, while keeping in mind that the structure required for efficiency in an RDBMS may be completely different to the efficient storage of the same information in a document DB, or when resolved down to a text fragment in an unstructured data store.

To return to a well-trodden example, in the section ‘Is that a Column or a Field?’ we looked at the translation of compound values into single or multiple fields. Using a comma to separate the potential values means that we could split the value back out. If the field had instead been correctly translated, either to boolean columns or to a linked table, it would be easier to translate back again into a whole variety of formats.

When representing a compound type, think about how you would reverse the structure so that it could be used the other way round. For example, if you decide to dump the information out to another table or column structure, make sure that you add identifiers so that you can track the record it came from (which you’ll probably need anyway), and can reformat it back into that format. If you’ve split it out into multiple rows in a columnar store, make sure you know how to combine it back and deduplicate the information again if you need to send it the other way.
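For example, here is a sketch of folding the repeated keyword rows from the earlier columnar example back into a single compound value, deduplicating as we go; MySQL’s GROUP_CONCAT is used, and other databases offer equivalents such as LISTAGG:

-- Rebuild one comma-separated keyword value per recipe, removing duplicates
SELECT id, title,
       GROUP_CONCAT(DISTINCT kw ORDER BY kw SEPARATOR ',') AS keywords
FROM recipes_kwbytext
GROUP BY id, title;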

The compound types are the most complex single field type here because there are so many opportunities for you to mess up the translation, but the same is also true for basic structural information, and even more so if you decide that you only want to transfer a smaller number of fields of data from one database to another. Either transfer everything, or transfer what you need and include information (like a unique identifier) so that you can associate it back with the data you extracted it from. Once you’ve lost the context of the information, it can be impossible to get it back and the result is a useless dataset.

Database Metadata

We’ve concentrated very heavily on the actual data you are storing and want to work with, but what about the metadata:

  • Sequences and current auto-increment values? Whether you transfer or record these is going to depend on exactly how you want to use the information when it reaches your destination (or comes back).
  • How about the definition of the structure that you are using? Do you want to be able to share and use that? What happens when the structure changes. Do you want to track and identify those changes?
  • When doing a one-time export of information from one database to another you can be sure about the structure and what you expect to get from it. But what happens when you repeat the same export multiple times? Or when replicating?
  • If you are sharing data between two different systems and integrating them, knowing the sequence number may be irrelevant unless you can synchronize the generation of the number so that it can be used by the multiple databases in an effective manner. Perhaps your databases could use a better unique identification method, rather than relying on a monotonically increasing sequence number, such as UUIDs or using a central identifier registry?

How to address these different problems will be covered in later chapters, but it’s important to think about them here because they have a knock-on effect on other areas. For example, when moving unstructured or document-based databases into multiple separate tables, you need to identify and tie that information together, which is where a UUID becomes important; it therefore becomes a critical part of the data structure that you swap.


Data Migration: Methods

Throughout the series we will examine four distinct methods for moving and sharing information stored in databases, and each have their specific meanings, techniques and tricks to get the best out of the process. The four methods are:

  • Moving – this refers to the process of moving data to another database for a simple export or data exchange.
  • Migrating – this refers to exchanging the data to a different database to take advantage of other database features.
  • Sharing – this refers to the exchange of data where the full data set needs to be used alongside the existing database, such as analytics.
  • Integrating – this refers to the exchange of data where the data may be combined across multiple databases within the application.

In the next few posts, we’ll start to look at the core structural and data format differences that will affect any movement of information. The series will be divided into specific sections that look at the challenges for specific database types and data exchange operations, split across the four main types of data exchange: moving, migrating, sharing and integrating. Despite appearances, these four types are not the same. Each has different traps and considerations before you can make them work.

Moving Data

There are times when you simply want a copy of a selection or subset of the data so that it can be used elsewhere. A classic example is exporting information from a database for the purposes of a mail merge: the address data is extracted and moved into a format that can be used by the target software (CSV) for this single purpose.

In this situation, the movement of the data is generally temporary; that is, we’re moving the data from the actively used source database into a target database so that we can use it for a specific purpose. Once that purpose is over, the data is deleted or ignored. The source database never stops being the canonical source of the data, and we don’t care about keeping the moved data up to date with the source material; we can just perform another export of the data to achieve that.

Migrating Data

Data migration is where the information needs to be moved, wholesale, into another database system, perhaps because you have decided that you no longer want the application to use MySQL as its data store, but MongoDB. Over the life of many applications, the decision is made to move to a different underlying database system, either to take advantage of its scalability or its functionality.

As the internet explosion hit many companies, many applications were migrated entirely from their MySQL database to various NoSQL solutions in the belief that this would solve their scalability problems. They weren’t always successful, but the principle is sound. Of course, migrating the data to NoSQL is not without some serious consideration about how the data is moved to the new target.

For the migration to work, the structure, format and, most of all, the usability of the data in its new database are the primary considerations. For all of the promise of improved performance and scalability to be realised, the data must be migrated properly. Simply copying the data over and hoping the database will take care of it is not enough.

Chief among the considerations is the requirement that the migrated data has to be updateable in the easiest fashion, and has to suit the application’s needs and requirements in this respect. Unlike moving data, where the original source of the information is not removed or switched off, in a migration we have to consider what happens to keep the data updated.

Sharing Data

Occasionally you have information in your core database that you need somewhere else for a very specific purpose. When sharing data, the canonical version of the data does not change. For example, you may have a MySQL database and you want to share the data with a key/value store such as Memcached in order to improve the performance of the database. Although we might place the data into Memcached for this purpose, updates to the information are always stored within MySQL. We literally only share the data with Memcached for as long as is needed for the specific situation.

Another good and recent example is the movement of data from an existing datastore, such as Oracle or MySQL into an analytics platform such as Vertica or more recently into Hadoop. The data is moved into this platform only for the purposes of more detailed analysis than would be possible on a single transactional database, or where the data from multiple database shards is being concentrated into a larger single data store for the purposes of analysis.

Again, the data is only shared with the analytics engine – the transactional data store that holds the active records is not changed or altered, and the data is never physically removed from the source database. However, you may want to keep the information synchronised; that is, when a change happens in the source database, it must be updated in the target database.

Sharing raises some interesting problems when talking about the movement of data, mostly these are related to the ‘liveness’ of the information being transferred. The physical act of sharing the information is not complex, but doing so in a way that ensures that the information is up to date with the live database presents some issues, particularly if the process is, as with the Memcached example, designed to speed up the access to the information.

Integrating Data

There are times when you want to use multiple formats and databases of data together, whether that is within the same application, different elements of the same application, or linked or connected applications that are able to make better use of specific areas.

For example, you may store your core customer database in MySQL, but want to store a cached version of information for when the customer visits their account page in your web application within a NoSQL or even key/value store so that you have quicker, ready access to the information.

Integration also occurs when you are mixing and matching data from different sources for different purposes. For example, actually natively keeping your customer data in a NoSQL database, while storing transactional information, such as sales, in a transactional RDBMS.

Finally, building on the sharing data example above, a common deployment is to share transactional data with a big data store, perform analytics, and then combine the output to be used back in the web application. Online stores use this method to concentrate sales information from sharded transactional stores, calculate the most popular items or reviews and ratings, and then provide that information back to the web application to be displayed with the data.

The process of integration is different from the other methods; you need to consider how the two systems will work together, what information will be shared, and how you relate the information on the target system to the information on the source system. This encompasses elements of the three previous methods, plus some new ones to ensure the integration works as expected.


Extending the Tungsten Replicator Core JS Filter Functionality

Tungsten Replicator has a really cool feature in that we can filter data as it goes past on the wire.

The replicator itself is written entirely in Java, and writing filters for it in Java is not as straightforward as it looks, which is why the much better option is to use the JavaScript mechanism and write filters that way instead. I’ll save the details of how you can write filters to process and massage data for another time, but right now I wanted to find a good way of improving that JavaScript environment.

There are filters, for example, where I want to be able to load JSON option and configuration files, or write out JSON versions of information, and plenty more.

Mozilla’s Rhino JS environment is used to provide the internal JS environment for running filters. The way this is supported is that, rather than creating a Rhino JS environment that can do whatever it wants, we create a JS instance specifically for executing the required functions within the filter. One of these instances is created for each filter that is configured in the system (and for each batch instance too).

The reason we do this is that, for each filter, we want each transaction event that appears in the THL log to be passed through the JS instance, where the filter() function in the JS filter is executed with a single argument: the event data.

The limitation of this model is that we don’t get the full Rhino environment, because we execute the JS function directly, so certain top-level items and functions like load() or require(), or utilities like JSON.stringify(), are not available. We could add these by changing the way we do the configuration, but that could start to get messy quickly, while also complicating the security aspects of how we execute these components.

There are some messy ways in which we could get round this, but in the end, because I also wanted to add some general functionality into the filters system shared across all JS instances, I chose instead to just load a set of utility functions, written in JavaScript, into the JS instance for the filter. The wonderful thing about JS is that we can write all of the functions in JS, even for classes, methods and functions that aren’t provided elsewhere.

So I chose the path of least resistance, which means loading and executing a core JS file before loading and executing the main filter JS. We can place into that core JS file all of the utility functions we want to be available to all of the filters.

So, to enable this, the first thing we do is update the core Java code so that, when we load the filter JS, we load our core utility JS first. That occurs in replicator/src/java/com/continuent/tungsten/replicator/filter/JavaScriptFilter.java, within the prepare() function, which is where we instantiate the JS environment based on the code.

String coreutilssrc = properties.getString("replicator.filter.coreutils");

// Import the standard JS utility script first
try
 {
 // Read and compile the core script functions
 BufferedReader inbase = new BufferedReader(new FileReader(coreutilssrc));
 script = jsContext.compileReader(inbase, scriptFile, 0, null);
 inbase.close();

 script.exec(jsContext, scope);
 }
catch (IOException e)
 {
 throw new ReplicatorException("Core utility library file not found: "
 + coreutilssrc, e);
 }
catch (EvaluatorException e)
 {
 throw new ReplicatorException(e);
 }

This is really straightforward, we obtain the path to the core utilities script from the configuration file (we’ll look at how we define that later), and then compile that within the jsContext object, where our JavaScript is being executed. We add some sensible error checking, but otherwise this is simple.

It’s important to note that this is designed to load that core file *before* the main filter file just in case we want to use anything in there.

Next, we can add that configuration line into a default config by creating a suitable ‘template’ file for tpm, which we do by creating the file replicator/samples/conf/filters/default/coreutils.tpl. I’ve put it into the filters section because it only applies to filter environments.

The content is simple, it’s the line with the location of our core utility script:

# Defines the core utility script location
replicator.filter.coreutils=${replicator.home.dir}/support/filters-javascript/coreutils.js

And lastly, we need the script itself, replicator/support/filters-javascript/coreutils.js :

// Core utility JavaScript and functions for use in filters
//
// Author: MC Brown (9af05337@opayq.com)


// Simulate the load() function to load additional external JS scripts

function load(filename) {
    var file = new java.io.BufferedReader(new java.io.FileReader(new java.io.File(filename)));

    var sb = "";
    while((line = file.readLine()) != null)
        {
            sb = sb + line + java.lang.System.getProperty("line.separator");
        }

    eval(sb);
}

// Read a file and evaluate it as JSON, returning the evaluated portion

function readJSONFile(path)
{
    var file = new java.io.BufferedReader(new java.io.FileReader(new java.io.File(path)));

    var sb = "";
    while((line = file.readLine()) != null)
        {
            sb = sb + line + java.lang.System.getProperty("line.separator");
        }

    jsonval = eval("(" + sb + ")");

    return jsonval;
}

// Class for reconstituting objects into JSON (and back)

JSON = {
    parse: function(sJSON) { return eval('(' + sJSON + ')'); },
    stringify: (function () {
      var toString = Object.prototype.toString;
      var isArray = Array.isArray || function (a) { return toString.call(a) === '[object Array]'; };
      var escMap = {'"': '\\"', '\\': '\\\\', '\b': '\\b', '\f': '\\f', '\n': '\\n', '\r': '\\r', '\t': '\\t'};
      return function stringify(value) {
        if (value == null) {
          return 'null';
        } else if (typeof value === 'number') {
          return isFinite(value) ? value.toString() : 'null';
        } else if (typeof value === 'boolean') {
          return value.toString();
        } else if (typeof value === 'object') {
          if (typeof value.toJSON === 'function') {
            return stringify(value.toJSON());
          } else if (isArray(value)) {
            var res = '[';
            for (var i = 0; i < value.length; i++)
              res += (i ? ', ' : '') + stringify(value[i]);
            return res + ']';
          } else if (toString.call(value) === '[object Object]') {
            var tmp = [];
            for (var k in value) {
              if (value.hasOwnProperty(k))
                tmp.push(stringify(k) + ': ' + stringify(value[k]));
            }
            return '{' + tmp.join(', ') + '}';
          }
        }
        return '"' + value.toString() + '"';
      };
    })()
  };

For the purposes of validating my process, there are three functions:

  • load() – which loads an external JS file and executes it, so that we can load other JS scripts and libraries.
  • readJSONFile() – which loads a JSON file and returns it as a JSON object.
  • JSON class – which does two things: one is to provide a JSON.parse() method for parsing strings as JSON into JS objects, and the other is JSON.stringify(), which will turn a JS object back into JSON.

Putting all of this together gives you a replicator where we now have some useful functions to make writing JavaScript filters easier. I’ve pushed all of this up into my fork of the Tungsten Replicator code here: https://github.com/mcmcslp/tungsten-replicator/tree/jsfilter-enhance

Now, one final note. Because of the way load() works, in terms of running an eval() on the code to import it, it does mean that there is one final step to make functions useful. To explain what I mean, let’s say you’ve written a new JS filter using the above version of the replicator.

In your filter you include the line:

load("/opt/continuent/share/myreallyusefulfunctions.js");

Within that file, you define a function called runme():

function runme()
{
     logger.info("I'm a bit of text");
}

Now within myreallyusefulfunctions.js I can call that function fine:

runme();

But from within the JS filter, runme() will raise an unknown function error. The reason is that we eval()‘d the source file within the load() function, and so its context is wrong.

We can fix that within myreallyusefulfunctions.js by exporting the name explicitly:

if (runme.name) this[runme.name] = runme;

This exposes runme() to the parent namespace from within this context; we put that line at the end of the myreallyusefulfunctions.js script and everything is fine.

I’m lazy, and I haven’t written a convenient function for it, but I will in a future blog.

Now we’ve got this far, let’s start building some useful JS functions and functionality to make it all work nicely…


Data Migration: Database Terms and Structures

In the previous post we looked at a number of different database types and solutions, and it should be clear that there is a huge range of different terms for the different entities that make up the database structure. All the different entities fit into one of four categories, and they matter because, when moving and migrating data, you need to know the source and destination types and whether you should be creating a database for every document (bad) or a document for every record (good). The components can be described as shown in Figure 1-6.


Figure 1-6: Database Terms and Structures

Most databases support the notion of four different components:

  • Field – generally the smallest piece of addressable data within any database. However, not all databases identify information down to the field level. Others don’t even recognise fields at all.
  • Record – a group of fields, or, a single block of identifiable information. For example, your contact information is a record made of the fields that define your name, your address, and your email address. Some databases only support the notion of a block of information and don’t care what it contains, whether that is fields or a binary string of data. Records may also involve either a fixed set of fields, or a variable group.
  • Table – a group of records. Some databases assign a specific group of fields to a specific table. Others just use a table to hold or identify a collection of records with largely similar information. Some database types, such as NoSQL, do not support a table, but immediately jump from record to database.
  • Database – a group of tables. Not all databases support this additional level of organisation, and in fact it tends to be those that have a significant structure at the lower levels (field, record). The database is usually used in the role of multi-tenancy, that is, the ability to store a collection of data related to a single application.

Of course, the problem is that different databases apply and support these terms differently, many use different terms, and some may blur the lines between each term to such an extent that it is impossible to tell where the different elements exist.

Let’s explain this a little further by providing some explicit examples:

  • MySQL, Oracle database, IBM DB2, Microsoft SQL Server, Microsoft Access, and other relational databases tend to support all four levels with a very rigid structure in place, as you would expect from a structured RDBMS.
  • Memcached knows only records (values) identified by a supplied key, and those records have no fields.
  • CouchDB, MongoDB and Couchbase support different databases, and within those databases you have documents, which are logically similar to records. These documents have fields, but there is no requirement for the fields within each document to be the same from document to document. MongoDB also supports collections, which are akin to tables.
  • Hadoop, in its bare Hadoop Distributed File System (HDFS) native structure, doesn’t understand anything, although you can place files into different directories to mimic a structure. If you use a system on top of HDFS, such as Hive, HBase or Impala, you are normally implying a typical four-level data architecture.

In general, the ability to identify different components within the database depends on the database type, and a summary of these is provided in the table below.

Database type     Fields  Records         Tables  Databases
RDBMS             Yes     Yes             Yes     Yes
NewSQL            Yes     Yes             Yes     Yes
NoSQL             Mostly  Documents/Rows  Maybe   Yes
Key/Value Stores  No      Yes, by ID      No      Maybe
Unstructured      No      No              No      Maybe

Now let’s have a look at the specific example database solutions, including the term used for the corresponding value:

Database Type  Database     Fields                    Records                 Tables                    Databases
RDBMS          Oracle       Yes                       Yes                     Yes                       Yes
RDBMS          MySQL        Yes                       Yes                     Yes                       Yes
RDBMS          PostgreSQL   Yes                       Yes                     Yes                       Yes
NewSQL         InfiniDB     Yes                       Yes                     Yes                       Yes
NewSQL         TokuDB       Yes                       Yes                     Yes                       Yes
NoSQL          CouchDB      Yes, embedded in JSON     Documents               No                        Yes
NoSQL          Couchbase    Yes, embedded in JSON     Documents               No                        Buckets
NoSQL          MongoDB      Yes, embedded in BSON     Documents               Collections               Yes
NoSQL          Cassandra    Implied in column family  Yes, implied by key ID  Implied in column family  No
NoSQL          HBase        Implied in columns        Yes                     Implied in column family  No
Key/Value      Memcached    No                        Yes, by key ID          No                        Maybe
Key/Value      Redis        Yes                       Yes, key/value pair     No                        No
Key/Value      Riak         Yes                       Yes                     Schema                    No
Unstructured   Hadoop/HDFS  No                        No                      No                        By HDFS directory
Unstructured   Hive         Yes, if implied           Yes, if implied         Yes                       Yes

 

Although it won’t be covered in this series to any significant degree, these different levels also tend to support one further distinction, and that is security. Different database solutions provide security at a variety of levels, and some allow you to restrict access down to the record level. Database systems that support multiple databases with some level of security or protection between them are called multi-tenant databases.

As we start moving the data between databases, understanding the importance of these elements is critical. For example, when moving data from an RDBMS to Hadoop, the distinction of table or database may disappear, and the significance of individual records may be deliberately removed entirely to enable the information to be processed effectively.

In contrast, moving data from MongoDB into MySQL is easier because we can identify specific elements such as a database and a table. Where we start to come unstuck is that although documents contain a collection of fields, they may not contain the same fields across each document.
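
To make that concrete, here is a minimal sketch of flattening MongoDB documents into MySQL rows, assuming the Node.js 'mongodb' and 'mysql2' drivers; the database, collection, table and column names are all made up. The key decision is which fields become columns, and what to do when a document doesn’t carry one of them (here we simply insert NULL).

var { MongoClient } = require('mongodb');
var mysql = require('mysql2/promise');

async function copyContacts() {
  var mongo = await MongoClient.connect('mongodb://localhost:27017');
  var conn  = await mysql.createConnection({ host: 'localhost', user: 'demo', database: 'crm' });

  // Decide up front which fields become columns; anything else is ignored
  // (or could be stashed in a catch-all JSON column).
  var docs = await mongo.db('crm').collection('contacts').find({}).toArray();
  for (var doc of docs) {
    await conn.execute(
      'INSERT INTO contacts (id, name, email, phone) VALUES (?, ?, ?, ?)',
      [ String(doc._id), doc.name || null, doc.email || null, doc.phone || null ]
    );
  }

  await conn.end();
  await mongo.close();
}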

Homogeneous vs. Heterogeneous

The primary issue with exchanging information is whether you are moving data between homogeneous or heterogeneous databases. Homogeneous databases are those of the same type, for example, moving data from Oracle to MySQL; both are RDBMSs, both have databases, tables, records and fields, and therefore moving data between them is straightforward from a structural perspective. But the datatypes supported are not the same. What do you do about CLOB or RAW datatypes in Oracle when migrating to MySQL?
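
As a hint of the kind of decisions involved, here is one commonly used (but by no means definitive) mapping of a few Oracle types onto MySQL equivalents; always check sizes, precision and semantics column by column rather than relying on a blanket rule.

// Illustrative only: one reasonable Oracle-to-MySQL type mapping.
var oracleToMySQL = {
  'VARCHAR2': 'VARCHAR',
  'NUMBER':   'DECIMAL',    // or INT/BIGINT where the scale is zero
  'DATE':     'DATETIME',   // Oracle DATE carries a time component
  'CLOB':     'LONGTEXT',
  'BLOB':     'LONGBLOB',
  'RAW':      'VARBINARY'
};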

In a similar vein, the procedural process of moving data between database types is affected in the same way. MongoDB and Couchbase, for example, support the same structure; JSON and BSON are largely identical, and although there are some differences, reading the data from MongoDB and writing it to Couchbase can be achieved with functions that are almost identical: get the document by its ID on MongoDB and set the document on Couchbase with the same ID.
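
A minimal sketch of that document-to-document copy, assuming the Node.js 'mongodb' driver and the Couchbase Node.js SDK 3.x; the cluster credentials, bucket and collection names are made up:

var { MongoClient } = require('mongodb');
var couchbase = require('couchbase');

async function copyDocuments() {
  var mongo   = await MongoClient.connect('mongodb://localhost:27017');
  var cluster = await couchbase.connect('couchbase://localhost',
                                        { username: 'demo', password: 'secret' });
  var target  = cluster.bucket('contacts').defaultCollection();

  // Both sides speak "document with an ID", so the copy is almost 1:1:
  // read each document from MongoDB and upsert it under the same key.
  var docs = await mongo.db('crm').collection('contacts').find({}).toArray();
  for (var doc of docs) {
    var key = String(doc._id);
    delete doc._id;              // Couchbase keeps the key outside the document body
    await target.upsert(key, doc);
  }

  await mongo.close();
  await cluster.close();
}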

Most RDBMSs can be accessed through SQL and front-ends like JDBC or ODBC, so opening two connections and reading from one while writing to the other is easy to do. Most also support SELECT INTO and LOAD DATA INFILE style SQL to export and import data in larger chunks. But in heterogeneous deployments the same tools are not always available. A quick, but not always accurate, description of these elements across different databases is shown in the table below.

Issue           Homogeneous  Heterogeneous
Data structure  No           Yes
Data types      Yes          Yes
Data loading    No           Yes
Data usability  Yes          Yes

Defining the Problem

Now that we have a good grasp of the different databases, their abilities, and their differences, it is time to take a closer look at what we mean by moving and migrating data, the problems associated with this kind of operation, and how that process can be tackled and resolved.

All of the following aspects must be considered in their entirety before you start to exchange data, but think about it logically and holistically: you have to decide how the data will be formatted, how it is going to be structured, how it is physically going to be transferred, and finally how it is going to be used.

Altering the Format

Not all data is created the same, or in the same format, and furthermore not every type of data is supported or even acknowledged by every database. Within NoSQL, for example, there may be no datatypes other than string, so you need to consider how you are going to move the data to the right type and the right format without (unnecessarily) losing data. The main considerations are:

Differences in supported types – you may have to migrate to the nearest, or most appropriate, type. NoSQL and Big Data targets tend not to have strong datatypes, whereas RDBMS databases have very strong typing. You must choose a type that can handle the data in the way you want and hold the size of the information being inserted. Large text data, for example, may be too long to fit in a CHAR or VARCHAR column, and may need to be inserted into a BLOB or RAW column.

Differences in type definitions – databases have different definitions of different types. For example, Amazon Redshift supports only 19 digits of precision for floating-point values, while MySQL supports up to 53. Dates and times are also typically represented differently, with some databases only supporting an explicit date type, some a combined date and time, and some a time with heavily restricted precision. All these differences mean that you may wish to store values outside the given range as a different type; for example, storing dates, timestamps or high-precision values as strings so as not to lose data.

Differences in type interpretation – generally a difficult problem to resolve without extensive testing; some datatypes can be interpreted incorrectly when the data is moved into a target database. String encodings (for example ASCII versus Unicode) or bit-specific fields can cause issues, as can timestamps, which may be reinterpreted during import according to the server’s time zone; for example, if you exported on a server using Pacific Standard Time (PST) but imported into a database using Central European Summer Time (CEST).
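
One common way to sidestep both of the last two problems is to export awkward values in an unambiguous textual form and only convert them back where it is safe to do so. A minimal sketch of that idea:

// Render every timestamp as an ISO 8601 UTC string before export, so the
// importing server's time zone setting can no longer reinterpret it.
function exportTimestamp(value) {
  return new Date(value).toISOString();   // e.g. '2016-02-21T13:45:18.000Z'
}

// High-precision numbers that exceed the target's precision can be exported
// as strings too, and parsed back only where the target type can hold them.
function exportDecimal(value) {
  return String(value);
}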

These issues must be considered in entirety before you exchange data; getting it wrong could lead to incorrect, invalid, and even completely corrupt information.

Altering the Structure

It should be clear by now that there are differences in the structure of the different database types. What may not be clear is that there are more options available to you than a simple direct association from one type to another. Instead you must make sure that the data is exchanged in an effective manner appropriate to the information that is being exchanged.

For certain combinations the structure may appear obvious, but there is always the possibility that the structure and information can be more effectively organised. For example, when moving from an RDBMS to a document store, the first intention is simply to take the different tables and structure them as different documents within the target database. This is fine, but it adds complications you may want to avoid when you come to use the data. Instead, merging the different tables into one larger document with nested components may simplify the use of the data in the target application.
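
Here is a minimal sketch of that merge, assuming the Node.js 'mysql2' driver and made-up customer and order tables: the 1:N relationship between the two tables becomes an embedded array in a single document.

var mysql = require('mysql2/promise');

async function customerToDocument(customerId) {
  var conn = await mysql.createConnection({ host: 'localhost', user: 'demo', database: 'shop' });

  var [customers] = await conn.execute(
    'SELECT id, name, email FROM customers WHERE id = ?', [customerId]);
  var [orders] = await conn.execute(
    'SELECT id, ordered_at, total FROM orders WHERE customer_id = ?', [customerId]);

  await conn.end();

  // One nested document instead of two joined tables: the target application
  // can now fetch everything about a customer with a single get by key.
  return {
    _id:    'customer:' + customers[0].id,
    name:   customers[0].name,
    email:  customers[0].email,
    orders: orders
  };
}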

The same can be true in reverse, exploding a single document into multiple related tables. Alternatively, you may want to take advantage of specific functionality in the RDBMS, such as XML fields, sets, enums, or even convert the information to embedded JSON or serialised language variables if that makes sense for your application.

Loading the Information

Physically transferring the information seems like the most mundane of the processes in the entire scheme of exchanging data between systems, but in actual fact it is less clear-cut than you might think. We’ll look at this in more detail when examining specific examples and database exchange projects, but there are some upfront issues to consider:

Does the solution include a native bulk-loading system? Some databases specifically support a method of importing data, whether large or small. For example, in MySQL the LOAD DATA INFILE SQL statement can do this for you, Cassandra supports a COPY command in CQL, and various Hadoop interfaces such as HBase and Hive enable you to access CSV files directly without explicitly importing them.
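
As a minimal sketch (assuming the Node.js 'mysql2' driver, a CSV file that already sits on the database server, and a made-up table name), the MySQL route looks something like this; the LOCAL variant, which reads the file from the client, needs to be enabled on both client and server:

var mysql = require('mysql2/promise');

async function bulkLoad() {
  var conn = await mysql.createConnection({ host: 'localhost', user: 'demo', database: 'crm' });
  // Server-side bulk load of a CSV file straight into the table.
  await conn.query(
    "LOAD DATA INFILE '/var/lib/mysql-files/contacts.csv' " +
    "INTO TABLE contacts FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'"
  );
  await conn.end();
}

// The Cassandra equivalent is usually run from cqlsh rather than application
// code, for example:  COPY crm.contacts (id, name, email) FROM 'contacts.csv';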

Custom loading may be required if no built-in solution exists. This can take many forms, including writing your own, or, if they are available, using specialised tools like Tungsten Replicator or Sqoop. The exact method is going to depend on the data exchange type, data size, and complexity of the load process.

Application loading can be used in those situations where the application is running and a different version or format of the information is used. For example, when caching with a NoSQL engine on top of an RDBMS, you might adapt your application to automatically generate the NoSQL record. Similarly, during a migration, you might configure your application to look in the new database first and, if the record doesn’t exist there, load it from the old database and generate the new record.
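
A minimal sketch of that pattern; 'cache' and 'db' here are hypothetical helpers standing in for your NoSQL client and your RDBMS client respectively:

async function getCustomer(cache, db, id) {
  // Look in the new store first...
  var customer = await cache.get('customer:' + id);
  if (customer) {
    return customer;
  }

  // ...fall back to the old database, then write the record into the new
  // store so the next read is served from there.
  customer = await db.queryOne('SELECT id, name, email FROM customers WHERE id = ?', [id]);
  if (customer) {
    await cache.set('customer:' + id, customer);
  }
  return customer;
}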

Data sizes must be a consideration. It seems ridiculous in this age when disk sizes are so large, but database sizes can be huge too. A recent project I was involved in required migrating just under 150TB of information. Storing all of that data in one go would have required a mammoth-sized disk array before the data was loaded into a Hadoop/Hive database. There are solutions for moving and migrating such large volumes of data without it ever touching the disk and using up all that space.

Depending on your data exchange requirements, any, or all of these may be an issue you have to contend with.

Making the Data Usable

Exchanging data between systems is only any good if, once there, the data is usable. Nobody would consider releasing a physical book in the USA and a digital book in France without translating it. The same is true of data. Exchanging the data between databases requires you to take these issues into account during the movement of the data; it’s no good just blindly copying the data over and hoping it will be usable.

To make the data usable the following aspects must be considered:

  • Data accessibility – we’ve already talked about the key structural translation that needs to take place, but you also need to think about the effect on elements such as searching and indexing. Certain indexing methods are more complex (and therefore computationally expensive) than others; some are more efficient. Some database environments support only a limited number or complexity of indexes and queries, which can only be addressed if the format and structure of the data is correct to begin with.
  • Data validity – if you change the structure of the data, does that change the ability to validate or otherwise ensure the quality of the information? For example, moving from an RDBMS to NoSQL you may lose the ability to single out duplicate entries for certain types and fragments of the dataset, because relational constraints are not enforced within non-relational databases. Data format differences may also present problems; in a NoSQL database, for example, strict database types such as dates, times or numbers do not exist. How do you prevent an invalid date, or worse a non-date value, being stored in what is supposed to be a date field, when the database write would previously have caught it? (A short sketch of application-level validation follows this list.)
  • Application usability – if the data is moved, can you still access and update it in the same way? RDBMSs tend to be transactional, providing stability and consistency; NoSQL databases, as a rule, are not, particularly across multiple silos. If an invoice is updated, how do I guarantee that the customer’s account is also updated, especially if one operation, or the database itself, fails during the process?
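
On the data validity point, here is a minimal sketch of pushing that check into the application once the target database can no longer enforce it (the function name is made up):

// Reject anything that is not a real date before it is written into the
// schemaless target, and normalise what we do store.
function assertValidDate(value, fieldName) {
  var parsed = new Date(value);
  if (isNaN(parsed.getTime())) {
    throw new Error('Invalid date for field ' + fieldName + ': ' + value);
  }
  return parsed.toISOString();
}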

These are some, but not all, of the issues you need to be aware of. Regardless of the actual method though, you want to actually use the data at the end, so don’t forget how you might query or index the data once it’s moved. Keep in mind that not all data moves require heavy consideration. If you are exporting the data to be loaded for a mail merge for example, the usability aspects may be minor compared to the format and quality of the information.


How to Buffer posts+hashtags from your Blog using Zapier

I try to automate as much of my life as possible, particularly when it comes to computers.

I’ve been using the automated ‘Social Sharing’ on WordPress.com (and indeed, my blogs in general) for years. However, I’m also a keen Buffer user and WordPress.com does not offer a Buffer connection. Because I also use Buffer to handle my Patreon posts, concentrating them all in one place would make things a lot easier.

What I wanted to do was something quite straightforward: I wanted to turn a blog post entry into a post to Twitter (and others) that turned the list of tags I created on the post into #hashtags. This doesn’t seem like a particularly complex or uncommon request, but apparently it’s not a standard offering. What I was even more surprised at was that nobody else seemed to have done the same, which has me confused…

Now, there are many options for doing this kind of automated posting. I could have used IFTTT, but IFTTT, while incredibly useful (I have about 60 recipes on there), is also incredibly simplistic and your options are limited. That means I can’t post from WordPress to Buffer with the required hashtags.

Zapier is very similar to IFTTT, but it has the option of running multi-step Zaps that do more than one thing (IFTTT is limited to one target); better still, you can include a step that runs information through a JavaScript (or Python) script to do some additional processing.

And this is the key that enables me to do precisely what I need, take a blog post from one of my blogs, process the list of tags into a list of (de-duplicated) hashtags, and then post it into my Buffer queues.

So, here’s how to get Zapier to do what you need; there are five steps to this:

  1. Identify when a new post appears on a WordPress blog
  2. Run a short JavaScript program to turn the list of tags (actually Terms) from the blog post into a de-duplicated, hashtagged version
  3. Add it to my Twitter Buffer
  4. Add it to my Facebook Buffer
  5. Add it to my LinkedIn Buffer

Here’s how to get it set up. I’m going to assume you know Zapier and can follow the onscreen instructions; it’s not that complex.

Step 1

  • Register for a Zapier account, if you don’t already have one.
  • Connect your Zapier account to your WordPress blog
  • Connect your Zapier account to your Buffer account

Step 2

Create a new Zap on Zapier.

Select ‘WordPress’ as your trigger app.

Screenshot 2016-02-21 13.45.18.png

Now configure how you want the trigger to occur. I basically want every post in every category, but if you want to add specific categories or other filtering, feel free.

Step 3

For the Action select ‘Code </>’

Screenshot 2016-02-21 13.45.28.png

Now select ‘Javascript’.

Screenshot 2016-02-21 13.45.34.png

When it gets to the Edit Template stage, you’ll need to specify the input variable for the JavaScript. In this case, create one called ‘tags’, then select the ‘Terms Name’ field from the WordPress trigger (Step 1), and you’ll be ready to go.

Screenshot 2016-02-21 13.45.40.png

The variables that you select here are placed into a hash (associative array) in the JavaScript context called ‘input’, so in this case we’ll have the item ‘input.tags’ to parse in our JavaScript code. The actual list of terms will come through as a comma-separated string.

The code itself is quite straightforward:

var hashlist = {};

input.tags.split(',').forEach(function(item,index)
{
  var res = item.replace(/ /g,'');
  res = res.toLowerCase();
  res = '#' + res;
  hashlist[res] = 1;
});
return({'hashlist' : Object.keys(hashlist).join(' ')});

We iterate over the terms by using ‘split’ to separate on commas, then we replace any spaces with nothing (so we turn things like ‘data migration’ into ‘datamigration’), convert it to lower case, add the ‘#’ prefix, and add that to a new associative array. The reason for the associative array is to get rid of duplicates, so even if we have ‘data migration’ and ‘datamigration’ in the input, we only get one in the output. This is particularly useful because the ‘Terms’ list from WordPress is actually composed of both the tags and the categories for each post.

Finally, we return all of that as a string with all the keys of the hash (i.e. our nicely formatted tags) separated by a space. However, just like the input value, we return this as an object with the string assigned to the field ‘hashlist’. We’ll need this when creating the Buffer post.

I recommend you test this thoroughly and make sure you check the output.
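
One quick way to do that outside Zapier is to paste the same logic into node with a sample ‘input’ value (the tags here are just examples) and inspect what comes out:

var input = { tags: 'Data Migration,datamigration,Zapier,Articles' };

var hashlist = {};
input.tags.split(',').forEach(function(item) {
  var res = '#' + item.replace(/ /g, '').toLowerCase();
  hashlist[res] = 1;
});
console.log(Object.keys(hashlist).join(' '));
// -> #datamigration #zapier #articles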

Step 4

Choose your target Buffer.

The Buffer API only allows you to post to one queue at a time, but brilliantly, Zapier lets us add multiple steps, so we can do one for each Buffer queue, in my case three of them. The benefit of this is that I can customise and tune the text and format for each. So, for example, I could omit the tags on Facebook, or, as I do, give a nice intro to the message (‘Please read my new blog post on…’) on FB because I’m not character (or attention span) limited.

Now for each Buffer queue, create your message, and when it comes to choosing the output, make sure you select your JavaScript output (which will be Step 2) and the ‘hashlist’ value.

Step 5

That’s it! Test it, make sure your posts are appearing, and check your Buffer queue (deleting the entries if required so you don’t double-post items).

You can duplicate and use this as many times as you like, in fact I’ve done this across my two blogs and am now looking into where else I can use the same method.