Data Migration: Methods

Throughout the series we will examine four distinct methods for moving and sharing information stored in databases, and each have their specific meanings, techniques and tricks to get the best out of the process. The four methods are:

  • Moving – this refers to the process of moving data to another database for a simple export or data exchange.
  • Migrating – this refers to exchanging the data to a different database to take advantage of other database features.
  • Sharing – this refers to the exchange of data where the full data set needs to be used alongside the existing database, such as analytics.
  • Integrating – this refers to the exchange of data where the data may be combined across multiple databases within the application.

In the next few posts, we’ll start to look at the core structural and data format differences that will affect any movement of information, the series will be divided into specific sections that look at the challenges for specific database types and data exchange operations. These are divided into four main types of data exchange, moving, migrating, sharing and integrating. Despite appearances, these four types are not the same. Each has different traps and considerations before you can make them work.

Moving Data

There are times when you simply want a copy of a selection or subset of the data so that it can be used elsewhere. Classic examples are exporting information from a database for the purposes of a mail merge – the address data is extracted and moved into a format that can be used by the target software (CSV) for this single purpose, for example.

In this situation, the movement of the data is generally temporary; that is, we’re moving the data from the actively used source database into a target database so that we can use it for a specific purpose. Once that purpose is over, the data is deleted or ignored. The source database never stops being the canonical source of the data, and we don’t care about keeping the moved data up to date with the source material; we can just perform another export of the data to achieve that.

Migrating Data

Data migration is where the information needs to be moved, wholesale, into another database system, perhaps because you have decided that you no longer want the application to use MySQL as it’s data store, but MongoDB. Over the life of many different applications the decision is made to move it to a different underlying database system, either to take advantage of it’s scalability or functionality.

As the internet explosion hit many companies, many applications were migrated entirely from their MySQL database to various NoSQL solutions in the belief that this would solve their scalability problems. They weren’t always successful, but the principle is sound. Of course, migrating the data to NoSQL is not without some serious consideration about how the data is moved to the new target.

For the migration to work, the structure, format and most of all the usability of the data in it’s new database are the primary considerations. For all of the promise of improved performance and scalability to be realised, the data must be migrated properly. Simply copying the data over and hoping the database will take of it is not enough.

Of all the considerations should be the requirement that the migrated data has to be updateable in the easiest fashion, and has to suit the application needs and requirements in this respect. Unlike moving data, where the original source of the information is not removed or switched off, in a migration we have to consider what happens to keep the data updated.

Sharing Data

Occasionally you have information in your core database that you need somewhere else for a very specific purpose. When sharing data, the canonical version of the data does not change. For example, you may have a MySQL database and you want to share the data with a key/value store such as Memcached in order to improve the performance of the database. Although we might place the data into Memcached for this purpose, updates to the information are always stored within MySQL. We literally only share the data with Memcached for as long as is needed for the specific situation.

Another good and recent example is the movement of data from an existing datastore, such as Oracle or MySQL into an analytics platform such as Vertica or more recently into Hadoop. The data is moved into this platform only for the purposes of more detailed analysis than would be possible on a single transactional database, or where the data from multiple database shards is being concentrated into a larger single data store for the purposes of analysis.

Again, the data is only shared with the analytics engine – the transactional data store that holds the active records is not changed or altered, and the data is never physically moved from the source database. However, you may want to keep the information synchronised; that is, when a change happens in the source database, it must be updated in the target database

Sharing raises some interesting problems when talking about the movement of data, mostly these are related to the ‘liveness’ of the information being transferred. The physical act of sharing the information is not complex, but doing so in a way that ensures that the information is up to date with the live database presents some issues, particularly if the process is, as with the Memcached example, designed to speed up the access to the information.

Integrating Data

There are times when you want to use multiple formats and databases of data together, whether that is within the same application, different elements of the same application, or linked or connected applications that are able to make better use of specific areas.

For example, you may store your core customer database in MySQL, but want to store a cached version of information for when the customer visits their account page in your web application within a NoSQL or even key/value store so that you have quicker, ready access to the information.

Integration also occurs when you are mixing and matching data from different sources for different purposes. For example, actually natively keeping your customer data in a NoSQL database, while storing transactional information, such as sales, in a transactional RDBMS.

Finally, building on the sharing data example above, a common deployment is to share transactional data with a big data store, perform analytics, and then combine the output to be used back in the web application. Online stores use this method to concentrate sales information from sharded transactional stores, calculate the most popular items or reviews and ratings, and then provide that information back to the web application to be displayed with the data.

The process of integration is different to other systems; you need to consider how the two systems will work together, what information will be shared, and how you relate the information on the target system to the information on the source system. This encompasses elements of the three previous methods, plus some new ones to ensure the integration works as expected.