Anonymizing Data During Replication

If you happen to work with personal data, chances are you are subject to SOX (Sarbanes-Oxley) whether you like it or not.

One of the worst aspects of this is that if you want to be able to analyse your data and you replicate out to another host, you have to find a way of anonymizing the information. There are of course lots of ways of doing this, but if you are replicating the data, why not anonymize it during the replication?

Of the many cool features in Tungsten Replicator, one of my favorites is filtering. This allows you to process the stream of changes that are coming from the data extracted from the master and perform operations on it. We use it a lot in the replicator for ignoring tables, schemas and columns, and for ensuring that we have the correct information within the THL.

Given this, let’s use it to anonymize the data as it is being replicated so that we don’t need to post-process it for analysis, and we’re going to use JavaScript to do that.

For the actual anonymization, we’re going to use a simple function that devolves the content into an anonymizer. For this, I’m going to use the md5 in JavaScript function provided by Paul Johnston here: http://pajhome.org.uk/crypt/md5, although others are available. The main benefit of using md5 is that the same string of text will always be hashed into a consistent value. This is important because it means that we can still run queries with joins between the data, knowing that, for example, the postcode ‘XQ23 1LD’ will be based into ‘d41d8cd98f00b204e9800998ecf8427e’ in every table. Joins still work, and data analysis is entirely valid.

Better still, because it’s happening during replication I always have a machine with anonymised information that I can use to query my data without worrying about SOX tripping me up.

Within the JS filter environment there are two key functions we need to use. One is prepare(), which is called when the replicator goes online, and the other is filter() which processes each event within the THL.

In the prepare() function, I’m going to identify from the configuration file which fields from the configuration file we are going to perform the actual hashing operation on. We do that by creating a hash structure within JavaScript that maps the schema, table name, and field. For example, to anonymise the field ‘postcode’ in ‘address’ in the schema ‘customers’:

stcspec=customers.address.postcode

For this to work, we must have the colnames filter enabled.

The function itself just splits the stcspec parameter from the configuration file into a hash of that combo:

var stcmatch = {};
function prepare()
{
  logger.info("anonymizer: Initializing...");  

    stcspec = filterProperties.getString("stcspec");
    stcarray = stcspec.split(",");
    for(i=0;i<stcarray.length;i++) { 
        stcmatch[stcarray[i]] = 1;
    }

}

The filter() function is provided one value, the event object from the THL. We operate only on ROW-based data (to save us parsing the SQL statement), and then supply the event to an anonymize() function for the actual processing:

function filter(event)
{
  data = event.getData();
  if(data != null)
  {
    for (i = 0; i < data.size(); i++)
    {
      d = data.get(i);

      if (d != null && d instanceof com.continuent.tungsten.replicator.dbms.StatementData)
      {
          // Ignore statements
      }
      else if (d != null && d instanceof com.continuent.tungsten.replicator.dbms.RowChangeData)
      {
          anonymize(event, d);
      }
    }
  }
}

Within the anonymise event, we extract the schema and table name, and then look at each column, and if it exists in our earlier stcspec hash, we change the content of the THL on the way past to be the hashed value, in place of the original field value. To do this we iterate over the rowChanges, then over the columns, then over the individual rows:

function anonymize(event, d)
{
  rowChanges = d.getRowChanges();

  for(j = 0; j < rowChanges.size(); j++)
  {
    oneRowChange = rowChanges.get(j);
    var schema = oneRowChange.getSchemaName();
    var table = oneRowChange.getTableName();
    var columns = oneRowChange.getColumnSpec();

    columnValues = oneRowChange.getColumnValues();
    for (c = 0; c < columns.size(); c++)
    {
      columnSpec = columns.get(c);
          columnname = columnSpec.getName();

      rowchangestc = schema + '.' + table + '.' + columnname;

      if (rowchangestc in stcmatch) {

        for (row = 0; row < columnValues.size(); row++)
        {
            values = columnValues.get(row);
            value = values.get(c);
            value.setValue(hex_md5(value.getValue()));

        }
      }
    }
  }
}

Append the md5() script from Paul Johnston (or indeed whichever md5 / hashing algorithm you want to use) to the end of the entire script text:

var stcmatch = {};
function prepare()
{
  logger.info("anonymizer: Initializing...");  

    stcspec = filterProperties.getString("stcspec");
    stcarray = stcspec.split(",");
    for(i=0;i<stcarray.length;i++) { 
        stcmatch[stcarray[i]] = 1;
    }

}

function filter(event)
{
  data = event.getData();
  if(data != null)
  {
    for (i = 0; i < data.size(); i++)
    {
      d = data.get(i);

      if (d != null && d instanceof com.continuent.tungsten.replicator.dbms.StatementData)
      {
          // Ignore statements
      }
      else if (d != null && d instanceof com.continuent.tungsten.replicator.dbms.RowChangeData)
      {
          anonymize(event, d);
      }
    }
  }
}

function anonymize(event, d)
{
  rowChanges = d.getRowChanges();

  for(j = 0; j < rowChanges.size(); j++)
  {
    oneRowChange = rowChanges.get(j);
    var schema = oneRowChange.getSchemaName();
    var table = oneRowChange.getTableName();
    var columns = oneRowChange.getColumnSpec();

    columnValues = oneRowChange.getColumnValues();
    for (c = 0; c < columns.size(); c++)
    {
      columnSpec = columns.get(c);
          columnname = columnSpec.getName();

      rowchangestc = schema + '.' + table + '.' + columnname;

      if (rowchangestc in stcmatch) {

        for (row = 0; row < columnValues.size(); row++)
        {
            values = columnValues.get(row);
            value = values.get(c);
            value.setValue(hex_md5(value.getValue()));

        }
      }
    }
  }
}

Hint

Depending on your configuration, datatypes and version, the return from getValue() is a byte array, not a character string; in that case, add this function:

function byteArrayToString(byteArray) 
{ 
   str = ""; 
   for (i = 0; i < byteArray.length; i++ ) 
   { 
      str += String.fromCharCode(byteArray[i]); 
   } 
   return str; 
}

And change:

value.setValue(hex_md5(value.getValue()));

to:

value.setValue(hex_md5(byteArrayToString(value.getValue())));

That will correctly convert it into a string.

If the error hits, Tungsten Replicator will just stop on that event, not apply bad data. Putting it ONLINE again after changing the script will re-read the event, re-process it through the filter, and then apply the data.

end Hint

Now we need to manually update the configuration. On a Tungsten Replicator slave, open the static-SERVICENAME.properties file in /opt/continuent/tungsten/tungsten-replicator/conf and then add the following lines within the filter specification area (about 90% of the way through):

replicator.filter.anonymize=com.continuent.tungsten.replicator.filter.JavaScriptFilter 
replicator.filter.anonymize.script=/opt/continuent/share/anonymizer.js 
replicator.filter.anonymize.stcspec=customers.address.postcode

The first line defines a filter called anonymize that uses the JavaScriptFilter engine, the second line specifies the location of the JavaScript file, and the third contains the specification of which fields we will change, separated by a comma.

Now, find the line containing “replicator.stage.q-to-dbms.filters” around about line 200 or so and add ‘anonymize’ to the end of the filter list.

Finally, make sure you copy the anonymizer.js script into the /opt/continuent/share directory (or wherever you want to put it that matches the paths specified above).

Now restart the replicator:

$ replicator restart

On your master, make sure you have the colnames filter-enabled. You can do this in master’s static-SERVICENAME.properties like this:

replicator.stage.binlog-to-q.filters=colnames,pkey

Now restart the master replicator:

$ replicator restart

Double check that the replicator is online using trepctl; it will fail if the config is wrong or the JavaScript isn’t found. If everything is running, go to your master and make sure you enable row-based logging (my.cnf: binlog-format=’ROW’), and then try inserting some data into the table that we are anonymizing:

mysql> insert into address values(0,'QX17 1LG');

Now check the value on the slave:

| 711 | dc889465b382 |

Woohoo!

Anonymized data now exists on the slave without having to manually run the process to clean the data.

If you want to extend the fields this is applied to, add them to the stcspec in the configuration, separating each one by a comma, and make sure you restart the replicator.