
Data Migration: Moving the Actual Data

There are two key elements to the exchange of any information between databases. One is the data structure used for the exchange, and the other is the transformation required to reach those structures.

Some of these are driven by the source database, others by the target database. For example, moving data from an RDBMS to a NoSQL database generally requires constructing documents from what might be tabular, or joined-tabular, data. This may involve join operations on the relational side, as well as formatting on the NoSQL side. The eventual aim is to ensure that the data reaches the target database in the right format, without corruption, and in a form that is most appropriate or efficient for how it will be used. That ultimately depends on what you are using the transferred data for.

The other aspect is the difference between source and target data types – that is, the format and construction of the individual fields or components of the data. Document databases and Big Data stores tend not to care about the data type, whereas an RDBMS cannot live without them.

Some important considerations for how we use this information:

  • The data structure must be retained (i.e., we must be able to identify the fields, columns or other elements of the data).
  • The data format and integrity must be maintained (the data should not be corrupted, shortened or reduced in any way).
  • The data must be able to be efficiently transferred (sending a 1GB file that only contains 15KB of valid information is not efficient).

In this chapter we’ll examine some of the key differences and problems with transferring data that transcend the mechanics of the process, and how to deal with them effectively. Although the three primary aspects (basic formatting, structural comparisons and data type limitations) are handled separately here, there are few occasions when you can truly treat these elements individually. We’ll see some examples of this as we go through.

Basic Interchange Formats

When you start to move data between the different database types that exist, you will find a number of different possible interchange formats. The primary issue with all of them is exactly how efficiently, and more importantly how accurately, they enable the information to be exchanged.

First, the data will need to have been encoded with some kind of encapsulation format. This format describes the individual structure, and is entirely dependent on the underlying data that is being exchanged. For example, if the data has a very rigid structure then that will normally enforce the format of the information. Row-based data, for example, can normally be encoded using CSV or even a fixed-width record format.

The second aspect is the encoding and formatting of the information itself. Exchanging data using CSV is fine, providing that you can correctly identify the encoding of the text itself. Should it be encoded in UTF-8? Or UTF-32? Is plain ASCII better? What if it contains a mix of these characters? Should UTF-8 be used as the standard, with the actual encoding handled by the target database when the data is imported?
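One pragmatic approach is to settle on UTF-8 as the interchange encoding and convert at the boundary. As a minimal sketch, assuming a Node.js environment and a source file that is known to be Latin-1 encoded (the file names here are invented):

var fs = require('fs');

// Read the raw bytes, decode them using the encoding we know (or have agreed)
// the source used, and write the result back out as UTF-8 for the target to import.
var raw = fs.readFileSync('recipes-latin1.csv');     // hypothetical source export
var text = raw.toString('latin1');                   // decode as ISO-8859-1/Latin-1
fs.writeFileSync('recipes-utf8.csv', text, 'utf8');  // re-encode as UTF-8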

In fact, many of the principles about translating information between different databases also rely on basic best practice for how you design and structure the data in the target system to begin with. Normalisation of the data structure can normally be applied to any database; even those that might have a loose or undefined structure have conventions. It’s unlikely, for example, that you will call a recipe title field ‘title’ in one record and ‘recipename’ in another record of the same database, because your application would be a mess.

Of course, there are times when you may be merging, combining or otherwise consolidating data from a wide variety of different documents, records or blocks of information. There it is ultimately up to you to pick a standard representation, so that the data is useful to you once it’s been moved into the target system.

As a rough guide for the types of operation and translation that might take place, the following table highlights the kind of structural transformation and changes you might need to make when moving between some of the most common database environments.

Table 2-1: Structural Mappings between database environments

                                  | RDBMS                | Columnar Store       | Document Database    | Freetext/unstructured data store
RDBMS                             | Vendor specific only | Vendor specific only | Field mappings only  | Application specific
Columnar Store                    | Vendor specific only | Vendor specific only | Field mappings only  | Application specific
Document Database                 | Field mappings only  | Field mappings only  | Vendor specific only | Application specific
Freetext/unstructured data store  | Application specific | Application specific | Application specific | Application specific

Notes:

  • Vendor specific only changes are those that are directly related to the capabilities of the source or target database. MySQL, for example, supports the ENUM and SET compound field types, whereas Oracle, PostgreSQL and SQL Server do not. Moving from one to the other may require changes.
  • Field mappings only refers to how you map the source fields or columns to the target fields/columns. Depending on the target this may include compound and/or JOIN-based translation. For example, when moving from a document database to an RDBMS you might convert a compound field into a single field, or into a lookup table. When translating from an RDBMS to a document store, the data might be combined using a JOIN into a single target field. A short sketch of the document-to-RDBMS case follows these notes.
  • Application specific changes are those that will entirely depend on how you use the information. Translating document data into a freetext database is unlikely to require any changes. But converting freetext information into an RDBMS format is going to require some significant identification and translation.
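To make the ‘field mappings only’ case a little more concrete, here is a minimal sketch in plain JavaScript of the two document-to-RDBMS options described above; the record and its keywords field are invented for illustration:

// A document with a compound 'keywords' field; the values here are invented
var doc = { id: 1085, title: "Creamy egg and leek special", keywords: ["Eggs", "Leeks", "Vegetarian"] };

// Option 1: flatten the compound field into a single delimited column
var flatRow = [doc.id, doc.title, doc.keywords.join(",")];

// Option 2: push the compound values out into lookup-table rows keyed by the parent id
var keywordRows = doc.keywords.map(function (kw) {
    return [doc.id, kw];
});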

Let’s dig into some more of the specific challenges.

Row-Based Data

For row-based data, the information can generally be represented and formatted in one of the regularly used formats, such as Character Separated Values (i.e. CSV), or in a fixed-width format. Row-based data (which includes the column-based data used in big data stores) is probably one of the easiest formats of data to exchange: in nearly all cases the list of columns is fixed, and because the columns are fixed the format of the data is well known.

Character Separated Values (not Comma-Separated Values) is one of the oldest methods of exchanging fixed-format data like this, and it was often the only available method for exchanging information in a reliable fashion. Historically, most tabulated data like this tended to be financial information, and so the content and format of the information was relatively simple. As such, the most common format was to use carriage-returns (or the operating system equivalent, which could be newlines or carriage-return and newline characters) to separate the records, while the individual fields in each row of data were separated by a comma (hence the more familiar expansion of CSV, Comma-Separated Values).

For example:

1085,Creamy egg and leek special,,4,1,0,0,0.0,0.0,0
87,Chakchouka,A traditional Arabian and North African dish and often accompanied with slices of cooked meat                      ,4,1,0,0,0.0,0.0,0
347,Mozzarella and olive pizza,A very simple pizza base made without yeast topped with traditional Italian ingredients. Look out for a low fat type of Mozzarella or Cheddar cheese if a low fat diet is being followed.,4,1,0,0,0.0,0.0,0
720,Savoury pancakes,Basic pancake recipe. Allow 30 min standing time for batter.,8,1,0,0,0.0,0.0,0
477,Seafood filling for pancakes,,8,1,0,0,0.0,0.0,0

The problem with commas and carriage-return characters is that, as computers got more complex, and the data they stored got equally more complex, how do you distinguish between a comma within some text and a comma separating a field? What if you transfer a text string that contains a newline or carriage return? You don’t want that interpreted as the end of the record if it happens to be part of the field. The initial solution is to use some kind of further delimiter. For example, using double quotes:

"1085","Creamy egg and leek special","","4","1","0","0","0.0","0.0","0"
"87","Chakchouka","A traditional Arabian and North African dish and often accompanied with slices of cooked meat                      ","4","1","0","0","0.0","0.0","0"
"347","Mozzarella and olive pizza","A very simple pizza base made without yeast topped with traditional Italian ingredients. Look out for a low fat type of Mozzarella or Cheddar cheese if a low fat diet is being followed.","4","1","0","0","0.0","0.0","0"
"720","Savoury pancakes","Basic pancake recipe. Allow 30 min standing time for batter.","8","1","0","0","0.0","0.0","0"
"477","Seafood filling for pancakes","","8","1","0","0","0.0","0.0","0"

This doesn’t fix the problem; it just diverts your attention for long enough to realize the next question: what happens if one of the delimiting characters needs to be used in the text itself? We could escape it, by prefixing it with a backslash:

"700","Garlic mushroom kebabs","The longer you leave these mushrooms to marinate, the better they will taste.nGood for barbecue.","8","1","0","0","0.0","0.0","0"

But now we’re getting complex, both to read and to write the information; the level of complexity is increasing to the point of introducing further possible ways of corrupting the data as it gets transferred.
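To illustrate how quickly the rules pile up, here is a minimal sketch, in plain JavaScript, of a field encoder that quotes every value and escapes embedded backslashes, quotes and newlines. Even this small amount of logic has to be mirrored exactly by every reader of the file:

// Quote a single field, escaping backslashes and double quotes with a backslash
function encodeField(value) {
    var escaped = String(value)
        .replace(/\\/g, "\\\\")   // escape the escape character first
        .replace(/"/g, "\\\"")    // then escape embedded double quotes
        .replace(/\n/g, "\\n")    // represent newlines so they cannot end a record
        .replace(/\r/g, "\\r");
    return '"' + escaped + '"';
}

// Join the encoded fields with commas to produce one CSV record
function encodeRecord(fields) {
    return fields.map(encodeField).join(",");
}

// encodeRecord([700, "Garlic mushroom kebabs", "Marinate overnight.\nGood for barbecue."])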

The alternative is to use a different delimiter that is unlikely to be used within the text in any form. Hadoop in fact follows this model, using the hex character 0x01 to delimit fields and 0x0A (newline) to delimit records. As binary characters these are unlikely to be used in what is normally human-readable text. Of course, once you start transferring binary data, you need to find another method, such as Base64- or hex-encoding the binary data, as in this example where the description field has been Base64 encoded:

"700","Garlic mushroom kebabs",VGhlIGxvbmdlciB5b3UgbGVhdmUgdGhlc2UgbXVzaHJvb21zIHRvIG1hcmluYXRlLCB0aGUgYmV0dGVyIHRoZXkgd2lsbCB0YXN0ZS5cbkdvb2QgZm9yIGJhcmJlY3VlLgo=,"8","1","0","0","0.0","0.0","0"

The other alternative is to use a fixed-width format. This has the advantage that, provided you know the exact widths of the individual fields, encoding errors are eliminated because the delimiter characters are no longer significant in the format of the information.

The downside of the fixed-width format is that the size of the fields, records, and ultimately files, can become prohibitively large if you are exchanging potentially large or unlimited fields. For example, BLOB types in most databases can be MB or GB in size; expressing that in fixed width format is obviously not a practical solution.
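As a minimal sketch of the fixed-width approach, again in plain JavaScript and with arbitrary example widths, each field is padded, or truncated, to a pre-agreed column width:

// Pad or truncate each value to a fixed, agreed column width
function fixedWidthRecord(values, widths) {
    return values.map(function (value, i) {
        var s = String(value);
        return s.length > widths[i]
            ? s.substring(0, widths[i])                           // truncate anything too long (data loss!)
            : s + new Array(widths[i] - s.length + 1).join(" ");  // right-pad with spaces
    }).join("");
}

// Example: id (6 characters), title (30), servings (3)
// fixedWidthRecord([1085, "Creamy egg and leek special", 4], [6, 30, 3]);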

Record-Based Data

Record-based data is information that may not necessarily be identifiable or resolvable by an easy-to-use, row-specific format or structure such as that used in CSV exchange. Complex table data, or information that is made up of a combination of fixed fields and BLOB fields, for example, is unlikely to be reliably, or efficiently, transferred this way. The problems of character and structural formats will ultimately make using that information difficult or computationally expensive when actively sharing the information – for example by making the file sizes too big to be practically exchanged.

A very typical example of record-based information is either information from a document-based database, a free-text database, or a case where the information that makes up the majority of the content is in fact an attachment or non-inline field rather than regular database data. Think of an email message; the message, the address, the from and subject lines are all examples of easily identifiable and classifiable database information. But what do you do with an attachment that might have been sent along with the record?

How about documents generally? Metadata about those documents could be represented in a typical RDBMS row, but not the document itself. The combination of the two – the metadata and the actual document – together makes up a ‘record’ that you may want to share, in whole or in part, with another database.

When transferring record-based data, think first about what constitutes the record and how that can be represented in the different environments. Then move on to understand how the fields and individual data points can be translated into a format suitable for the target database. With record-based data, it may be that you have a massive volume of data, and compound records that, when moved from a document store to a transactional RDBMS, require 20, 30 or 200 rows of data to be represented properly; this is not a problem, providing you have a method for identifying all of the row data that refers back to the original record.

In general, with a record-based database the easiest approach is to translate the data at the source into something that can be imported directly into the target format. For example, from a record-based environment you can generate four different blocks of CSV import data, one for each table and portion of the source information.

The opposite is also true; when converting data from a column or table store into a record-based format, it normally makes sense to translate the key data into the new structure before doing the transfer. As a general rule, either use the native target format if you can, or make use of representable formats such as JSON to do the actual encapsulation of the information. Many record- or document-based environments already use JSON, or a format similar to it. For example:

{
    "title": "Fried chilli potatoes",
    "preptime": "5"
    "servings": "4",
    "totaltime": "10",
    "subtitle": "A new way with chips.",
    "cooktime": "5",
    "ingredients": [
        {
            "ingredtext": "chilli powder",
            "ingredient": "chilli powder",
            "meastext": "3-6 tsp"
        },
        {
            "ingredtext": "potatoes, peeled and cut into wedges",
            "ingredient": "potatoes",
            "meastext": "900 g"
        },
        {
            "ingredtext": "vegetable oil for deep frying",
            "ingredient": "vegetable oil for deep frying",
            "meastext": ""
        }
    ]
}
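Following the approach described above, here is a minimal sketch in plain JavaScript of splitting the recipe document shown here into two blocks of row data, one per target table. The table layout is invented, and because the document carries no identifier of its own one is passed in:

// Split a single recipe document into per-table row arrays for an RDBMS-style target.
// The table and column layout here is invented for illustration.
function recipeToRows(doc, recipeId) {
    var recipeRow = [recipeId, doc.title, doc.subtitle, doc.servings,
                     doc.preptime, doc.cooktime, doc.totaltime];

    var ingredientRows = doc.ingredients.map(function (ing) {
        // Repeat the recipe identifier on every child row so the record can be reassembled later
        return [recipeId, ing.ingredient, ing.ingredtext, ing.meastext];
    });

    return { recipes: [recipeRow], ingredients: ingredientRows };
}

// Each array in the result can then be written out as a separate CSV import block.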

One final consideration is those situations where there is no structure – or the structure is so ephemeral or complex that there is no way to map the information. You should not impose a structure just for the sake of having one, or, conversely, rely on sub-standard features in the target database just because it makes your life easier.

Some examples of this include trying to extract key fields or information from unstructured data that are complicated either to identify, or to map back to the original. Extracting a telephone number from a massive text string just because you can does not mean that the telephone number you have extracted is really the one that should be associated with this record in your database. Equally, relying on full-text searching engines within RDBMS environments can be problematic.

Is that a Column or a Field?

Not all fields and columns are created equal, and many of these difficulties come from the complexities or abilities of the database environment being used. Specifically, at which point do you treat a fragment of the data that you are dealing with as a column, or a field, or even just a uniquely identifiable piece of information?

As you move between different environments, the differences become more pronounced or more difficult to identify. True columnar stores, for example, tend to support a significantly reduced set of datatypes, and that often limits your ability to store certain values and information. For example, compound values, or specialist types such as XML, GeoData and spatial points, may be storable in one database but not another.

Consider this record, taken from a CouchDB (JSON document) database:

{
   "title" : "Chicken Curry",
   "Keywords" : [
      "Indian",
      "Chicken",
      "Okra",
      "Spicy"
   ],
   "id" : "8023754"
}

Now, is the ‘Keywords’ compound object in the document a field, or is it a column? In MySQL we could translate this into a SET datatype, providing we knew what all the possible values for that column are. In Oracle, a field that has multiple possible values like this would normally either be split into separate columns as bit or boolean values, or converted to a lookup table keyed on the recipe ID.

Depending on your use case, within a strict columnar environment such as Cassandra or HP Vertica you might actually consider going the other way and repeating the information with the keyword data in individual rows, like this:

dbadmin=> select * from recipes_kwbytext;
   id    |        title         |     kw
---------+----------------------+------------
 8023754 | Chicken Curry        | Indian
 8023754 | Chicken Curry        | Chicken
 8023754 | Chicken Curry        | Okra
 8023754 | Chicken Curry        | Spicy
(4 rows)

With a column store this can be more efficient if what you are looking for is patterns in the data, because repeated data like this is easy to extract and identify. In this case, what we’ve done is convert something that is a compound field in a document store into multiple rows with the same ID in a column store. This solution can also be used in environments where there are no JOIN operations, or a JOIN is expensive, but where the information is still required at each level. Good examples here are many of the document stores and structureless environments such as Hadoop.

Can you Bypass Datatypes?

Another temptation when translating data between very different database formats is simply to ignore the formatting, and especially the very strict datatypes, that might normally define the data being stored.

This is particularly a problem within those environments where there may be a very limited set of datatypes to work with, and where the interpretation of those types can be application specific. For example, the Hive database environment within Hadoop is really little more than a thin veneer over a text format used to store the actual data. When you define a table within Hive and then select rows from the table, Hive parses each row and uses that to display the value in the corresponding format.

This can cause problems for certain data; for example, numbers that are too big, or dates that don’t match the very limited set of date formats supported by the Hive parser. In the long term, this causes corruption of the data that you have transferred.

For this reason, some people choose to create tables within Hive that use the Text datatype rather than the true underlying Integer or Floating Point type, as this ensures that the raw value, not the interpreted value, will be used.

The same process can be used when moving data; extract the ‘raw’ value rather than hope the source or target database will interpret, store and display the information in the right format.

If you are only sharing or displaying the information in the new target database then there is probably no reason to worry. If you start processing or actively using the data, this is where corruption can occur if you are not storing the information correctly. For example, if an integer is stored and then incremented, you want 10,000 to become 10,001, not 100001, which is what you get if the value is treated as a string and the digit is simply appended.
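A minimal sketch in plain JavaScript of the difference:

var asNumber = 10000;
var asText = "10000";

asNumber + 1;         // 10001    - the arithmetic you wanted
asText + 1;           // "100001" - string concatenation: silent corruption
Number(asText) + 1;   // 10001    - safe only because we convert back to the real type first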

The bottom line, you can bypass the datatype, but probably shouldn’t if you hope to use the information in the format in which you’ve decided to store it. If you have a datatype, and can identify it from the source material, then use it if the target environment can handle and interpret it correctly. See the notes later in this chapter on limitations in different environments.

Optimization and Performance

Irrespective of the database environment and the reason for moving the data, the end goal should always be to move data into the target in a format that will be efficient to use at the target end.

In many document or unstructured systems, or those with very flexible storage mechanisms such as Hadoop, the performance will often be predicated not on the structure of the information, but on what information is closest to you, or what can be pre-mapped or organized through a map reduce or index generation exercise.

Conversely, RDBMSs require highly structured and organized data structures, both with and without indexing, to provide the best performance. Columnar stores are often much more efficient if you can logically group or sort information together. Some will handle this automatically for you; others are more efficient if you can pre-determine the distribution of the data on which you are most likely to sort and query. That might mean that when you transfer the data, you sort the generated file by that column or columns before loading. In some extreme examples it may be that you load the data in an unordered format and then move it again into a new table with the right column structure.

Don’t be afraid of making the wrong decision, because you can often sort this structure out during a secondary or tertiary stage, but equally don’t ignore it. Having to parse or process large bodies of data a second or third time may be acceptable for single, isolated dumps, but it becomes impractical if you are sharing or replicating data.

Ensure Two-way Validity

Without good reason, you should always try to avoid making any kind of translation of format that cannot be reversed, undone, or translated back into its original format, even if that makes the process a little more complicated. Remember that data is often a living organism when being actively used and employed. Doing too much to format, combine, extract or otherwise manipulate the information can make it difficult to use again elsewhere.

Note that this is not about normalization. Normalization in typical database parlance means finding the right, fixed, database type for the field data, making it the most efficient choice, and understanding the limits and structure of the data so that you can decide whether a field should be 10 bytes or 12 bytes long. Doing this normally results in identifying the data structure, lookup tables, relations and joins so that you have the right structure. In this context, normalization is really about making the data look like a typical table structure in an RDBMS; normalization for document databases is entirely different. Normalisation for data interoperability is another level still, and we’ve already seen a number of different examples of that.

Instead, think about the quality of the data and how it should be used, while keeping in mind that the structure required for efficiency in an RDBMS may be completely different to the efficient storage of the same information in a document DB, or when resolved down to a text fragment in an unstructured data store.

To return to a well-trodden example, in the section ‘Is that a Column or a Field?’ we looked at the translation of compound values into single or multiple fields. Using a comma to separate the potential values means that we could split the value back out again. If the field has been correctly translated, either to boolean columns or to a linked table, it is easier to translate back again into a whole variety of formats.

When representing a compound type, think about how you would reverse the structure so that it could be used the other way round. For example, if you decide to dump the information out to another table or column structure, make sure that you add identifiers so that you can track the record it came from (which you’ll probably need anyway), and can reformat it back into that format. If you’ve split it out into multiple rows in a columnar store, make sure you know how to combine it back and deduplicate the information again if you need to send it the other way.
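As a minimal sketch in plain JavaScript of going back the other way, here the repeated keyword rows from the columnar example earlier are reassembled into a single compound field, deduplicating as we go:

// Rows as they might come back from the columnar store shown earlier
var rows = [
    { id: "8023754", title: "Chicken Curry", kw: "Indian" },
    { id: "8023754", title: "Chicken Curry", kw: "Chicken" },
    { id: "8023754", title: "Chicken Curry", kw: "Indian" }   // duplicates can creep in
];

var docs = {};
rows.forEach(function (row) {
    var doc = docs[row.id] || (docs[row.id] = { id: row.id, title: row.title, Keywords: [] });
    if (doc.Keywords.indexOf(row.kw) === -1) {   // deduplicate before recombining
        doc.Keywords.push(row.kw);
    }
});

// docs["8023754"] is now back in the original document form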

The compound types are the most complex single field type here because there are so many opportunities for you to mess up the translation, but the same is also true for basic structural information, and even more so if you decide that you only want to transfer a smaller number of fields of data from one database to another. Either transfer everything, or transfer what you need and include information (like a unique identifier) so that you can associate it back with the data you extracted it from. Once you’ve lost the context of the information, it can be impossible to get it back and the result is a useless dataset.

Database Metadata

We’ve concentrated very heavily on the actual data you are storing and want to work with, but what about the metadata?

  • Sequences and current auto-increment values? Whether you transfer or record these is going to depend on exactly how you want to use the information when it reaches your destination (or comes back).
  • How about the definition of the structure that you are using? Do you want to be able to share and use that? What happens when the structure changes? Do you want to track and identify those changes?
  • When doing a one-time export of information from one database to another you can be sure about the structure and what you expect to get from it. But what happens when you repeat the same export multiple times? Or when replicating?
  • If you are sharing data between two different systems and integrating them, knowing the sequence number may be irrelevant unless you can synchronize the generation of the number so that it can be used by the multiple databases in an effective manner. Perhaps your databases could use a better unique identification method than a monotonically increasing sequence number, such as UUIDs or a central identifier registry?

How to address these different problems will be covered in later chapters, but it’s important to think about them here as they have a knock-on effect on other areas. For example, when moving unstructured or document-based data into multiple separate tables, you need to identify and tie that information together; this is where a UUID becomes important, and it therefore becomes a critical part of the data structure that you exchange.


Extending the Tungsten Replicator Core JS Filter Functionality

Tungsten Replicator has a really cool feature in that we can filter data as it goes past on the wire.

The replicator itself is written entirely in Java, and writing filters for it is not as straightforward as it looks, which is why the much better option is to use the JavaScript mechanism and write filters using that instead. I’ll save the details of how you can write filters to process and massage data for another time, but right now I wanted to find a good way of improving that JavaScript environment.

There are filters, for example, where I want to be able to load JSON option and configuration files, or write out JSON versions of information, and plenty more.

Mozilla’s Rhino JS environment is what is used to provide the internal JS environment for running filters. Rather than creating a Rhino JS environment that can do whatever it wants, we create a JS instance specifically for executing the required functions within the filter. One of these instances is created for each filter that is configured in the system (and each batch instance too).

The reason we do this is that, for each filter, we want every transaction event that appears in the THL to be passed through the JS instance, where the filter() function in the JS filter is executed with a single argument: the event data.

The limitation of this model is that we don’t get the full Rhino environment, because we execute the JS function directly, so certain top-level items and functions like load() or require(), or utilities like JSON.stringify(), are not available. We could fix that by changing the way we do the configuration, but that could start to get messy quickly, while also complicating the security aspects of how we execute these components.

There are some messy ways in which we could get round this, but in the end, because I also wanted to add some general functionality into the filter system shared across all JS instances, I chose instead to just load a set of utility functions, written in JavaScript, into the JS instance for each filter. The wonderful thing about JS is that we can write all of these in JS, even for classes, methods and functions that aren’t provided elsewhere.

So I chose the path of least resistance, which means loading and executing a core JS file before loading and executing the main filter JS, so that we can place into that core JS file all of the utility functions we want to be available to all of the filters.

So, to enable this, the first thing we do is update the core Java code so that when we load the filter JS we load our core utility JS first. That occurs in replicator/src/java/com/continuent/tungsten/replicator/filter/JavaScriptFilter.java, within the prepare() function, which is where we instantiate the JS environment.

String coreutilssrc = properties.getString("replicator.filter.coreutils");

// Import the standard JS utility script first
try
{
    // Read and compile the core script functions
    BufferedReader inbase = new BufferedReader(new FileReader(coreutilssrc));
    script = jsContext.compileReader(inbase, scriptFile, 0, null);
    inbase.close();

    script.exec(jsContext, scope);
}
catch (IOException e)
{
    throw new ReplicatorException("Core utility library file not found: "
        + coreutilssrc, e);
}
catch (EvaluatorException e)
{
    throw new ReplicatorException(e);
}

This is really straightforward: we obtain the path to the core utilities script from the configuration file (we’ll look at how we define that later), and then compile it within the jsContext object, where our JavaScript is being executed. We add some sensible error checking, but otherwise this is simple.

It’s important to note that this is designed to load the core file *before* the main filter file, in case the filter wants to use anything defined in there.

Next, that configuration line: we can add it to the default configuration by creating a suitable ‘template’ file for tpm, which we do by creating the file replicator/samples/conf/filters/default/coreutils.tpl. I’ve put it into the filters section because it only applies to filter environments.

The content is simple, it’s the line with the location of our core utility script:

# Defines the core utility script location
replicator.filter.coreutils=${replicator.home.dir}/support/filters-javascript/coreutils.js

And lastly, we need the script itself, replicator/support/filters-javascript/coreutils.js :

// Core utility JavaScript and functions for use in filters
//
// Author: MC Brown (9af05337@opayq.com)


// Simulate the load() function to load additional external JS scripts

function load(filename) {
    var file = new java.io.BufferedReader(new java.io.FileReader(new java.io.File(filename)));

    var sb = "";
    while((line = file.readLine()) != null)
        {
            sb = sb + line + java.lang.System.getProperty("line.separator");
        }

    eval(sb);
}

// Read a file and evaluate it as JSON, returning the evaluated portion

function readJSONFile(path)
{
    var file = new java.io.BufferedReader(new java.io.FileReader(new java.io.File(path)));

    var sb = "";
    while((line = file.readLine()) != null)
        {
            sb = sb + line + java.lang.System.getProperty("line.separator");
        }

    jsonval = eval("(" + sb + ")");

    return jsonval;
}

// Class for reconstituting objects into JSON

JSON = {
    parse: function(sJSON) { return eval('(' + sJSON + ')'); },
    stringify: (function () {
      var toString = Object.prototype.toString;
      var isArray = Array.isArray || function (a) { return toString.call(a) === '[object Array]'; };
      var escMap = {'"': '\\"', '\\': '\\\\', '\b': '\\b', '\f': '\\f', '\n': '\\n', '\r': '\\r', '\t': '\\t'};
      return function stringify(value) {
        if (value == null) {
          return 'null';
        } else if (typeof value === 'number') {
          return isFinite(value) ? value.toString() : 'null';
        } else if (typeof value === 'string') {
          // Escape quotes, backslashes and control characters so the output is valid JSON
          return '"' + value.replace(/["\\\b\f\n\r\t]/g, function (c) { return escMap[c]; }) + '"';
        } else if (typeof value === 'boolean') {
          return value.toString();
        } else if (typeof value === 'object') {
          if (typeof value.toJSON === 'function') {
            return stringify(value.toJSON());
          } else if (isArray(value)) {
            var res = '[';
            for (var i = 0; i < value.length; i++)
              res += (i ? ', ' : '') + stringify(value[i]);
            return res + ']';
          } else if (toString.call(value) === '[object Object]') {
            var tmp = [];
            for (var k in value) {
              if (value.hasOwnProperty(k))
                tmp.push(stringify(k) + ': ' + stringify(value[k]));
            }
            return '{' + tmp.join(', ') + '}';
          }
        }
        return '"' + value.toString() + '"';
      };
    })()
  };

For the purposes of validating my process, there are three functions:

  • load() – which loads an external JS file and executes it, so that we can load other JS scripts and libraries.
  • readJSONFile() – which loads a JSON file and returns it as a JSON object.
  • JSON class – which does two things: it provides a JSON.parse() method for parsing strings as JSON into JS objects, and a JSON.stringify() method which will turn a JS object back into a JSON string.

Putting all of this together gives you a replicator where we now have some useful functions to make writing JavaScript filters easier. I’ve pushed all of this up into my fork of the Tungsten Replicator code here: https://github.com/mcmcslp/tungsten-replicator/tree/jsfilter-enhance
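As a quick illustration of why this is useful, here is a minimal sketch of a filter that uses readJSONFile() to pick up its options; the file path and option names are invented, and this assumes the usual prepare()/filter() entry points of a Tungsten JS filter:

// A hypothetical filter that reads its options from a JSON file using the new
// core utilities; the path and option names are invented for illustration.

var options;

function prepare()
{
    // readJSONFile() is provided by coreutils.js, which is loaded before this filter
    options = readJSONFile("/opt/continuent/share/myfilter-options.json");
    logger.info("Filter options loaded for schema: " + options.schema);
}

function filter(event)
{
    // The event can now be inspected or modified using the loaded options;
    // the actual event handling is a topic for a separate post.
}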

Now, one final note. Because of the way load() works, in terms of running an eval() on the code to import it, there is one final step required to make loaded functions usable. To explain what I mean, let’s say you’ve written a new JS filter using the above version of the replicator.

In your filter you include the line:

load("/opt/continuent/share/myreallyusefulfunctions.js");

Within that file, you define a function called runme():

function runme()
{
     logger.info("I'm a bit of text");
}

Now within myreallyusefulfunctions.js I can call that function fine:

runme();

But from within the JS filter, runme() will raise an unknown function error. The reason is that we eval()‘d the source file within the load() function, and so its context is wrong.

We can fix that within myreallyusefulfunctions.js by exporting the name explicitly:

if (runme.name) this[runme.name] = runme;

This points the parent namespace at the runme() defined in this context; we put that at the end of the myreallyusefulfunctions.js script and everything is fine.

I’m lazy, and I haven’t written a convenient function for it, but I will in a future blog.

Now we’ve got this far, let’s start building some useful JS functions and functionality to make it all work nicely…


Using the Continuent Docs

As has hopefully been noticed, the Continuent documentation is achieving a pretty good critical mass. The content of the documentation is always the most important consideration. Secondary to that is making sure the information in the documentation can be found, and that when reading you can hover and click to get relevant information, so that you can understand the content being provided even better.

We’ve got a few different solutions and tips that I think are worth highlighting so that people can use the documentation more effectively.

Searching

When you want to look for something in the documentation, use the search bar right up at the top. The search is available both on the Documentation Library page and within individual documents.

[Screenshot: the documentation search bar]

When used on the Documentation Library page, search shows you potential matches across all the documentation for the word or item you are searching for. For example, here I’ve searched for FAQ; entries are ranked by manual, according to release:

[Screenshot: search results for ‘FAQ’ across the Documentation Library]

When searching within a document, you get shown the items within this document first, followed by matches within other documents:

[Screenshot: search results within a single document]

The search content itself is heavily indexed and designed so that you should go to the right item as the first one in the list.

It works both on wide terms, for example Filters, and on commands, command-line arguments and options within a typical command. For example, type ‘trepctl status’ and you will get not only the key command, but all of its derivatives. Type in an option, like ‘-at-event’, and you’ll get the explicit entry for that item.

[Screenshot: search results for a command option]

Note that the search is very deliberately not a free-text search. This is to ensure that you get to exactly the right page, rather than all the pages that might mention ‘trepctl status’.

Hover Highlights

When reading the documentation you might come across some terms or information that you are not familiar with. In this case, hover over the item and you’ll get a definition.

[Screenshot: hover definition for a highlighted term]

Click the highlighted item, and you’ll get taken to the reference page for that specific item.

Deep Linking

I mentioned the mechanics of this process recently, but the use-case within the documentation is that virtually everything of significance is automatically linked to the right, canonical, page for the information.

For example, in the image below, there are links to the various ONLINE and OFFLINE states that can be clicked on, and the same is true for nearly all filenames, options, commands, and all combinations thereof.

[Screenshot: deep links to the ONLINE and OFFLINE states]

Related Pages

In certain sections, links to other pages that might be useful to the current discussion, but which are not directly linked to within the text, are listed in the sidebar.

This is supported for related pages:

[Screenshot: related pages sidebar]

FAQ entries:

[Screenshot: FAQ entries sidebar]

We don’t have entries yet, but release note and Error/Cause/Solution (troubleshooting) links are supported too. Note that these links only appear on pages that have the related items.

Table of Contents Navigation

Immediately above the related pages is the basic navigation section. These are divided into:

  • Parent Sections – these are sections at the same level as the current page that you might want to jump to. For example, you can easily jump from Fan-In to Star deployments.
  • Navigate Up – Goes up to the parent section.
  • Chapters – A list of all the chapters and appendices in this manual.

Other Manuals

For each page in each manual we also provide a link to the same page in other manuals. There are two reasons for this, the first is so that you can compare or jump to differences in other versions of the same manual. The second is to jump between the Tungsten Replicator and Continuent Tungsten if you find yourself in the right page, but the wrong product manual.

[Screenshot: links to the same page in other manuals]

So as you can see, there’s a lot more to the docs than just the content (critical though that is), and hopefully this has helped to explain how usable the documentation is and, more importantly, how easy it should be to find the information you need.


More Datacenters in a Container

Not exactly portable (unless you have a container lorry or ship handy), but I’d love to have one of Sun’s Modular Datacenters available to use.

The Modular Datacenter (otherwise known as Project Blackbox) squeezes 7 or 8 racks, depending on the configuration, into a standard shipping container, which equates to 240 or 280U of rackspace. The idea is that you can deploy one of these containers very quickly with your desired set up, either to handle all of your datacenter requirements (which it could), or as a quick solution for a particular project.

Interestingly, I can also see it as a potential solution for cloud computing. Not only would one of these containers make an excellent datacenter in its own right, but the ability to quickly expand capacity when you need it kinda takes the idea of cloud computing and easily expanded capacity to another level.

Sun are not the only people doing these datacenters in a container though:

All of them promote the density and power of the systems and, more importantly in today’s climate, their energy efficiency.

In an ideal world we’d line these all up next to each other and compare them, but I hardly think that would be practical. It would, however, be interesting to see what sort of power, performance and flexibility these systems offered.

System Administration Toolkit: Set up remote access in UNIX through OpenSSH

OpenSSH has become the de facto standard for connecting to remote machines in a secure way. Not only do you use it for simple interactive terminal access, but also for transferring files, and as a secure backbone for services like Subversion.

In System Administration Toolkit: Set up remote access in UNIX through OpenSSH I show you how to set up OpenSSH, how to copy files using sftp and scp, and how to set up password-less login by copying across your OpenSSH keys.

Read System Administration Toolkit: Set up remote access in UNIX through OpenSSH

Tracking and allocating time

I’m going to try a new technique this week to see where my time goes.

I’m going to keep a log of everything that I do. This is a common technique employed and suggested by many people as a way to see where you are spending your time.

However, rather than doing this in isolation, I’m going to combine the tracking of what I do with an attempt to pre-allocate what I expect to spend my time doing. For example, I know what projects I have outstanding and when they need to be completed. I also have some meetings booked in next week. I’m going to combine what I know I’ll be doing (meetings) with what I should be doing (projects) and then allocate the time accordingly.

At the end of the two weeks, I’m going to compare the two - planned and actual. I should be able to determine:

  • What I spent my time doing.
  • Whether my estimates of the time required to complete a project matched the actual time taken. This will help me plan future weeks.
  • Where my weaknesses and potential interruptions lie.
  • Why and when I’m most likely to get sidetracked.

I’ll report back on my progress. Meanwhile, if anybody wants to perform the same task - planning and recording - and discuss their experiences, please leave them here :)

Simplifying your life

Henry David Thoreau said:

As you simplify your life, the laws of the universe will be simpler; solitude will not be solitude, poverty will not be poverty, nor weakness weakness.

Among the many things I’m going to try is simply reducing the amount of time I spend dealing with things that aren’t work. If you think about your current environment, you’ll probably find that there are a great number of tasks in your life which exist only because they support some aspect of your life out of which you gain very little.

For example, just recently I sat down and made a list of the demands on my time - ignoring the obvious entries of actually writing a book, article or other elements, these are the other elements that take up a good portion of my time, but don’t actually relate to doing any work.

  • Reading magazines and newspapers - often these will pile up to the point that I either have to spend a day going through them, or I simply throw them away because they are now so old or out of date that it is not worth reading their contents.
  • Filing - there’s not much I can do, immediately, to improve this, but the amount of filing is impacted by other things.
  • Accounts - again, there’s not a lot I can do here, although anything I can do to reduce the number of accounts or the complexity would help.
  • IT Management - as a self employed person I don’t have the benefit of a dedicated IT department to deal with issues like backups, faults, or installing new hardware and software.
  • Mailing lists - I’m on a lot of postal mailing lists, despite always ticking the ‘Don’t share my details’ box; this generates post, which in turn need to be filed or shredded.
  • Computer mailing lists - I’m on even more computer mailing lists, many (no, most) of which I never get a chance to read, but which I still have to delete or file.

If we extend this to items outside of work that still have an influence we can include:

  • More magazines and comics - some of these I save, and that means I have to seal up and catalog them.
  • More accounts - additional savings accounts, credit cards and others, some of which require monthly or quarterly management.
  • Personal memberships - I belong to quite a few clubs and societies, some of which I get little benefit from.

By making the lists I was able to ask some serious questions about how these things affect my time:

  • If I never get to read the magazines, why do I buy them?
  • If, as in some cases, I’m only buying the comic to keep it as a potentially valuable item, am I wasting more money dealing with it now than it’s worth later?
  • If I simplified the way I operated the accounts (merged some, removed others that I don’t fully use) there would be less to process, and fewer items to file.
  • If I reduced the clubs and organizations I’m a member of to those that I get the most benefit from, I’d again simplify (a small) element of the accounts, save money, and reduce the time required to deal with the paperwork.
  • If I returned unsolicited mail, I wouldn’t have to deal with it anymore.
  • If I simplified my IT infrastructure and lowered the amount of work required to keep it going, I could save myself a few hours every week that could be better spent elsewhere.

So here’s my first action plan of changing the way I work:

  1. Simplify my accounts, close those I’m not using, merge some others (particularly investments).
  2. Arrange to pay the accounts that can be covered through direct debits or standing orders.
  3. Do the accounts, including data entry, as I get the information. This means the accounts file will never build up, and I’ll never have to think back to what a particular charge or other item relates to come year end or tax time.
  4. Close subscriptions to magazines and clubs out of which I get little or no benefit.
  5. Return every small piece of mail that I didn’t either request or expect to reduce the amount of junk mail that I have to deal with.
  6. Unsubscribe from magazines and newspapers I don’t read.
  7. Unsubscribe from the computer mailing lists I don’t read.
  8. Get rid of my ‘filing’ pile - instead, I’ll now try and file items as soon as they come in. For those that need processing (for example data entry, payment etc.), I’ll do it that day (or ASAP if I’m out) and then file it immediately.
  9. Simplify my IT environment. I’m going to cut the number of servers I have from six to two, and change both new machines to the same platform.

This approach will require some changes into long-held habits and practices (like creating a pile of filing or accounts to be done in one big batch), but I’m hoping it will prevent the piles building up, and that should, in turn, reduce the feeling of dread when I have to deal with them. It should also help me keep a clean desk and reduce the amount of time I spend either throwing away or otherwise filing material that never gets to the see the light of day.

The whole experience - listing what I do in a typical day or week - has been an incredibly useful way of determining where my time goes, and if nothing else I now have a better idea of what I spend my time on, rather than the belief that at the end of the day I’ve achieved nothing.

Rosin

If you listen to ‘The Devil Went Down to Georgia’ by the Charlie Daniels Band there’s a line that reads:

Johnny, rosin up your bow and play your fiddle hard

After years of listening to this I’d always wondered what ‘rosin up your bow’ meant.

The word rosin is the key part of the phrase, and rosin refers to a substance used to treat the bows of stringed instruments (including the fiddle heard in the tune). Rosin itself is the solid amber residue that remains after turpentine oleoresin or naphtha from pine trees is distilled. Rosin is not just used for bow strings; it’s also used in adhesives, varnishes, inks and, in powdered form, as an anti-slip agent (similar to the use of chalk in weightlifting or gymnastics).

Wikipedia has an excellent page that goes into more detail.

Welcome!

Sometimes I don’t feel like I’ve achieved as much as I could have during the day. Sometimes that feeling lasts for weeks.

In an effort to change that, and hopefully improve my performance and the amount of work I achieve in a day, I’m going to try a variety of different techniques in an attempt to find the one that works for me, and then report back on the progress I’ve made. Hopefully during that process I’ll cover some techniques and methodology that might help other people to change the way they work and help their situation.

Change the way you work is therefore all about ways to improve your working environment, and hopefully help you to improve your performance and your ability to get your job done.

Along the way we’ll look at books, techniques, methodologies, hardware, software and other tools that can help.

If you’ve got a particular technique or technology that you think might help, use the Contact Us form.