XML Importer overview (a.k.a. TCP 2.0)

Table of Contents

1. Introduction
2. Import scenarios supported by TCP semantics
3. TCP 2.0: a new sophisticated import scenario: Find and Merge
4. Find and Merge scenario details
4.1. Object similarity
4.2. Merging objects
4.3. Unreferenced objects
4.4. Feedback
5. Extensions to the TCP-semantics: TCP 2.0
5.1. new object operator: mergeObjects (objectType)
5.2. objectFinder
5.3. objectMerger
5.4. new parameter for createObject: disposeWhenNotReferenced
5.5. new parameter for transactions: reportFile
5.6. SimilarObjectFinder interface
5.7. ObjectMerger interface
6. TransactionHandler enhancements (performance and usability)
6.1. Performance
6.2. Usability
7. TCP 2.0 Syntax
7.1. Transactions context
7.2. Transaction contexts
7.3. Object contexts
7.4. Object Merge contexts
8. TCP 2.0 DTD

The goal of the XML Importer is to extend MMBase with new XML import facilities that support bulk import of data from different sources (e.g. third parties).

This document gives an overview of the XML Importer that has been contributed to the open source distribution of MMBase. The XML Importer is largely based on an implementation that was built and tested for the VPRO.

Before the XML Importer project, one way to import data in bulk efficiently was by means of XML-defined transactions, using the vocabulary defined by the Temporary Cloud Project (TCP). While these semantics are sufficient to populate empty MMBase tables with new objects, they are very limited in other situations.

MMBase supports two import scenarios through TCP semantics. The simplest scenario for bulk data input is adding new object graphs.

As a result the new objects and relations are added to MMBase. If similar objects were already present in MMBase, this would result in duplicates.

To demonstrate how this translates to TCP semantics, the following example adds two new objects (types "movies" and "persons") with one relation ("director"):

		<createObject id="m1" type="movies">
			<setField name="title">psycho</setField>
			<setField name="year">1960</setField>
		</createObject>

		<createObject id="p1" type="persons">
			<setField name="firstname">Alfred</setField>
			<setField name="lastname">Hitchcock</setField>
		</createObject>

		<createRelation type="director" source="m1" destination="p1"/>

The second and slightly more advanced TCP scenario adds new object graphs that involve existing MMBase objects as well.

This results in both new objects and new relations, involving both new and existing objects.

The disadvantage of the latter scenario is that the MMBase objects involved have to be identified explicitly by their MMBase id. Because of this, we cannot define such a transaction without first inspecting the existing MMBase objects.
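
As a sketch of this scenario, the following transaction fragment creates a new movie and relates it to a person that already exists in MMBase. It uses only elements that are declared in the DTD at the end of this document. The mmbaseId value 12345 is an assumed example number that would have to be looked up in MMBase beforehand, and the fragment assumes that a relation may refer to the local id given to an accessed object.

```xml
<!-- Existing object: give it a local id so the relation can refer to it. -->
<!-- The mmbaseId 12345 is an assumed example value. -->
<accessObject id="p1" mmbaseId="12345"/>

<!-- New object, created by this transaction. -->
<createObject id="m1" type="movies">
	<setField name="title">psycho</setField>
	<setField name="year">1960</setField>
</createObject>

<!-- New relation between the new movie and the existing person. -->
<createRelation type="director" source="m1" destination="p1"/>
```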

This example demonstrates how an existing MMBase object can be accessed within a transaction, to set its fields to new values:

<accessObject id="p1" mmbaseId="12345">
	<setField name="firstname">Alfred</setField>
	<setField name="lastname">Hitchcock</setField>
</accessObject>

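The XML Importer introduces a more sophisticated scenario, Find and Merge, that takes the objects already present in MMBase into account and so avoids creating duplicates.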

Formally speaking: we present MMBase with the object (sub)graphs that we want to be in MMBase. This approach focuses on the desired result, instead of detailing the steps to be taken. It shifts the burden of detailing all the steps to the side of MMBase.

In order to see how to accomplish this, let's look at an example. We want to add the same objects ("persons" and "movies") and relation ("director") as in the previous examples, but this time following the proposed scenario, which avoids duplicates by taking into account the objects already present in MMBase.

This transaction will result in both objects and the relation being present in MMBase, regardless of what was present before the transaction, and without duplicates being created. This is the behavior we are looking for.
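
As a rough sketch only, a transaction for this scenario might look as follows, using the mergeObjects, objectMatcher and objectMerger elements declared in the DTD at the end of this document. The param name and value ("fields", "title,year" and so on) and the objectMerger class name are hypothetical placeholders, not names defined by MMBase.

```xml
<transactions>
	<create>
		<!-- Describe the desired objects and relation, as in the first example. -->
		<createObject id="m1" type="movies">
			<setField name="title">psycho</setField>
			<setField name="year">1960</setField>
		</createObject>
		<createObject id="p1" type="persons">
			<setField name="firstname">Alfred</setField>
			<setField name="lastname">Hitchcock</setField>
		</createObject>
		<createRelation type="director" source="m1" destination="p1"/>

		<!-- Merge input movies with similar movies already in MMBase, if any. -->
		<mergeObjects type="movies">
			<!-- No class attribute: the DTD default for objectMatcher applies. -->
			<objectMatcher>
				<!-- Hypothetical parameter: fields used to decide similarity. -->
				<param name="fields" value="title,year"/>
			</objectMatcher>
			<!-- Hypothetical merger class, for illustration only. -->
			<objectMerger class="com.example.ExampleMerger"/>
		</mergeObjects>

		<!-- Same for persons. -->
		<mergeObjects type="persons">
			<objectMatcher>
				<param name="fields" value="firstname,lastname"/>
			</objectMatcher>
			<objectMerger class="com.example.ExampleMerger"/>
		</mergeObjects>
	</create>
</transactions>
```

The merge elements come after the create elements because the DTD content model for create lists mergeObject and mergeObjects last.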

Note that the steps presented in the previous paragraph are actually very straightforward, and can easily be formalized to cover the general case of many objects and relations. Also note that the results depend on a notion of object similarity, so let's look into what we really mean by that (having avoided using the word "equality").

Processing of a transaction may fail, either because more than one similar object is found for an input object or because an error occurs. This does not stop the whole import: the current transaction is canceled and the Importer continues with the next transaction. All transactions without duplicates or errors are committed to MMBase.

If more than one similar object is found for an input object, the complete transaction is appended to a report file. In the next stage, duplicates_transactions.XML is processed: the user is consulted to decide which merge result is preferred.

Example: suppose two similar objects (B and C) are found for an input object A. The user is shown on screen (probably on a JSP page) the original input object (A) and, for every similar object, the merge result: (A+B) and (A+C). The user then selects the preferred merge result. Processing of the corrected transaction can continue in a subsequent processing cycle.

All other kinds of errors, such as a syntax error (XML that does not conform to the DTD), an object field that is not found, or an object that is not found, are handled as follows. Transaction processing is canceled, an entry is written to a report file, and the full transaction is written to a file (e.g. error_transactions.XML). The user can consult the report file afterward to review the transactions that went wrong. The report file will contain all information necessary to correct the problems and give these transactions a second try.
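
The DTD at the end of this document declares a reportFile attribute on the transactions element. A minimal sketch of how a report file might be requested is shown below. The file name is an assumed example, and how the path is resolved is not specified here.

```xml
<!-- reportFile is declared in the DTD; the file name is an assumed example. -->
<transactions reportFile="import_report.txt">
	<!-- Each create element is a separate transaction, so a failure in one
	     is meant to leave the other unaffected. -->
	<create>
		<createObject id="m1" type="movies">
			<setField name="title">psycho</setField>
		</createObject>
	</create>
	<create>
		<createObject id="p1" type="persons">
			<setField name="lastname">Hitchcock</setField>
		</createObject>
	</create>
</transactions>
```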

To implement the Find and Merge scenario, an extension of the TCP semantics is necessary. These extended semantics allow us to instruct the Transaction Handler to carry out the required tasks.

Further enhancements will make TCP functionality easier to use than it is now; its current limitations are due to its SCAN heritage.

The complete syntax for the XML-compliant TCP 2.0 transaction language is presented here. See also Transactions.dtd. TCP 2.0 is an extended version of TCP; see the TCP project for details.

The TCP 2.0 language is strongly hierarchical. There is one 'Transactions context', which can contain multiple 'Transaction contexts', each of which can contain multiple 'Object contexts' or 'Object merge contexts'. Within an 'Object context', multiple fields can be defined. Within an 'Object merge context', multiple parameters can be defined.

(The names 'Transactions context' and 'Transaction contexts' might lead to some confusion. We are tied to those names because TCP 2.0 has to be backwards compatible with TCP.)
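
The nesting described above can be pictured with a small skeleton. This is only an illustration built from the element names in the DTD below; the mapping of context names to elements in the comments is an interpretation, and the param name and value are hypothetical.

```xml
<transactions>                                  <!-- Transactions context -->
	<create>                                    <!-- Transaction context -->
		<createObject id="p1" type="persons">   <!-- Object context -->
			<setField name="lastname">Hitchcock</setField>   <!-- field -->
		</createObject>
		<mergeObjects type="persons">           <!-- Object merge context -->
			<objectMatcher>
				<!-- parameter; name and value are hypothetical -->
				<param name="fields" value="lastname"/>
			</objectMatcher>
			<!-- No class attribute: the DTD default applies. -->
			<objectMerger/>
		</mergeObjects>
	</create>
</transactions>
```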

<?xml version="1.0" encoding="UTF-8"?> <!-- dec. 1st. 2001 -->

<!ELEMENT transactions (create | open | commit | delete)* >
<!ATTLIST transactions exceptionPage CDATA #IMPLIED>
<!ATTLIST transactions reportFile CDATA #IMPLIED> <!-- TCP2.0 -->
<!ATTLIST transactions key CDATA #IMPLIED>

<!ELEMENT create ((createObject | createRelation | openObject | accessObject | deleteObject | markObjectDelete)*, mergeObject*, mergeObjects*) > <!-- TCP2.0 added mergeObjects* -->
<!ATTLIST create commit (true | false) "true">
<!ATTLIST create timeOut CDATA "60">

<!ELEMENT open ((createObject | createRelation | openObject | accessObject | deleteObject | markObjectDelete)*, mergeObject*, mergeObjects*) > <!-- TCP2.0 added mergeObjects* -->
<!ATTLIST open commit (true | false) "true">

<!ELEMENT commit EMPTY >

<!ELEMENT delete EMPTY >

<!-- OBJECTS -->
<!ELEMENT createObject (setField*)>
<!ATTLIST createObject id CDATA #IMPLIED>
<!ATTLIST createObject type CDATA #REQUIRED>
<!ATTLIST createObject disposeWhenNotReferenced (true | false) "false"> <!-- TCP2.0 -->

<!ELEMENT createRelation (setField*)>
<!ATTLIST createRelation id CDATA #IMPLIED>
<!ATTLIST createRelation type CDATA #REQUIRED>
<!ATTLIST createRelation source CDATA #REQUIRED>
<!ATTLIST createRelation destination CDATA #REQUIRED>

<!ELEMENT openObject (setField*)>

<!ELEMENT deleteObject EMPTY >
<!ATTLIST deleteObject id CDATA #REQUIRED>

<!ELEMENT accessObject (setField*)>
<!ATTLIST accessObject mmbaseId CDATA #REQUIRED>
<!ATTLIST accessObject id CDATA  #IMPLIED>

<!ELEMENT markObjectDelete EMPTY >
<!ATTLIST markObjectDelete mmbaseId CDATA #REQUIRED>
<!ATTLIST markObjectDelete deleteRelations (true | false) "false">

<!ELEMENT mergeObject (objectMerger) > <!-- TCP2.0 -->
<!ATTLIST mergeObject from CDATA #REQUIRED>

<!ELEMENT mergeObjects (objectMatcher, objectMerger) > <!-- TCP2.0 -->
<!ATTLIST mergeObjects type CDATA #REQUIRED > <!-- TCP2.0 -->

<!ELEMENT objectMatcher (param*) > <!-- TCP2.0 -->
<!ATTLIST objectMatcher class CDATA "org.mmbase.module.tcp.match.NodeMatcher" > <!-- TCP2.0 -->

<!ELEMENT objectMerger (param*) > <!-- TCP2.0 -->
<!ATTLIST objectMerger class CDATA "org.mmbase.module.tcp" > <!-- TCP2.0 -->

<!-- FIELDS -->
<!ELEMENT setField (#PCDATA) >

<!-- PARAMETERS --> <!-- TCP2.0 -->
<!ELEMENT param EMPTY> <!-- TCP2.0 -->
<!ATTLIST param name CDATA #REQUIRED> <!-- TCP2.0 -->
<!ATTLIST param value CDATA #REQUIRED> <!-- TCP2.0 -->

This is part of the MMBase documentation.

For questions and remarks about this documentation mail to: documentation@mmbase.org