Solr Data Import Handler

This week, I had the opportunity to write a data import handler (DIH) configuration for the Solr search server that maps a MySQL database onto the Solr schema. Before this, I had been writing small scripts that output XML, because the underlying data wasn’t neatly contained in a single document or database. The DIH is a new feature in Solr 1.3, and it makes integrating search almost trivial: anyone who can write an SQL query can begin replacing a database’s built-in fulltext engine with a Solr service, which offers more flexibility, efficient faceting, and a document-centric view appropriate for search.

The basic skeleton looked something like this:

<dataConfig>
    <dataSource driver="com.mysql.jdbc.Driver" batchSize="-1" url="jdbc:mysql://localhost:3306/cms?zeroDateTimeBehavior=convertToNull" user="root" />
    <document name="doc">
        <entity transformer="RegexTransformer" name="page" query="SELECT ... FROM ... JOIN ... JOIN ... JOIN ...">
            <field column="title" name="dc.title" />
            [...]
            <field column="names" splitBy="," name="dc.contributor" />
        </entity>
    </document>
</dataConfig>
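
For Solr to load a data config like the one above, the handler also has to be registered in solrconfig.xml. A minimal sketch, assuming the config is saved as data-config.xml in the core's conf directory and exposed at /dataimport (both names are assumptions, not taken from the setup described here):

```xml
<!-- solrconfig.xml: register the DataImportHandler.
     The handler path "/dataimport" and the file name "data-config.xml"
     are assumed for illustration. -->
<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <!-- Location of the dataConfig file, relative to the conf directory -->
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>
```

With that in place, a full import can be started by requesting the handler with command=full-import, for example http://localhost:8983/solr/dataimport?command=full-import (host and port are those of the stock Solr example server; adjust them to your setup).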

A few things to note:

First, in the dataSource configuration, I’ve set batchSize="-1". This makes the MySQL JDBC driver stream rows one at a time instead of buffering the whole result set, which keeps Solr (and the servlet engine) from running out of memory.

Second, in the JDBC URL, I’m using zeroDateTimeBehavior=convertToNull, which is a very easy way of dealing with those pesky “0000-00-00 00:00:00” dates that normally come out of the database: the driver returns them as NULL, which allows Solr to gracefully skip that field.

Third, in some multivalued field declarations (like names -> dc.contributor), I’m using the RegexTransformer and its splitBy attribute to split a MySQL GROUP_CONCAT() column back into individual values. This at least saves a query, and it pushes more of the data-marshaling logic into the SQL query, leaving the Solr mapping fairly straightforward.
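
To make the GROUP_CONCAT() and splitBy pairing concrete, here is a rough sketch of what such an entity could look like. The tables and columns (page, page_author, author, and the id field) are hypothetical stand-ins, since the real query is elided above:

```xml
<!-- Hypothetical schema: page(id, title), page_author(page_id, author_id),
     author(id, name). Only the GROUP_CONCAT/splitBy pairing mirrors the post. -->
<entity name="page" transformer="RegexTransformer"
        query="SELECT p.id, p.title,
                      GROUP_CONCAT(a.name SEPARATOR ',') AS names
               FROM page p
               LEFT JOIN page_author pa ON pa.page_id = p.id
               LEFT JOIN author a ON a.id = pa.author_id
               GROUP BY p.id, p.title">
  <field column="id" name="id" />
  <field column="title" name="dc.title" />
  <!-- splitBy breaks the joined string "Ann,Bob" into two dc.contributor values -->
  <field column="names" splitBy="," name="dc.contributor" />
</entity>
```

Two caveats with this approach: the target field (dc.contributor here) has to be declared multiValued="true" in schema.xml, and MySQL truncates GROUP_CONCAT() output at group_concat_max_len (1024 bytes by default), so long lists may need that setting raised. A separator that cannot occur inside a value is also safer than a plain comma.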

The Solr transformers look incredibly powerful and are almost certainly worth pursuing further. One update I eagerly await is the integration of the DIH with Solr Cell, a text and metadata extraction service, under [#SOLR-1358], which would let you merge previously extracted (or entered) metadata with the fulltext of documents. Once this feature is added, I think I can pretty much give up on my transforming scripts and switch to the DIH for all purposes.


2 Responses to Solr Data Import Handler

  1. davidbhon says:

    hi chris,

    i have installed fedora3.2.1 with its jms option (evidently embedded activemq 5.1.0)
    but i’d rather not use an embedding. in fact i want to try the newly released activemq
    5.3.0. my goal is to setup a stand-alone activemq (with the included admin console),
    and stand-alone solr (1.3 or 1.4 whenever it comes out) that ‘consumes’ messages
    ‘produced’ by fedora whenever the repository content changes. have you succeeded in anything like this?

    cheers,
    –david

    • chris says:

      Yes, that’s actually the same setup I used. I set up a standalone ActiveMQ 5.2 server running under Jetty listening on port 61616.

      In fedora.fcfg, change the java.naming.provider.url parameter from

      to
