Digital Asset Management for Public Broadcasting: Solr (Part 2 of ??)

The Lucene-based Apache Solr is an incredible platform for building decent search experiences, especially compared to the “more traditional” database-driven approach, where a pile of SQL JOINs makes it difficult to efficiently add search features like stemming, ASCII-folding, term highlighting, facets, and synonyms. I would argue these are essential parts of the discovery experience, and you essentially get them for free with Solr. Another benefit Solr provides is a foundation for many light-weight interfaces on top of a single index (or across multiple indexes, because Solr enforces some decent scalability principles that make expanding to task-based indexes easier).

For a DAM project, each asset should appear in the search index with the basic layer of contributed metadata, relationships, metadata extracted from the assets, and the administrative metadata managed by Fedora. I would align the fields with the Dublin Core (and DCTerms) elements (which is probably all you can get users to contribute in any case). At this point, because legacy systems lack authority control, linked data, and the like, existing metadata is sparse, inaccurate, or limited. That sets the entry-level bar pretty low, so ease of use and metadata collection are the priorities. Eliding a lot of detail, here’s the skeleton schema:

  <field name="id" type="string" indexed="true" stored="true" required="true" />
  <field name="title" type="string" indexed="true" stored="true" multiValued="true"/>
  <field name="description" type="string" indexed="true" stored="true"/>

  <dynamicField name="dc.*" type="string" indexed="true" stored="true" multiValued="true"/>
  <dynamicField name="dcterms.*" type="string" indexed="true" stored="true" multiValued="true"/>
  <dynamicField name="rdf.*" type="string" indexed="true" stored="true" multiValued="true"/>
  <field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
  <field name="payloads" type="payloads" indexed="true" stored="true"/>
  <field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>

  <copyField source="title" dest="title_t" />
  <copyField source="subject" dest="dc.subject" />
  <copyField source="description" dest="description_t" />
  <copyField source="comments" dest="text" />
  <copyField source="dc.creator" dest="author" />
  <copyField source="dc.*" dest="text" />
  <copyField source="text" dest="text_rev" />
  <copyField source="payloads" dest="text" />

  <copyField source="dc.title" dest="dc.title_t" />
  <copyField source="dc.description" dest="dc.description_t" />
  <copyField source="dc.coverage" dest="dc.coverage_t" />
  <copyField source="dc.contributor" dest="dc.contributor_t" />
  <copyField source="dc.subject" dest="dc.subject_t" />
  <copyField source="dc.contributor" dest="names_t" />
  <copyField source="dc.coverage" dest="names_t" />
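
The skeleton refers to a `text` field type without showing its definition. As a rough sketch only (not the actual type from this project), an analysis chain covering the features mentioned earlier (ASCII-folding, synonyms, stemming) might look something like the following. The filter order and the `synonyms.txt` file name are assumptions, and a real setup might well use separate index-time and query-time analyzers.

```xml
<!-- Hypothetical definition of the "text" type used by the catch-all field above -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- split on word boundaries -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- fold accented characters to their ASCII equivalents -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- synonyms.txt is an assumed file in the core's conf directory -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <!-- English stemming, so "broadcasts" also matches "broadcast" -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```

The `dc.*` fields stay as plain strings, which suits faceting, while the `*_t` copies and the catch-all `text` field would carry the analyzed versions used for searching.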

The new edismax query parser provides such a good balance of flexibility, advanced query features, and ease of use that it seems like an obvious choice here.
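
To make that a bit more concrete, here is a hedged sketch of a request handler in solrconfig.xml that defaults to edismax. The handler name `/search`, the boost values, and the choice of `dc.subject` as a facet are all illustrative assumptions. It also assumes that `title_t`, `names_t`, and `description_t` (the copyField destinations in the skeleton) are defined as analyzed text fields in the elided part of the schema.

```xml
<!-- Hypothetical handler; name, boosts, and facet field are illustrative only -->
<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- show everything when the user submits an empty query -->
    <str name="q.alt">*:*</str>
    <!-- fields searched, weighted towards titles and names -->
    <str name="qf">title_t^5 names_t^3 description_t^2 text</str>
    <!-- extra boost when the whole query appears as a phrase in the title -->
    <str name="pf">title_t^10</str>
    <str name="facet">true</str>
    <str name="facet.field">dc.subject</str>
    <str name="facet.mincount">1</str>
  </lst>
</requestHandler>
```

With defaults like these, a client only needs to send `q=...` and still gets weighted, faceted results, which keeps the light-weight interfaces on top of the index simple.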

The only penalty you pay for using Solr is having to keep the Solr index synchronized with your data sources. For synchronizing data from Fedora, there is now a proliferation of options, ranging from task-specific Java plugins like GSearch and Shelver to more generic tools (ESBs and all that) like Apache Camel or the Ruote-based Fedora Workflow component. Because DAM likely involves many different workflows, I lean towards the more generic solutions. Lately, I’ve given Camel a try, and after a couple of days of Java-dependency-induced head pounding, I have something that works.
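
For a sense of the general shape of the Camel approach, here is a very rough sketch in Camel 2.x Spring XML. It is not the route I actually ended up with. It assumes Fedora 3.x messaging is enabled and publishing to the `fedora.apim.update` topic, that an `activemq` component is already configured against Fedora's broker, and that Solr is listening on `localhost:8983`. The stylesheet `fedora-message-to-solr.xsl` is hypothetical.

```xml
<camelContext xmlns="http://camel.apache.org/schema/spring">
  <route>
    <!-- Fedora publishes an Atom entry here for each management API change (assumed topic name) -->
    <from uri="activemq:topic:fedora.apim.update"/>
    <!-- hypothetical stylesheet on the classpath: Atom entry in, Solr <add><doc> message out -->
    <to uri="xslt:fedora-message-to-solr.xsl"/>
    <!-- POST the transformed document to Solr's XML update handler -->
    <setHeader headerName="CamelHttpMethod"><constant>POST</constant></setHeader>
    <setHeader headerName="Content-Type"><constant>text/xml</constant></setHeader>
    <to uri="http://localhost:8983/solr/update"/>
  </route>
</camelContext>
```

The update message itself identifies the object and the change rather than carrying the full metadata, so a realistic route would also need a step that fetches the relevant datastreams from Fedora before transforming them. The sketch also leaves commits to Solr's autoCommit settings.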

On Twitter, John Tynan requested a virtual machine image to encourage others to begin playing with this software, so I’ve actually begun building some of these pieces. Currently, I have Fedora/Camel/Solr/Blacklight installed and functional, but before I try to package it up, I feel like I should add an easy-to-use ingest system to get data in.


Responses to Digital Asset Management for Public Broadcasting: Solr (Part 2 of ??)

  1. Hello Chris,

    “Because DAM likely involves many different workflows, I lean towards the more generic solutions”

    A workflow engine like ruote is made for “many different workflows”, could you explain the “leaning” towards an ESB ?

    Best regards,

    John

  2. chris says:

    Hi John — I think ruote and Camel are both very viable and nearly interchangeable here (and the demonstrator in github makes ruote a very attractive option). The two “advantages” Camel has here (that aren’t technical, even) are very slight: (1) I learned a couple months ago of other institutions doing ESB-based workflows with institutional repositories so there’s probably an emerging community of practice with some interesting components to share, and (2) it was easier to get working under Tomcat. I’m also intrigued by the variety of components/participants Camel offers, and am slightly curious how much effort it’d take to convert a JCR to the Fedora API.

    The biggest advantage ruote has is dynamic process definitions (and, for integrating with some of the architecture in Fedora, dynamic process definitions described by XML), which I’m still trying to replicate under Camel. Finally, in my tinkering with ruote, I’m a little concerned I haven’t been doing “it” right, so working with something that strictly enforces “Enterprise Integration Patterns” might be a valuable perspective.

  3. Got it, many thanks for the detailed answer !

