<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Musings &#187; solr</title>
	<atom:link href="http://cbeer.info/blog/tag/solr/feed/" rel="self" type="application/rss+xml" />
	<link>http://cbeer.info/blog</link>
	<description></description>
	<lastBuildDate>Thu, 09 Sep 2010 22:14:06 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1-alpha</generator>
		<item>
		<title>Digital Asset Management for Public Broadcasting: Solr (Part 2 of ??)</title>
		<link>http://cbeer.info/blog/2010/05/08/digital-asset-management-for-public-broadcasting-solr-part-2-of/</link>
		<comments>http://cbeer.info/blog/2010/05/08/digital-asset-management-for-public-broadcasting-solr-part-2-of/#comments</comments>
		<pubDate>Sat, 08 May 2010 14:29:34 +0000</pubDate>
		<dc:creator>chris</dc:creator>
				<category><![CDATA[Repository]]></category>
		<category><![CDATA[TODO]]></category>
		<category><![CDATA[digital asset management]]></category>
		<category><![CDATA[solr]]></category>

		<guid isPermaLink="false">http://authoritativeopinion.com/blog/?p=335</guid>
		<description><![CDATA[The Lucene-based Apache Solr is an incredible platform for building decent search experiences, especially compared to the &#8220;more traditional&#8221; database-driven approach, where queries need so many SQL JOINs that it becomes difficult to efficiently add search features like stemming, ASCII-folding, term &#8230; <a href="http://cbeer.info/blog/2010/05/08/digital-asset-management-for-public-broadcasting-solr-part-2-of/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>The Lucene-based <a href="http://lucene.apache.org/solr">Apache Solr</a> is an incredible platform for building decent search experiences, especially compared to the &#8220;more traditional&#8221; database-driven approach, where queries need so many SQL JOINs that it becomes difficult to efficiently add search features like stemming, ASCII-folding, term highlighting, facets, and synonyms. I would argue these are essential parts of the discovery experience, and with Solr you essentially get them for free. Another benefit of Solr is that it provides a foundation for many lightweight interfaces on top of a single index (or across multiple indexes, because Solr enforces some decent scalability principles that make expanding to task-based indexes easier).</p>
<p>For a DAM project, each asset should appear in the search index with the basic layer of contributed metadata, its relationships, metadata extracted from the asset itself, and the administrative metadata managed by Fedora. I would align the fields with the Dublin Core (and DCTerms) elements, which is probably all you can get users to contribute in any case. Because legacy systems lack authority control, linked data, and the like, existing metadata is sparse, inaccurate, or limited. That sets the entry-level bar pretty low, so the priorities are ease of use and metadata collection. Eliding a lot of detail, here&#8217;s the skeleton schema:</p>
<pre name="code" class="xml">
  &lt;!-- Core fields --&gt;
  &lt;field name="id" type="string" indexed="true" stored="true" required="true"/&gt;
  &lt;field name="title" type="string" indexed="true" stored="true" multiValued="true"/&gt;
  &lt;field name="description" type="string" indexed="true" stored="true"/&gt;

  &lt;!-- Catch-all dynamic fields for Dublin Core, DCTerms, and RDF properties --&gt;
  &lt;dynamicField name="dc.*" type="string" indexed="true" stored="true" multiValued="true"/&gt;
  &lt;dynamicField name="dcterms.*" type="string" indexed="true" stored="true" multiValued="true"/&gt;
  &lt;dynamicField name="rdf.*" type="string" indexed="true" stored="true" multiValued="true"/&gt;

  &lt;!-- Aggregate search field, payloads, and index timestamp --&gt;
  &lt;field name="text" type="text" indexed="true" stored="false" multiValued="true"/&gt;
  &lt;field name="payloads" type="payloads" indexed="true" stored="true"/&gt;
  &lt;field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/&gt;

  &lt;!-- Copies into the _t fields and the aggregate text field --&gt;
  &lt;copyField source="title" dest="title_t"/&gt;
  &lt;copyField source="subject" dest="dc.subject"/&gt;
  &lt;copyField source="description" dest="description_t"/&gt;
  &lt;copyField source="comments" dest="text"/&gt;
  &lt;copyField source="dc.creator" dest="author"/&gt;
  &lt;copyField source="dc.*" dest="text"/&gt;
  &lt;copyField source="text" dest="text_rev"/&gt;
  &lt;copyField source="payloads" dest="text"/&gt;

  &lt;!-- Copies of individual Dublin Core elements --&gt;
  &lt;copyField source="dc.title" dest="dc.title_t"/&gt;
  &lt;copyField source="dc.description" dest="dc.description_t"/&gt;
  &lt;copyField source="dc.coverage" dest="dc.coverage_t"/&gt;
  &lt;copyField source="dc.contributor" dest="dc.contributor_t"/&gt;
  &lt;copyField source="dc.subject" dest="dc.subject_t"/&gt;
  &lt;copyField source="dc.contributor" dest="names_t"/&gt;
  &lt;copyField source="dc.coverage" dest="names_t"/&gt;
</pre>
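<p>Each copyField destination in the skeleton has to resolve to a declared field or dynamic field. Here is a minimal sketch of what some of the elided definitions might look like, assuming a catch-all *_t dynamic field and reusing the text field type already referenced above. The field names come from the copyField rules, but the types and attributes are illustrative guesses, not the actual schema:</p>
<pre name="code" class="xml">
  &lt;!-- Assumed: tokenized copies such as title_t, description_t, and names_t --&gt;
  &lt;dynamicField name="*_t" type="text" indexed="true" stored="true" multiValued="true"/&gt;

  &lt;!-- Assumed: plain fields named as sources or destinations by the copyField rules --&gt;
  &lt;field name="subject" type="string" indexed="true" stored="true" multiValued="true"/&gt;
  &lt;field name="comments" type="text" indexed="true" stored="true" multiValued="true"/&gt;
  &lt;field name="author" type="string" indexed="true" stored="true" multiValued="true"/&gt;
  &lt;field name="text_rev" type="text" indexed="true" stored="false" multiValued="true"/&gt;
</pre>
<p>Names such as dc.title_t match both the dc.* and *_t patterns, so it is worth checking in the Solr admin schema browser which type they actually end up with.</p>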
<p>The new <a href="https://issues.apache.org/jira/browse/SOLR-1553">edismax query parser</a> provides such a good balance of flexibility, advanced query features, and ease of use that it seems like an obvious choice here.</p>
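<p>As a rough sketch, and assuming a Solr build that includes the parser, edismax can be made the default for a request handler in solrconfig.xml. The handler name, the query fields, and the boosts below are assumptions for illustration, built on the field names in the skeleton schema:</p>
<pre name="code" class="xml">
  &lt;!-- Hypothetical handler name; the qf/pf fields and boosts are illustrative --&gt;
  &lt;requestHandler name="/search" class="solr.SearchHandler"&gt;
    &lt;lst name="defaults"&gt;
      &lt;!-- Use the extended dismax parser for the main query --&gt;
      &lt;str name="defType"&gt;edismax&lt;/str&gt;
      &lt;!-- Fields to search, with title matches weighted highest --&gt;
      &lt;str name="qf"&gt;title_t^4 description_t^2 names_t text&lt;/str&gt;
      &lt;!-- Extra boost when the whole query appears as a phrase in the title --&gt;
      &lt;str name="pf"&gt;title_t^8&lt;/str&gt;
      &lt;int name="rows"&gt;10&lt;/int&gt;
    &lt;/lst&gt;
  &lt;/requestHandler&gt;
</pre>
<p>With defaults like these, a front-end only needs to send the user&#8217;s raw query string in the q parameter, and individual interfaces can still override qf per request.</p>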
<p>The only penalty you pay by using Solr is having to keep the Solr index synchronized with your data sources. For synchronizing data from Fedora, there is now a proliferation of options, ranging from task-specific Java plugins like <a href="http://www.fedora-commons.org/confluence/display/FCSVCS/Generic+Search+Service+2.2">GSearch</a> and <a href="http://github.com/mediashelf/shelver">Shelver</a> to the more generic (ESBs and all that) like <a href="http://camel.apache.org/">Apache Camel</a> or the Ruote-based <a href="http://github.com/cbeer/fedora-workflow">Fedora Workflow</a> component. Because DAM likely involves many different workflows, I lean towards the more generic solutions. Lately, I&#8217;ve given Camel a try, and after a couple of days of Java-dependency-induced head pounding, I have something that works.</p>
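<p>To give a sense of the shape of a Camel-based approach, here is a minimal Spring XML route sketch. It is an illustration under several assumptions rather than the configuration described above: Fedora&#8217;s JMS messaging is enabled and publishing to the fedora.apim.update topic, an ActiveMQ component is configured in the Camel 2.x context, Solr is listening on localhost:8983, and a stylesheet (the hypothetical fedora-to-solr.xsl) turns each message into a Solr add document.</p>
<pre name="code" class="xml">
  &lt;camelContext xmlns="http://camel.apache.org/schema/spring"&gt;
    &lt;route&gt;
      &lt;!-- Assumed topic name: Fedora API-M update notifications --&gt;
      &lt;from uri="activemq:topic:fedora.apim.update"/&gt;
      &lt;!-- Hypothetical stylesheet that builds a Solr &lt;add&gt; document --&gt;
      &lt;to uri="xslt:fedora-to-solr.xsl"/&gt;
      &lt;!-- Solr's XML update handler expects an XML content type --&gt;
      &lt;setHeader headerName="Content-Type"&gt;
        &lt;constant&gt;text/xml&lt;/constant&gt;
      &lt;/setHeader&gt;
      &lt;!-- POST the transformed document to the assumed Solr update URL --&gt;
      &lt;to uri="http://localhost:8983/solr/update"/&gt;
    &lt;/route&gt;
  &lt;/camelContext&gt;
</pre>
<p>A real route would likely need an extra step to fetch the full object from Fedora, since the update message itself carries little more than the object identifier and the method that was called. The sketch also relies on Solr&#8217;s autoCommit settings rather than committing after every message.</p>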
<p>&#8212;</p>
<p>On Twitter, <a href="http://twitter.com/johntynan/status/13400294844">John Tynan requested</a> a virtual machine image to encourage others to begin playing with this software, so I&#8217;ve actually begun building some of these pieces. Currently, I have Fedora/Camel/Solr/Blacklight installed and functional, but before I try to package it up, I feel like I should add an easy-to-use ingest system to get data in.</p>
]]></content:encoded>
			<wfw:commentRss>http://cbeer.info/blog/2010/05/08/digital-asset-management-for-public-broadcasting-solr-part-2-of/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>NPR API + Solr = ?</title>
		<link>http://cbeer.info/blog/2010/03/19/npr-api-solr/</link>
		<comments>http://cbeer.info/blog/2010/03/19/npr-api-solr/#comments</comments>
		<pubDate>Sat, 20 Mar 2010 00:55:19 +0000</pubDate>
		<dc:creator>chris</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[api]]></category>
		<category><![CDATA[npr]]></category>
		<category><![CDATA[solr]]></category>

		<guid isPermaLink="false">http://authoritativeopinion.com/blog/?p=320</guid>
		<description><![CDATA[Adapted from an email to the pubforge list. Solr is a great application, and its out-of-the-box features still amaze me. With the newer versions, it’s incredibly easy to hook Solr up to any data source (using the Solr Data Import &#8230; <a href="http://cbeer.info/blog/2010/03/19/npr-api-solr/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><i>Adapted from an email to the pubforge list.</i></p>
<p><a href="http://lucene.apache.org/solr">Solr</a> is a great application, and its out-of-the-box features still amaze me. With the newer versions, it’s incredibly easy to hook Solr up to any data source (using the Solr <a href="http://wiki.apache.org/solr/DataImportHandler">Data Import Handler</a>) and just let it do its thing.</p>
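<p>For readers who have not seen the Data Import Handler, here is a minimal data-config.xml sketch for pulling an XML feed over HTTP. The NPR API URL, its query parameters, and the element paths in the response are assumptions for illustration (check them against the API documentation), and YOUR_API_KEY is a placeholder:</p>
<pre name="code" class="xml">
  &lt;dataConfig&gt;
    &lt;!-- Fetch XML over HTTP --&gt;
    &lt;dataSource type="URLDataSource" encoding="UTF-8"/&gt;
    &lt;document&gt;
      &lt;!-- One Solr document per story element; URL and paths are assumed --&gt;
      &lt;entity name="story"
              processor="XPathEntityProcessor"
              url="http://api.npr.org/query?id=1001&amp;amp;numResults=20&amp;amp;apiKey=YOUR_API_KEY"
              forEach="/nprml/list/story"&gt;
        &lt;field column="id" xpath="/nprml/list/story/@id"/&gt;
        &lt;field column="title" xpath="/nprml/list/story/title"/&gt;
        &lt;field column="description" xpath="/nprml/list/story/teaser"/&gt;
      &lt;/entity&gt;
    &lt;/document&gt;
  &lt;/dataConfig&gt;
</pre>
<p>The handler itself also has to be registered in solrconfig.xml and pointed at this file. Once that is done, requesting /dataimport?command=full-import on the Solr core starts an import.</p>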
<p>I don’t have any thoughts about communication, but one of the tenets of the code4lib community is “less talk, more code”. Public media spends a lot of time planning collaborations or trying to find funding (or worse, talking about doing those things) instead of actually doing the work. I&#8217;d love to see more prototyping, iterative development, and open sharing and discussion about what new and interesting services we can provide.</p>
<p>In an earlier post to the list, John Tynan suggested the potential of providing a &#8220;More Like This&#8221; service for NPR News data, and in the interest of just getting something out there, I spent a little bit of time hooking everything together. To give it a pretty front-end, I also hacked in an <a href="http://github.com/evolvingweb/ajax-solr">AJAX Solr</a> interface.</p>
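<p>Solr ships with a MoreLikeThis handler that covers this kind of service. A minimal solrconfig.xml sketch follows. The handler path, the field list, and the thresholds are assumptions for illustration, not the settings of the demonstrator:</p>
<pre name="code" class="xml">
  &lt;!-- Hypothetical handler path --&gt;
  &lt;requestHandler name="/mlt" class="solr.MoreLikeThisHandler"&gt;
    &lt;lst name="defaults"&gt;
      &lt;!-- Fields used to find similar documents; assumed field names --&gt;
      &lt;str name="mlt.fl"&gt;title,description&lt;/str&gt;
      &lt;!-- Ignore terms occurring fewer than this many times in the source document --&gt;
      &lt;int name="mlt.mintf"&gt;1&lt;/int&gt;
      &lt;!-- Ignore terms occurring in fewer than this many documents --&gt;
      &lt;int name="mlt.mindf"&gt;2&lt;/int&gt;
      &lt;!-- Number of similar documents to return --&gt;
      &lt;int name="rows"&gt;5&lt;/int&gt;
    &lt;/lst&gt;
  &lt;/requestHandler&gt;
</pre>
<p>A request such as /mlt?q=id:12345 (with a made-up story id) would then return the stories most similar to that one. The fields listed in mlt.fl need to be stored or have term vectors enabled for this to work.</p>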
<p>The <a href="http://cbeer.info/~chris/npr-solr/npr.html">NPR/Solr demonstrator</a> uses this <a href="http://publicmediatech.com:8983/solr/select">Solr endpoint</a>. I&#8217;ve locked down the indexes, but left everything else open so you can see how the pieces fit together. If there is enough interest in this application, I would be willing to develop it further if you provide ideas, use cases, etc. in the comments.</p>
<p>The source code is available from the GitHub project <a href="http://github.com/cbeer/npr-solr">npr-solr</a>.</p>
<p>None of this took very long to develop; the most time-consuming part was importing from the paginated NPR API (with its absurdly low maximum of 20 records per request).</p>
]]></content:encoded>
			<wfw:commentRss>http://cbeer.info/blog/2010/03/19/npr-api-solr/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>
