<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Musings &#187; digital repositories</title>
	<atom:link href="http://cbeer.info/blog/tag/digital-repositories/feed/" rel="self" type="application/rss+xml" />
	<link>http://cbeer.info/blog</link>
	<description></description>
	<lastBuildDate>Sun, 05 Sep 2010 02:47:05 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1-alpha</generator>
		<item>
		<title>Fedora and Microservices</title>
		<link>http://cbeer.info/blog/2010/03/04/fedora-and-microservices/</link>
		<comments>http://cbeer.info/blog/2010/03/04/fedora-and-microservices/#comments</comments>
		<pubDate>Fri, 05 Mar 2010 00:38:49 +0000</pubDate>
		<dc:creator>chris</dc:creator>
				<category><![CDATA[Repository]]></category>
		<category><![CDATA[digital repositories]]></category>
		<category><![CDATA[fedora]]></category>
		<category><![CDATA[microservices]]></category>

		<guid isPermaLink="false">http://authoritativeopinion.com/blog/?p=300</guid>
		<description><![CDATA[In this post, I want to discuss repository architecture philosophies. Although I will focus primarily on Fedora and the California Digital Library microservices, there are some generalizations one can pull out of this. It would also be interesting to pull in &#8230; <a href="http://cbeer.info/blog/2010/03/04/fedora-and-microservices/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>In this post, I want to discuss repository architecture philosophies. Although I will focus primarily on Fedora and the California Digital Library microservices, there are some generalizations one can pull out of this. It would also be interesting to pull in some very different repository models, like iRODS or a triple-store-backed system, but that&#8217;s outside of my expertise.</p>
<h3>The basics</h3>
<p>This is not a section I really want to write, but I don&#8217;t know of a high-level answer to &#8220;when we say repository, this is what we mean&#8221;. I spent a little time looking around for a summary, but more often than not I found more questions (or, perhaps more useful yet inappropriate for my purposes, technology-based answers rather than use-driven ones), so I&#8217;ve taken a stab at addressing what I believe are some key issues:</p>
<p>Repositories are a collection of services, with well-defined interfaces, for storing and managing data (both content and metadata) in a format-neutral, display-independent manner. Repositories can be used as preservation repositories, as access repositories, as centralized aggregations of far-flung data, etc., and can operate at any scale for any audience. Furthermore, there are existing standards and agreements about what it means to be a certain type of repository (TDR, OAIS, etc.). All of these repositories, however, share some common services, whether implemented as software, external processes, or manual processes.</p>
<p>Some essential repository services are:</p>
<ul>
<li>Identifier services, which may include assignment + registration</li>
<li>Storage services (although the content stored may be only pointers to the &#8220;actual&#8221; content)</li>
<li>Content identification, matching identifiers to content items</li>
<li>Ingest workflows</li>
<li>Access mechanisms</li>
</ul>
<p>Without these services in place, a repository system would face some difficult obstacles in creating and providing value-added services. Repositories may provide multiple flavors of these services, some of which may be defined in generally accepted standards, models, and specifications.</p>
<p>Other basic services, which operate on top of the services above and are fairly common in well-developed repository frameworks, include:</p>
<ul>
<li>Dissemination services, to transform repository data into other forms + formats</li>
<li>Authorization services</li>
</ul>
<p>More advanced services may include:</p>
<ul>
<li>preservation services, including checksum (generation + verification), file format migration, support for models like LOCKSS</li>
<li>relationship services, using an RDF triplestore or similar, offering SPARQL endpoints, inferencing, etc.</li>
<li>discovery services, using Lucene/Solr/etc, to provide relevancy, optimized user experience, drill-down faceting</li>
</ul>
<p>These more advanced services are likely separate applications in the repository ecosystem and are generally useful utilities independent of any repository system. Repositories generally integrate with these external applications in a modular, mix-and-match manner using well-defined interfaces.</p>
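<p>As a rough illustration of the checksum generation and verification mentioned in the list above, here is a minimal Python sketch. The function names (file_checksum, verify_checksum) and the choice of SHA-256 are assumptions made for this example; they do not come from any particular repository system.</p>
<pre name="code" class="python">import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of the file at path, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel stops at the empty bytes object returned at EOF
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected, algorithm="sha256"):
    """Return True when the file still matches the checksum recorded earlier."""
    return file_checksum(path, algorithm) == expected.lower()
</pre>
<p>A preservation service would typically record the digest at ingest and re-run the verification on a schedule. Because the functions only need a file path, they can run against the file system without going through any repository API.</p>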
<h3>Fedora</h3>
<p>One approach to repository services is the &#8220;repository-in-a-box&#8221; model, where you can install and configure a base set of services provided by a single application. Within this group of services, Fedora provides a very basic implementation of the core repository services (vs a full-stack application like DSpace, which provides production-ready user interfaces). Fedora bills itself as a Flexible, Extensible Digital Object Repository Architecture.</p>
<ul>
<li>Identifier services, through PIDGen which provides sequential identifiers per-namespace</li>
<li>maps HTTP URIs to dereferenceable URIs to files</li>
<li>REST + SOAP APIs for Ingest + Delivery</li>
<li>Dissemination services using WSDL</li>
<li>Authorization using XACML (and authentication using a number of plugins)</li>
<li>Integrates with the Mulgara triplestore and a Lucene index (by default)</li>
</ul>
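<p>The per-namespace sequential identifiers in the first item above can be pictured with a small Python sketch. This is not Fedora&#8217;s PIDGen code; the class name SequentialPidGenerator and the in-memory counters are hypothetical, and a real service would need to persist its counters.</p>
<pre name="code" class="python">from collections import defaultdict
from itertools import count


class SequentialPidGenerator:
    """Hypothetical stand-in for per-namespace sequential identifiers."""

    def __init__(self):
        # One independent counter per namespace, each starting at 1.
        self._counters = defaultdict(lambda: count(1))

    def next_pid(self, namespace):
        """Return the next identifier in the form namespace:number."""
        return "{}:{}".format(namespace, next(self._counters[namespace]))
</pre>
<p>Calling next_pid("demo") twice gives demo:1 and then demo:2, while a first call with another namespace starts again at 1.</p>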
<p>Fedora provides many opportunities for customization and enhancement through custom development:</p>
<ul>
<li>the Fedora REST, SOAP, and triple-store APIs allow developers to build  on top of low-level services, which may include access interfaces, administrative interfaces, or otherwise</li>
<li>the Fedora application provides Java Messaging Services (JMS) events when objects within the repository are created, deleted, or modified, and developers can build applications that listen to these events  and trigger actions (Shelver &lt;<a href="http://yourmediashelf.com/blog/2010/03/01/blacklight-activefedora-and-shelver-interplay-between-searching-managing-and-indexing-in-a-repository-solution/">http://yourmediashelf.com/blog/2010/03/01/blacklight-activefedora-and-shelver-interplay-between-searching-managing-and-indexing-in-a-repository-solution/</a>&gt;, fedora-workflow &lt;<a href="http://github.com/cbeer/fedora-workflow">http://github.com/cbeer/fedora-workflow</a>&gt;, etc)</li>
<li>the Fedora application is built modularly, and Java developers are able to develop and use components as needed, if they conform to the Fedora interfaces</li>
</ul>
<p>As services go beyond the basic, common applications present in institutional repositories, enhanced repository services require custom development or supplemental services outside of the repository services. For most, this includes integration with a more advanced search provider (like Solr). At some point,  additional services can blur the lines between the repository services and front-end user interfaces (which have to respond to local customization to meet user needs).</p>
<p>Repository-independent services, or third-party services, require some wrapper to make them interoperable with the Fedora APIs, which makes integration with existing technology more difficult. Even Duraspace&#8217;s Duracloud offering is (currently) built as separate services with some possibility of storage-level integration. Preservation support services will bypass the repository APIs and provide those services against the file system instead.</p>
<p>Considering the services Fedora doesn&#8217;t provide or the obstacles Fedora creates in integration, many ask why they should start using Fedora anyway. The strongest response to this, I believe, is that it provides a common structure to basic repository services, while at the same time not creating major obstacles to future expansion or migration outside Fedora. Out of the box, Fedora provides a set of &#8220;training wheels&#8221; (ht Mike Giarlo &lt;<a href="http://lackoftalent.org/michael/blog/">http://lackoftalent.org/michael/blog/</a>&gt;) for repository services development that can be removed when unnecessary, but in the meantime offers structure for the creation of new repositories and support for repository services as needed.</p>
<h3>CDL Microservices</h3>
<p>Another approach to repository services is &#8220;microservices&#8221; like those designed by the California Digital Library (CDL). These provide standards and specifications for individual repository services, which form a structure for standardized, mix-and-match repository services that can integrate, interoperate, and take advantage of existing technology independent of a repository application like Fedora. This, conceivably, allows all domain developers to take advantage of these common projects without using a specific technology. CDL provides microservices specifications for:</p>
<ul>
<li>identifier assignment + registration, using NOID, which can act as a CLI tool or a CGI service</li>
<li>file-system structures, using the Pairtree convention</li>
<li>data exchange and verification, using BagIt</li>
<li>access standards, using the ARK URL format</li>
</ul>
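<p>To make the Pairtree convention in the list above more concrete, here is a minimal Python sketch of how an identifier becomes a directory path, based on my reading of the Pairtree specification. The function names are assumptions made for this example, and the sketch covers only the identifier-to-path mapping, not the object directory that sits at the end of the path.</p>
<pre name="code" class="python"># Characters the Pairtree spec hex-encodes: double quote, star, plus, comma,
# less-than, equals, greater-than, question mark, backslash, caret, pipe.
HEX_ENCODED = set("\x22\x2a\x2b\x2c\x3c\x3d\x3e\x3f\x5c\x5e\x7c")
# Single-character substitutions applied to the remaining characters.
SUBSTITUTIONS = {"/": "=", ":": "+", ".": ","}
VISIBLE_ASCII = range(0x21, 0x7F)


def clean_identifier(identifier):
    """Map an identifier to a string that is safe to use in file paths."""
    cleaned = []
    for byte in identifier.encode("utf-8"):
        char = chr(byte)
        if byte not in VISIBLE_ASCII or char in HEX_ENCODED:
            # Caret followed by two lowercase hex digits
            cleaned.append("^{:02x}".format(byte))
        else:
            cleaned.append(SUBSTITUTIONS.get(char, char))
    return "".join(cleaned)


def pairtree_path(identifier):
    """Split the cleaned identifier into two-character directory names."""
    cleaned = clean_identifier(identifier)
    pairs = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    return "/".join(pairs)
</pre>
<p>For example, pairtree_path("ark:/13030/xt12t3") returns ar/k+/=1/30/30/=x/t1/2t/3, so an ARK identifier maps to a predictable location that any tool can find with nothing more than the file system.</p>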
<p>The standards are developed in line with the &#8220;UNIX philosophy&#8221;:</p>
<blockquote><p>  Write programs that do one thing and do it well. Write programs to work together.  &#8212; Doug McIlroy
</p></blockquote>
<p>These basic services can be organized and crafted using the existing capabilities in web servers, file systems, etc. More advanced services can act within this structure, using individual standards when needed. While significant development and customization may be required to get a microservices architecture to a usable state, the end result is more flexible and targeted to an institution&#8217;s needs.</p>
<h3>Flexing Fedora</h3>
<p>These two approaches are certainly not incompatible, and Fedora is quite capable of using some of these micro-services standards under the hood (replacing custom developed approaches to these basic services). By taking this approach, Fedora could act as a management application on top of generic repository data, allow both Fedora-based and microservices-based services to operate on the data, and make it easier to reach around Fedora when necessary (or, go so far as to remove it entirely).</p>
<p>What follows is a short summary of on-going work in this area, which mostly focuses on removing the Fedora-centric definitions of /how/ or /where/ services act. The majority of these ideas build on new developments and best practices that have emerged in the repository community since Fedora was initially created, as a result of increased adoption or awareness of issues. Where available, I&#8217;ve included links to projects in-the-works.</p>
<p>Some of this work is quite easy to do:</p>
<ul>
<li>integration of NOID identifier services by creating a web-services consumer for Fedora identifier assignment &lt;<a href="http://gist.github.com/273584">http://gist.github.com/273584</a>&gt;</li>
<li>replacing the custom, timestamp-hash file store with a Pairtree structure (the prototype is limited, however, by Fedora&#8217;s hard-coded distinction between object and datastream filestores) &lt;<a href="http://gist.github.com/280020">http://gist.github.com/280020</a>&gt;</li>
<li>using Memento HTTP headers to provide versioning &lt;<a href="http://www.fedora-commons.org/jira/browse/FCREPO-604">http://www.fedora-commons.org/jira/browse/FCREPO-604</a>&gt;</li>
</ul>
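<p>The Memento idea in the last item above amounts to letting a client ask for a resource as it existed at a given time, using an HTTP date in a request header. A minimal Python sketch of the selection step follows. It assumes an at-or-before rule, which is one possible policy rather than something the linked ticket specifies, and the function name select_version is hypothetical.</p>
<pre name="code" class="python">from bisect import bisect_right
from email.utils import parsedate_to_datetime


def select_version(version_times, accept_datetime):
    """Pick the latest version created at or before the requested time.

    version_times is a sorted list of timezone-aware datetimes;
    accept_datetime is the raw header value, an HTTP date such as
    "Fri, 05 Mar 2010 00:38:49 GMT".
    """
    requested = parsedate_to_datetime(accept_datetime)
    index = bisect_right(version_times, requested)
    # No version existed yet at the requested time.
    if index == 0:
        return None
    return version_times[index - 1]
</pre>
<p>The server would then return the datastream version that matches the selected timestamp, or an error response when the function returns None.</p>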
<p>Other projects are more involved and require more work than just creating new modules for Fedora:</p>
<ul>
<li>BagIt and SWORD ingest and dissemination options to replace the custom Atom structure &lt;<a href="http://fedora-commons.org/confluence/display/FCSVCS/SWORD-Fedora+1.2">http://fedora-commons.org/confluence/display/FCSVCS/SWORD-Fedora+1.2</a>&gt;</li>
<li>Integration of arbitrary ingest of structured data (perhaps similar to CDL&#8217;s 7train &lt;<a href="http://seventrain.sourceforge.net/">http://seventrain.sourceforge.net/</a>&gt;?)</li>
<li>Pluggable authn/authz through the FESL project; JAAS should provide a pluggable authentication backend &lt;<a href="http://www.fedora-commons.org/confluence/display/DEV/Fedora+Enhanced+Security+Layer">http://www.fedora-commons.org/confluence/display/DEV/Fedora+Enhanced+Security+Layer</a>&gt;</li>
<li>support for arbitrary RDF metadata, forget RELS-EXT/RELS-INT &#8212; force that kind of decision into a disseminator and use a seamless API to pull back RDF triples (/object/{pid}/relationships) &lt;<a href="http://www.fedora-commons.org/confluence/display/DEV/Supporting+the+Semantic+Web+and+Linked+Data">http://www.fedora-commons.org/confluence/display/DEV/Supporting+the+Semantic+Web+and+Linked+Data</a>&gt;</li>
</ul>
<p>More advanced microservices integration is highly involved and would require a major re-work of the application:</p>
<ul>
<li>Two-way messaging queues (or file alteration monitors, or database update hooks) to allow Fedora to receive updates</li>
<li>decreased reliance on self-generated registries; I think the situation is getting better, but I&#8217;m not sure it&#8217;s fully there yet</li>
<li>pluggable storage modules with intelligent filtering, routing, multiplexing, and rules mechanisms &#8212; the Akubra project may be doing (part of?) this &lt;<a href="http://www.fedora-commons.org/confluence/display/AKUBRA/Akubra+Project">http://www.fedora-commons.org/confluence/display/AKUBRA/Akubra+Project</a>&gt;</li>
<li>workflow support hooks, to allow integration and automation of workflow tools  (possibly a result of Hydra?)<br/>
</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://cbeer.info/blog/2010/03/04/fedora-and-microservices/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Federated/distributed digital repositories</title>
		<link>http://cbeer.info/blog/2008/10/26/federateddistributed-digital-repositories/</link>
		<comments>http://cbeer.info/blog/2008/10/26/federateddistributed-digital-repositories/#comments</comments>
		<pubDate>Sun, 26 Oct 2008 15:42:01 +0000</pubDate>
		<dc:creator>chris</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[bvault]]></category>
		<category><![CDATA[digital repositories]]></category>
		<category><![CDATA[federated repositories]]></category>
		<category><![CDATA[fedora]]></category>

		<guid isPermaLink="false">http://192.168.2.101/wordpress/?p=3</guid>
		<description><![CDATA[For the bVault project I am developing, one of our secondary goals is to create a replicable model for other digital media repositories. One of the ways we are pursuing this is to lay the foundations for an interface to &#8230; <a href="http://cbeer.info/blog/2008/10/26/federateddistributed-digital-repositories/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>For the <a href="http://launchpad.net/bvault">bVault</a> project I am developing, one of our secondary goals is to create a replicable model for other digital media repositories. One of the ways we are pursuing this is to lay the foundations for an interface to a federated/distributed repository among other public broadcasters, which takes advantage of one of the architectural features of public broadcasting in the US: the public broadcasting network is really a federation of individual stations that subscribe and contribute to a particular programming distribution service (PBS and NPR among others).</p>
<p>A federated repository ultimately needs three things:</p>
<ol>
<li>A common API among the participating repositories,</li>
<li>A search index that covers all the repositories, and</li>
<li>A resolver to translate a search result back to the originating repository</li>
</ol>
<h3>Common API</h3>
<p>For bVault, the common API is the set of web services exposed by Fedora, and the metadata translation dissemination service behind that, which allows a client to receive a particular metadata format, regardless of the underlying schema. This is an important feature, because it allows individual repositories to use whichever metadata format is most natural to their needs, while seamlessly generating interoperable metadata.</p>
<h3>Search index</h3>
<p>The exact methods employed to generate a spanning search index are essentially arbitrary. Solr provides some <a href="http://wiki.apache.org/solr/DistributedSearch">distributed/sharded</a> search capabilities, but the index could also operate on a pub/sub model where repositories push content out to a master search index, or with a search-engine-like crawler using <a href="http://openarchives.org">OAI-PMH endpoints</a> for the repository. Because the search index is loosely coupled to the whole system, it ultimately is an architectural decision rather than a technical one.</p>
<h3>Distributed Resolver</h3>
<p>Now that we have a way to discover items within a repository, the interface needs a way to extract the content from the origin. For this, we need a way to resolve a unique resource identifier (URI!) back to its source. Again, the method is somewhat arbitrary, but for this project, we elected to require unique namespaces for each repository (quite reasonable, considering the application).</p>
<p>To do this, I&#8217;ve slipped a namespace resolver into the client&#8217;s API call to allow the interface to act independently from the source of the content. For a simple API call, like listDatastreams, we have:</p>
<pre name="code" class="php">public function listDatastreams($pid, $asOfDateTime = null) {
      return Fedora_Repository::get('API-A', $pid)-&gt;listDatastreams(array('pid' =&gt; $pid,
                    'asOfDateTime' =&gt; $asOfDateTime));
}
</pre>
<p>This requests the API-A binding appropriate to the current persistent identifier (pid):</p>
<pre name="code" class="php">/**
  * Retrieves a SOAP client for a Fedora repository that can provide the
  * $type endpoint for the PID/prefix $prefix
  *
  * @param string $type
  * @param string $prefix
  * @return SoapClient|false
  */
static public function get($type, $prefix = '') {
     global $objManager;

     // All repositories (mirrors) registered for this namespace
     $arrRepository = $objManager-&gt;resolve($prefix);

     // Try the candidates in random order; this also covers the cases of
     // a single repository and of no repository at all
     shuffle($arrRepository);

     foreach($arrRepository as $objRepository) {
           $objClient = $objRepository-&gt;getSoapClient($type);
           if($objClient instanceof SoapClient) {
                 return $objClient;
           }
     }

     // No repository could provide the requested endpoint
     return false;
}
</pre>
</p>
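<p>The resolver behind the lookup is not shown in this post. As a rough idea of the whole path from a pid to a client, here is a minimal Python sketch rather than the PHP of the library. The NamespaceResolver class, its register method, and the get_client method expected on each repository object are all hypothetical names chosen for this example.</p>
<pre name="code" class="python">import random


class NamespaceResolver:
    """Hypothetical registry mapping a pid namespace to its repositories."""

    def __init__(self):
        self._repositories = {}

    def register(self, namespace, repository):
        """Add a repository (or mirror) that serves the given namespace."""
        self._repositories.setdefault(namespace, []).append(repository)

    def resolve(self, pid):
        """Return a copy of the repositories for the namespace before the colon."""
        namespace = pid.partition(":")[0]
        return list(self._repositories.get(namespace, []))


def get_client(repositories, endpoint_type):
    """Try the repositories in random order and return the first client found.

    Each repository is assumed to expose get_client(endpoint_type) and to
    return None when it cannot provide that endpoint.
    """
    candidates = list(repositories)
    random.shuffle(candidates)
    for repository in candidates:
        client = repository.get_client(endpoint_type)
        if client is not None:
            return client
    return None
</pre>
<p>Registering two mirrors under the same namespace gives simple load spreading and failover: if one mirror cannot supply the endpoint, the loop moves on to the next, and the caller only sees None when every mirror has failed.</p>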
<p>Creating a distributed repository doesn&#8217;t cost much now, and if you design it right, you can benefit from the potential for redundancy and mirroring immediately, even before there is a federated network to tap into.
</p>
<p>The full source is available from the <a href="http://bazaar.launchpad.net/%7Echris-beer/bvault/wgbh/files/9?file_id=fedora-20080924210122-b7675owtu9oq690p-28">bVault Fedora PHP library</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://cbeer.info/blog/2008/10/26/federateddistributed-digital-repositories/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
