Highlights from public television websites

While creating the comprehensive gallery of station websites (which I wrote about in my last post), I wanted to encourage a conversation about what makes a good public media website (cf. this XKCD comic about university websites). In this first analysis post (which is highly subjective and incomplete), I will draw out some highlights from a quick pass through public television home pages. A radio run-down will come later; the television corpus is significantly smaller and easier to digest.

Posted in Public Media | Leave a comment

A gallery of public media organization websites

Last night’s topic for the #pubmedia chat on Twitter was station websites. Because I happen to have a list of public radio station metadata (gathered from both NPR’s station finder API and the PTFP Public Radio Coverage 2004 report, supplemented with FCC and Arbitron data), I thought it’d be interesting to quickly toss the results into an image gallery to see and compare the different sites.

In about 40 minutes, I had a very basic, very ugly gallery up, and today I’ve relaunched it as a Ruby on Rails application still in development at http://stations.publicmediatech.com/. Because of the peculiarities of the data sources, there is some repetition of station websites (the data from PTFP is transmitter-based, so translators, state-wide networks, and other entities count as unique organizations) as well as a very broad definition of public media (including college, community and low power radio). The dataset is also missing a large swath of television-only broadcasters, although I hope to get a dataset shortly.

My ideas for the future of the interface revolve mainly around making the sites more discoverable, including:
- Ingesting metadata into Solr, to support more powerful searching and faceting (implemented 9/4, using ruby-sunspot)
- Crawling the websites to extract full-text content and some metadata (e.g., last-modified times, platforms and frameworks) to support searching (first phase implemented 9/4)
- User-generated content (tags, comments and ratings) to help organize the information (implemented 8/31)

I took the screenshots using the nifty webkit2png Python script (`for i in $(cat "$f"); do python webkit2png-0.5.py -D thumbs -d -s 0.5 "$i"; done`). The code for the interface is available through my GitHub account; feel free to fork it and add features!
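
For anyone who would rather drive the same loop from Python, here is a rough sketch. The script name and flags are copied from the one-liner above; the function names and the injectable `runner` argument are my own illustrative choices, not part of webkit2png.

```python
import subprocess
from typing import Callable, Iterable, List

# Script name and flags as used in the shell one-liner above.
WEBKIT2PNG = ["python", "webkit2png-0.5.py", "-D", "thumbs", "-d", "-s", "0.5"]


def build_commands(urls: Iterable[str]) -> List[List[str]]:
    """Return one argv list per non-blank URL (hypothetical helper)."""
    return [WEBKIT2PNG + [url.strip()] for url in urls if url.strip()]


def capture_all(urls: Iterable[str],
                runner: Callable[[List[str]], int] = subprocess.call) -> List[str]:
    """Run the screenshot script once per URL and return the URLs that failed."""
    failed = []
    for argv in build_commands(urls):
        # Passing a list (not a shell string) avoids word-splitting surprises.
        if runner(argv) != 0:
            failed.append(argv[-1])
    return failed
```

Something like `capture_all(open("stations.txt"))` would then process a file of URLs, one per line (the file name is only an example).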

Posted in Public Media | Leave a comment

From a shared blog to a personal site..

To ease the transition from the previous incarnation of this blog (a shared blog) to a more focused, personal blog, I used the WordPress Import/Export feature to transfer all of my own posts into the new WordPress instance. In order not to disrupt other contributors and leave the old history intact, I whipped up this quick plug-in to redirect requests for my posts to the new blog:

<?php
/*
Plugin Name: Author posts redirect
Plugin URI: http://cbeer.info/blog/2010/07/18/from-a-shared-blog-to-a-personal-site
Description: Redirect an author's posts to a new url..
Version: 0.0a
Author: Chris Beer
Author URI: http://cbeer.info
*/

// Run the check each time WordPress sets up a post for display.
add_action('the_post', 'redirect_author_post');

function redirect_author_post($post) {
  // Only redirect single-post views written by the migrated author (user ID 2).
  if ($post->post_author == 2 && is_single()) {
    // Swap the old domain for the new one in the post's permanent URL.
    $target = str_replace('http://authoritativeopinion.com/', 'http://cbeer.info/', $post->guid);
    header('HTTP/1.1 301 Moved Permanently');
    header('Location: ' . $target);
    die();
  }
}

Posted in Code | Leave a comment

Public Media Camp Boston

I have the great fortune to be involved in planning the Boston spin-off of Public Media Camp with a number of people from the Boston media community. It is an interesting process and has probably taught me more about media policy, institutional politics, and event planning than I’d ever want to know, but a couple of key things stand out:

  1. Distributed communication is a challenge, fraught with false starts and misunderstandings. While this is certainly a social problem, there isn’t much technology out there to help manage event planning sanely. Nothing beats a face-to-face meeting.
  2. Money is both the hard part and the easy part.
  3. Back-channels are essential.
Posted in Public Media | Leave a comment

Digital Asset Management for Public Broadcasting: An Update

In the last month, I had some great help turning the digital asset management prototype into a grant proposal for the NEH Preservation and Access Research and Development program, focusing on the needs of moving image digital asset management using existing open source tools.

Posted in Repository | Leave a comment

Digital Asset Management for Public Broadcasting: Interlude

Just a quick update on my progress developing a shareable prototype. The basic integration work is functional. I’ve ripped out the previously mentioned Camel workflow components in favor of ruote (which is so much easier to wrap my mind around; I’ve pushed the skeleton code for this out as a separate package called fedora-workflow), and I’ve started doing some very basic datastream display work.

After this work is complete, I think a first-round alpha will be ready to publish within the next couple of weeks.

Posted in Repository, TODO | Leave a comment

Digital Asset Management for Public Broadcasting: Blacklight (Part 3 of ??)

In the previous parts, I wrote about two “back-office” open source applications (and tangentially discussed a few others) that are well-established in their communities and can support a wide variety of repository services. While it may be philosophically important that these are open source applications, I would argue that openness matters even more in the next parts, where I want to talk about the services and applications built on top of the repository infrastructure. These benefit tremendously when anyone with a fairly broad skill-set can create and customize interfaces for specific use cases, to whatever extent is necessary.

Blacklight grew out of a next-generation library catalog interface, and while it still has very firm roots in the library world, it is also being used for archives, digital collections, and institutional repository interfaces. It is also an open source application, based on the Ruby on Rails framework.

Out of the box, it is a fairly generic interface to a Solr index (with a little sprinkling of optional MARC data) and some relatively benign application features (users, bookmarks, saved searches). Connecting it to our existing Solr index is fairly trivial, and just requires a few small configuration changes:

config[:index_fields] = {
  :field_names => [
    "dc.description",
    "dc.creator",
    "dc.publisher",
    "dc.subject",
    "dc.date",
    "dc.format"
  ],
  :labels => {
    "dc.description" => "Description:",
    "dc.creator"     => "Creator:",
    "dc.publisher"   => "Publisher:",
    "dc.subject"     => "Subject:",
    "dc.date"        => "Date:",
    "dc.format"      => "Format:"
  }
}

This gives you a very basic discovery interface into your collection.
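
To make the effect of that configuration concrete, here is a small Python sketch of what it expresses: pick the listed fields out of a Solr document, in order, and pair each with its label. This is emphatically not Blacklight’s own code; the `labelled_rows` helper and the sample document are invented for illustration.

```python
# Mirrors the Ruby configuration above; this is NOT how Blacklight is implemented.
INDEX_FIELDS = [
    "dc.description", "dc.creator", "dc.publisher",
    "dc.subject", "dc.date", "dc.format",
]
LABELS = {
    "dc.description": "Description:",
    "dc.creator": "Creator:",
    "dc.publisher": "Publisher:",
    "dc.subject": "Subject:",
    "dc.date": "Date:",
    "dc.format": "Format:",
}


def labelled_rows(solr_doc):
    """Return (label, value) pairs for the configured fields present in the doc."""
    rows = []
    for name in INDEX_FIELDS:
        values = solr_doc.get(name)
        if not values:
            continue  # fields missing from the document are simply not shown
        if isinstance(values, str):
            values = [values]  # tolerate single-valued fields
        rows.append((LABELS[name], "; ".join(values)))
    return rows


if __name__ == "__main__":
    sample = {"id": "demo:1", "dc.creator": ["WGBH"], "dc.subject": ["News", "Boston"]}
    for label, value in labelled_rows(sample):
        print(label, value)
```

Fields in the document that are not listed in the configuration (such as `id` here) are ignored, which is the behaviour I would expect from a search-results view.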

Extending Blacklight to work with Fedora is also easy: in less than 50 lines of code, I had full access to the Fedora web services APIs and SPARQL interface. Adding management interfaces was also simple using normal Ruby on Rails techniques; with less than 500 lines of code, a passable repository manager interface was available and I could import assets and metadata.

Adding a security layer on top of the repository content is also easy, thanks to the work the UPEI team put into the DrupalServletFilter, which allows Fedora to authenticate users against any SQL database. Because of this, we can use the XACML policy language built into Fedora to do record-level security (which, I confess, I don’t entirely understand; it is, however, an enormously powerful and expressive language if you like XML verbiage). For storing re-use rights, I am very intrigued by the Open Digital Rights Language, which can integrate with Fedora and Blacklight to express non-object-security rights (re-use, segmentation, etc) using my proof-of-concept ruby-odrl.

With these fundamentals in place (ingest services, security policies, and resource discovery), one can build more advanced services on top of the repository, like collections, batch and on-demand conversion/transcode services, and export/transfer services (one-click “export to PBS COVE”?). Because this can be done as Rails plug-ins, they are readily sharable outside of this single application and provide templates for others to continue to develop and extend similar services to evolving platforms.

Because setting up a Blacklight application is so painless, it would be easy for public broadcasting institutions to create custom-made (yet shareable) modules and views for specific purposes (news, productions, archiving, etc) that all share the same back-end infrastructure yet offer users an easy way to interact with their data in a way that makes sense for their work. As I mentioned in my Fedora article, you aren’t limited to data you control and have locally, but can bring in data from external sources (say, pulling in metadata from the NPR API or an RSS feed from a stock footage house) and present it both coherently and cohesively.

I’m looking for a good source of freely available test data, and I would rather not invest too much time building a corpus of archival assets if something suitable already exists. The biggest challenge is finding comprehensive metadata. The closest I’ve come is some podcast feeds from sources like Democracy Now!, but those don’t capture the breadth of materials I’d like to demonstrate.

Finally, a couple requisite screen-shots now that there is something visual to work with, using the default Blacklight theme with some quick interface hacks.

Posted in Repository, TODO | Leave a comment

Digital Asset Management for Public Broadcasting: Solr (Part 2 of ??)

The Lucene-based Apache Solr is an incredible platform for building decent search experiences, especially compared to the “more traditional” database-driven approach, where queries involve so many SQL JOINs that it becomes difficult to efficiently add search features like stemming, ASCII-folding, term highlighting, facets, and synonyms. I would argue these are essential parts of the discovery experience, and you essentially get them for free with Solr. Another benefit Solr provides is a foundation for many light-weight interfaces on top of a single index (or across multiple indexes, because Solr enforces some decent scalability principles that make expanding to task-based indexes easier).
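
As a rough illustration of how cheaply those features come, here is a Python sketch that builds a Solr select URL with faceting and highlighting switched on. The request parameters (`q`, `rows`, `wt`, `facet`, `facet.field`, `hl`, `hl.fl`) are standard Solr ones; the base URL, the field names, and the helper itself are assumptions made for the example.

```python
from urllib.parse import urlencode


def build_select_url(base_url, query, facet_fields=(), highlight_fields=(), rows=10):
    """Build a Solr /select URL with optional faceting and highlighting."""
    params = [("q", query), ("rows", str(rows)), ("wt", "json")]
    if facet_fields:
        params.append(("facet", "true"))
        # facet.field is repeated once per field to facet on
        params.extend(("facet.field", name) for name in facet_fields)
    if highlight_fields:
        params.append(("hl", "true"))
        params.append(("hl.fl", ",".join(highlight_fields)))
    return base_url.rstrip("/") + "/select?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical local Solr instance and field names.
    print(build_select_url(
        "http://localhost:8983/solr",
        "civil rights",
        facet_fields=["dc.subject", "dc.format"],
        highlight_fields=["description"],
    ))
```

Highlighting only works on stored fields, so the field passed in `highlight_fields` needs to be one the index actually stores.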

For a DAM project, each asset should appear in the search index with the basic layer of contributed metadata, relationships, metadata extracted from the assets, as well as the administrative metadata managed by Fedora. I would align the fields with the Dublin Core (and DCTerms) elements (which is probably all you can get users to contribute in any case). At this point, because legacy systems lack authority control, linked data, and the like, existing metadata is sparse, inaccurate, or limited. The entry-level bar is therefore set pretty low, so ease-of-use and metadata collection are the priorities. Eliding a lot of detail, here’s the skeleton schema:

  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <field name="title" type="string" indexed="true" stored="true" multiValued="true"/>
  <field name="description" type="string" indexed="true" stored="true"/>

  <dynamicField name="dc.*" type="string" indexed="true" stored="true" multiValued="true"/>
  <dynamicField name="dcterms.*" type="string" indexed="true" stored="true" multiValued="true"/>
  <dynamicField name="rdf.*" type="string" indexed="true" stored="true" multiValued="true"/>
  <field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
  <field name="payloads" type="payloads" indexed="true" stored="true"/>
  <field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>

  <copyField source="title" dest="title_t"/>
  <copyField source="subject" dest="dc.subject"/>
  <copyField source="description" dest="description_t"/>
  <copyField source="comments" dest="text"/>
  <copyField source="dc.creator" dest="author"/>
  <copyField source="dc.*" dest="text"/>
  <copyField source="text" dest="text_rev"/>
  <copyField source="payloads" dest="text"/>

  <copyField source="dc.title" dest="dc.title_t"/>
  <copyField source="dc.description" dest="dc.description_t"/>
  <copyField source="dc.coverage" dest="dc.coverage_t"/>
  <copyField source="dc.contributor" dest="dc.contributor_t"/>
  <copyField source="dc.subject" dest="dc.subject_t"/>
  <copyField source="dc.contributor" dest="names_t"/>
  <copyField source="dc.coverage" dest="names_t"/>
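
If the copyField rules above look opaque, this Python sketch simulates a few of them on a plain dictionary. It is a toy model written for this post, not Solr code. It assumes every field value is a list, and it follows my understanding that Solr copies only from the fields as submitted, so one copy does not feed into another.

```python
from fnmatch import fnmatchcase

# A subset of the copyField rules from the schema above: (source pattern, destination).
COPY_RULES = [
    ("title", "title_t"),
    ("dc.creator", "author"),
    ("dc.*", "text"),
    ("text", "text_rev"),
]


def apply_copy_fields(doc, rules=COPY_RULES):
    """Return a new document with copyField-style copies applied."""
    result = {name: list(values) for name, values in doc.items()}
    for source, dest in rules:
        # Read only from the incoming document, so copies never cascade.
        for name, values in doc.items():
            if fnmatchcase(name, source):
                result.setdefault(dest, []).extend(values)
    return result


if __name__ == "__main__":
    incoming = {"id": ["demo:1"], "title": ["Evening News"],
                "dc.creator": ["WGBH"], "dc.subject": ["News"]}
    print(apply_copy_fields(incoming))
```

In this model the catch-all `text` field collects every `dc.*` value, while `text_rev` stays empty unless something was submitted to `text` directly.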

The new edismax query parser provides such a good balance of flexibility, advanced query features, and ease-of-use that it seems like an obvious choice here.
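
A minimal sketch of what selecting edismax looks like from the client side, again in Python. `defType` and `qf` are real Solr parameters; the particular fields and boost values are placeholders I picked for illustration and would need tuning against real data.

```python
from urllib.parse import urlencode


def edismax_params(user_query, field_boosts):
    """Build request parameters that hand the user's query to edismax."""
    # qf lists the fields to search, each with an optional ^boost.
    qf = " ".join("{}^{}".format(name, boost) for name, boost in field_boosts.items())
    return {"defType": "edismax", "q": user_query, "qf": qf}


if __name__ == "__main__":
    # Illustrative boosts only: titles matter most, then subjects, then everything else.
    params = edismax_params("boston busing", {"title_t": 5, "dc.subject_t": 2, "text": 1})
    print(urlencode(params))
```

The appeal is that the user’s input can be passed through largely as typed, with the weighting decisions living in `qf` rather than in hand-built query strings.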

The only penalty you pay by using Solr is having to keep the Solr index synchronized with your data sources. For synchronizing data from Fedora, there is now a proliferation of options, ranging from the task-specific, with Java plugins like GSearch and Shelver, to the more generic (ESBs and all that), like Apache Camel or the Ruote-based Fedora Workflow component. Because DAM likely involves many different workflows, I lean towards the more generic solutions. Lately, I’ve given Camel a try, and after a couple of days of Java-dependency-induced head pounding, I have something that works.
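
Whatever does the plumbing, the core of the synchronization step is small: turn an object’s metadata into a Solr update message. The sketch below is not the Camel route I actually built; it is a standalone Python illustration that emits Solr’s XML `<add>` format, with the `dc.` prefix chosen to match the dynamic fields in the schema above. The function name and sample values are made up.

```python
import xml.etree.ElementTree as ET


def dc_to_solr_add(pid, dc_elements):
    """Build a Solr XML <add> message from (element, value) Dublin Core pairs."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    ET.SubElement(doc, "field", name="id").text = pid
    for element, value in dc_elements:
        # Repeating a field name is how multi-valued fields are sent.
        ET.SubElement(doc, "field", name="dc." + element).text = value
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    print(dc_to_solr_add("demo:1", [("title", "Evening News"),
                                    ("subject", "News"),
                                    ("subject", "Boston")]))
```

The resulting string would be POSTed to the index’s update handler, followed by a commit; that part is left out here because it depends on how the Solr instance is deployed.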

On Twitter, John Tynan requested a virtual machine image to encourage others to begin playing with this software, so I’ve actually begun building some of these pieces. Currently, I have Fedora/Camel/Solr/Blacklight installed and functional, but before I try to package it up, I feel like I should add an easy-to-use ingest system to get data in.

Posted in Repository, TODO | 4 Comments

Digital Asset Management for Public Broadcasting: Fedora Commons Repository (Part 1 of ??)

In my previous post, I provided a broad overview of the challenges and opportunities for developing an open source digital asset management system within the public broadcasting community, and described some fundamental technology that is already being developed and deployed within institutions. In this post, I want to look specifically at the role the Fedora Commons repository architecture can play in this environment. Additional reading is available from the Fedora Commons wiki, especially the Getting Started with Fedora article, which articulates some of the strengths of their approach in the abstract.

The Fedora Commons data model is built on top of the Kahn/Wilensky Architecture, which describes a data structure for primary digital objects (irrespective of the data or formats contained within). Already, this is an improvement over some systems, which differentiate between content types, relegating some content formats to second-class citizenship. By providing a single, fundamental data type, one can build consistent user experiences on top of the discoverable components and interact with the digital objects to GET THINGS DONE.

Within digital objects are datastreams, which may include both data and metadata about the object, and are treated equally (more or less…). Datastreams can carry revision information, integrity checks, and other provenance information. By not distinguishing between “digital” assets (for which the data, e.g. the media files, is available electronically) and other kinds of assets (physical tapes, abstract entities, etc), an asset management system can encompass the full range of materials within an active media archive.

Digital objects can be assigned content model types, which stipulate the required (and optional) component datastreams, as well as define the services that operate on objects of that type. These content types are simply structured digital objects within the repository, allowing repository managers (and content creators, given a sufficient interface) to define the structure of their content rather than structuring their content to meet the needs of the digital asset management system.
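
To show the idea of a content model stipulating required and optional datastreams, here is a toy Python check. It only mirrors the concept; Fedora stores content models as objects inside the repository, and none of the names below (`ContentModel`, `check_object`, the datastream IDs) come from Fedora itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContentModel:
    """A simplified stand-in for a content model: which datastreams an object needs."""
    pid: str
    required: frozenset
    optional: frozenset = frozenset()


def check_object(datastream_ids, model):
    """Compare an object's datastream IDs against a content model."""
    present = set(datastream_ids)
    return {
        "missing": sorted(model.required - present),
        "unexpected": sorted(present - model.required - model.optional),
    }


if __name__ == "__main__":
    # Hypothetical model for a video asset.
    video = ContentModel("cm:video", frozenset({"DC", "MASTER"}), frozenset({"THUMBNAIL"}))
    print(check_object(["DC", "CAPTIONS"], video))
```

Because the model is just data, a repository manager could change what a “video” requires without touching the code that checks it, which is the point the paragraph above is making.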

Types of datastreams natively supported include Inline XML datastreams, Managed Content, Externally Referenced Content, and Redirects. The datastream types do not speak to the format of content stored within them (except for inline XML), which allows content creators to easily provide content to the repository without first worrying about transcoding materials or other barriers to accessioning content (which is certainly not to say that standardizing content types archived within the repository is problematic — just that it shouldn’t interfere with getting the materials in the first place). This variety of types allows content to be stored and managed in the most appropriate places, rather than arbitrarily requiring centralization or “physical” ownership of content. Within a distributed organization like public broadcasting, this could be a powerful concept that allows content creators to control and manage their content at various stages of distribution (and, while this could be accomplished within traditional database driven systems, it would require custom application logic to do, which is likely not scalable across a wide variety of applications, frameworks, and languages).
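
The four datastream types can be modelled in a few lines of Python. The single-letter codes are the control group values Fedora uses for them, as far as I know (X, M, E, R); the `Datastream` class, its fields, and the sample URL are illustrative inventions.

```python
from dataclasses import dataclass
from enum import Enum


class ControlGroup(Enum):
    INLINE_XML = "X"   # XML stored inside the object itself
    MANAGED = "M"      # bytes stored and managed by the repository
    EXTERNAL = "E"     # content lives elsewhere, served through the repository
    REDIRECT = "R"     # clients are redirected to the external location


@dataclass(frozen=True)
class Datastream:
    dsid: str
    control_group: ControlGroup
    mime_type: str
    location: str = ""  # only meaningful for external and redirect datastreams


def stored_in_repository(datastream):
    """True when the repository holds the bytes itself."""
    return datastream.control_group in (ControlGroup.INLINE_XML, ControlGroup.MANAGED)


if __name__ == "__main__":
    master = Datastream("MASTER", ControlGroup.REDIRECT, "video/quicktime",
                        "http://media.example.org/tape01.mov")
    print(master.dsid, stored_in_repository(master))
```

Note that nothing here constrains the MIME type, which matches the point above: the datastream type says where content lives, not what format it is in.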

While all datastreams are equal, there are four (or more?) that are more equal than others:

- AUDIT, which stores the history of the digital object as it is modified.

- DC, a Qualified Dublin Core datastream, that provides a minimal level of interoperability for the most generic of repository management interfaces. This is also the only fundamentally required datastream (without specifying required elements within it), and really is the bare minimum of information necessary to assert the existence of an object (if it doesn’t have a title, identifier, or description, what is it we’re talking about exactly?)

- RELS-EXT (and RELS-INT), an RDF/XML datastream in which one can assert relationships to other digital objects (which may exist within the repository, but may also exist (or not exist) elsewhere). These relationships can be from any vocabulary and reference any type of object, which is handy when you are dealing with complex relationships between media archive assets. This datastream is also generally indexed in an RDF triple-store to provide relationship querying.

- POLICY, which stores XACML security policies for the digital object. These can be used to restrict access to the datastreams, services, or the object itself, based on whatever the security needs are. Within the digital asset management context, this could also be used to restrict access to only the media files while still providing the metadata (so one could assert and describe the existence of an object without actually sharing it, for whatever reason, which seems atypical for some commercial solutions).
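
As an example of what one of these datastreams contains, the Python sketch below generates a minimal RELS-EXT document. The RDF namespace is the standard one, and the relationship namespace and `info:fedora/` URI scheme are those of Fedora’s relations ontology as I understand it; the PIDs and the helper function are made up for illustration.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
REL = "info:fedora/fedora-system:def/relations-external#"


def rels_ext(pid, relationships):
    """Serialize (predicate, target URI) pairs as a RELS-EXT style RDF/XML string."""
    ET.register_namespace("rdf", RDF)
    ET.register_namespace("rel", REL)
    root = ET.Element("{%s}RDF" % RDF)
    description = ET.SubElement(
        root, "{%s}Description" % RDF, {"{%s}about" % RDF: "info:fedora/" + pid})
    for predicate, target in relationships:
        # The target may point inside or outside the repository.
        ET.SubElement(description, "{%s}%s" % (REL, predicate),
                      {"{%s}resource" % RDF: target})
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(rels_ext("demo:1", [("isMemberOf", "info:fedora/demo:news-collection")]))
```

Predicates from other vocabularies would work the same way with a different namespace, which is what makes the datastream useful for the tangled relationships found in media archives.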

By default, these datastreams (and the digital object wrapper) are stored on the file system in relatively comprehensible ways, which is a bonus to implementors who can set up underlying hardware or other technology in traditional ways and just begin to use the software without too much fuss. There is ongoing development to build in support for additional and evolving standards around digital object storage, serialization, access, and other services which should only help with making the process as transparent as possible.

All of this technology and flexibility comes “free” with the repository architecture and doesn’t try to interfere with actually making use of the assets (except as restricted by security policies, of course), which allows different use cases to be expressed in the most logical and straightforward way (rather than trying to bend the use cases or system in an attempt to mimic some of the elements the user needs). As a starting point for developing a digital asset management solution for media, I believe it offers a good balance of flexibility and requirements that can ensure user needs are met without sacrificing durability.

So, how can Fedora be applied in a digital asset management context for public broadcasting? First and foremost, Fedora provides a trusted platform for managing and maintaining content for many different contexts (production, long-term archiving, etc) on top of a variety of hardware and standards. By managing metadata and data together, physical and digital assets can be revealed in a common interface (when appropriate) to meet the needs of researchers and scholars (for whom the knowledge of the existence of the asset is more essential than on-demand access). Finally, by offering a stable API to a variety of resources, use-case driven interfaces can be developed, shared, and maintained to meet different needs sensibly.

Posted in Repository, TODO | Leave a comment

Digital Asset Management for Public Broadcasting (Part 0 of ?)

Digital asset management is hard. Many people have solved many parts of the problem, but for a reasonably complex use-case, many of the existing solutions just aren’t there yet. This is especially true in a vendor-driven world serving a niche market within a niche market: public broadcasting is concerned with all levels and life-cycles of an asset (from production, to reuse, to archiving and back again), and, given public broadcasting budgets, it is almost certainly not a profitable market. I believe this is an ideal area for the development of open source solutions based on existing open source software.

The “easy” part in the DAM ecosystem, I would argue, is archiving the material and ensuring its long-term preservation (and accessibility!). I’ve done a couple of projects and prototypes now based on the Fedora Commons repository architecture, and it seems to be a promising platform for this kind of development. Objects and datastreams are stored on the file-system, which IT staff are traditionally prepared to manage (vs some unique database structure almost certainly obfuscated in layers of (de-)normalization). Fedora will happily manage security policies, object relationships, data transformation services, and (shortly) more advanced file system interactions, while exposing a (relatively) consistent HTTP interface.
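
To give a flavour of that HTTP interface, here is a Python sketch that builds the URL for a datastream’s content. The path layout follows the Fedora 3.x REST API as I understand it; the base URL, PID, and function name are assumptions for the example, so check them against your own installation.

```python
from urllib.parse import quote


def datastream_content_url(base_url, pid, dsid):
    """Build the URL for retrieving one datastream's content over HTTP."""
    return "{}/objects/{}/datastreams/{}/content".format(
        base_url.rstrip("/"),
        quote(pid, safe=":"),   # PIDs look like namespace:id, so keep the colon
        quote(dsid, safe=""),
    )


if __name__ == "__main__":
    # Hypothetical local repository and object.
    print(datastream_content_url("http://localhost:8080/fedora", "demo:1", "DC"))
```

Because every object and datastream is addressable this way, any tool that can make an HTTP request can work with the repository.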

Discovery interfaces are probably the next easiest piece, having been examined and developed out of the information sciences communities. Using a combination like Solr and Blacklight (deployed successfully for WGBH’s Open Vault website), one can rapidly create interfaces to the underlying content that satisfy the many use cases. With Solr, you get a bunch of discovery mechanisms and options, including relevancy, term highlighting, faceting, etc.

From here, we start getting into the hard parts. Ingest and metadata editing are difficult to solve well in a content- and use-case-agnostic way, which is the approach most systems seem to take. While the need for a generic asset management view is important (and solved!), if the collection of services fails to meet the needs of the users, encouraging adoption (nicely) is problematic. By using infrastructure elements with open and well-documented APIs, developers can extend and customize the user experiences to match the underlying data and processes. This is an area in which the adoption and support of open source projects can encourage sustainable development of these interfaces.

It seems like, after clearing these obstacles, many systems fail to account for the use and re-use of these objects within the media communities. Few systems account for batch encoding video and audio for web distribution, one-click publishing systems to blogs, social networking sites, or video portals, integration into broadcasting chains, etc — for very good reasons, there simply isn’t the incentive when faced with large upfront development costs for unique development. Given an open source platform, however, that supports (and encourages) sharable development of solutions, maybe we could start finding answers to these persistent problems (without re-inventing the wheel!).

I believe most of the core infrastructure pieces are there:
- Fedora, as I mentioned, which provides preservation and management services;
- Solr, which provides a discovery framework (and associated metadata extraction utilities like Tika);
- Blacklight, which provides discovery and access services;
- ESB or other workflow solutions like Camel, Ruote, or otherwise;
- Generic metadata editing options, like XForms, Django, etc;
- Open standards that allow for publishing and reuse (Atom, MediaRSS, RDF, ???);
- FFMPEG, which offers encoding and transcode services.
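
As a taste of how little glue the last item needs, this Python sketch builds an ffmpeg command line for a web-friendly derivative. The flags (`-i`, `-c:v`, `-b:v`, `-c:a`, `-b:a`) are standard ffmpeg options, but the codec choices and bitrates are placeholders, encoder availability depends on how ffmpeg was built, and the helper is purely illustrative.

```python
from pathlib import Path


def ffmpeg_web_command(source, out_dir, video_bitrate="1000k", audio_bitrate="128k"):
    """Return the argv for transcoding one file to an H.264/AAC MP4."""
    target = Path(out_dir) / (Path(source).stem + ".mp4")
    return [
        "ffmpeg", "-i", str(source),
        "-c:v", "libx264", "-b:v", video_bitrate,   # video codec and bitrate
        "-c:a", "aac", "-b:a", audio_bitrate,       # audio codec and bitrate
        str(target),
    ]


if __name__ == "__main__":
    # Print the commands for a small batch rather than running them.
    for master in ["masters/tape01.mov", "masters/tape02.mov"]:
        print(" ".join(ffmpeg_web_command(master, "web")))
```

Handing each list to `subprocess.run` would perform the batch; wrapping that in a workflow step is where Camel or ruote would come in.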

It isn’t an extensive development problem. These are well-established communities in their fields; it’s a simple matter of getting initial momentum in tying the complex pieces together and creating interesting and useful services on top.

So, why aren’t we doing this? Money, time, lack of a collaborative/communicative culture, and apathy towards (and acceptance of) second-rate, buggy commercial solutions that fail to address all aspects of a media object’s life-cycle as it goes from the rapid iterations in production to many different distribution channels and back to relative obscurity in an archival context (until a new production pulls it out again). Without full support, no step in the process can realize the potential of the content or provide the incentive to put in the hard work to ingest and describe the asset.

Posted in Repository, TODO | 2 Comments