Variations, a digital library system dedicated to music, is now available as an open-source project. No, Variations will not replace your iTunes/WinAmp/Pandora/etc. But for universities and other organizations that want to make a large collection of music available to students, it might be just the thing they have been waiting for. In addition to basic audio playback, Variations includes several tools for analyzing and studying music.

Variations was the first digital library project I worked on, so I am thrilled that it can finally be shared with the rest of the world. Although much work has been done in the years since I left the project, I’m happy to see that some of my code was worth keeping around. In fact, if you know where to look, you can still find me listed as author of a few files.

Congratulations to the Variations team!

(And thanks to my fellow developer Jim for reminding me that I need to blog about important things like this.)

The 2008 OpenRepositories conference was a lot of fun, but extremely tiring. Many of the days lasted 13 hours, and one day went to 15. As one of my fellow attendees stated, someone was trying to make sure we experienced every last bit of “conferencely goodness”. This, combined with the fact that I had some non-conference work to do, meant that I had to consume a lot more caffeine than normal.

Despite a few minor irritations, the conference went very smoothly. A large amount of technology was used, and for the most part worked as expected. Although the schedule was packed full, everything ran on time. Kudos to Les Carr and the team at Southampton!

A major achievement of this conference was the development of a conference repository. Most of the papers/presentations were collected beforehand, and it was very useful to refer to them during talks. Unfortunately, the content from the DSpace and Fedora user group meetings is not available yet; I hope it will eventually appear, because there were some great talks in both.

There was an increased focus on scientific data in repositories this year, starting with a keynote by Peter Murray-Rust that described the outlook for repositories from the viewpoint of chemistry researchers. Some of the points he brought up were echoed in many other talks throughout the conference:

  • Get into the researchers’ authoring stream as early as possible. One method that seems to be making headway is to propose the repository as a (dark) backup for the scientist’s local machine. This puts the content into the repository immediately, and there is little effort required of the scientist when the time comes to make it public.
  • Repositories must focus on text mining and other automated methods for metadata generation because “scientists hate metadata”.
  • PDF can cause serious problems for automatic processing methods. It is often better to locate the document that was used to produce the PDF, and process that instead.

I was surprised to see that code for new repository functionality is coming out at an astounding rate, to the point where I no longer have time to keep up with everything. Here are just a few of the announcements I remember:

  • The SWORD project has released a client for producing SWORD ingest packages, and server-side tools to ingest these packages into four major repository platforms.
  • The NSDL is starting to release most of the tools they have developed for their Fedora system, including the OnRamp/OnFire enhancements to Fez. It looks like OnRamp/OnFire will be rolled into the main Fez distribution, while other tools are available from the NCore SourceForge space.
  • Several add-ons to DSpace 1.5 have been released by Graham Triggs and Tim Donohue. (See the HOWTO Category on the DSpace wiki.)
  • Sneep, the Social Networking Extensions for EPrints, should be available in May.
  • Within the Fedora community, there are quite a few new projects releasing code. Muradora and eSciDoc are the most interesting to me, but I’m sure there are others I missed.

Another major development is Microsoft’s entry into the repository arena. When this was announced shortly before the conference, I was extremely skeptical. Within libraries and universities, there has been a backlash against vendors, to the point where most people working on a “serious” repository won’t touch a product unless they can see the source. Even if Microsoft decides to open-source the repository itself, it depends on closed-source pieces, including the .NET framework and SQL Server. Due to these factors, I’m unlikely to even play with the new repository software, but I don’t wish Microsoft ill. I was incredibly surprised at the amount of negativity directed their way by some of the other conference attendees. I can understand frustration from the Fedora community, because the new system mirrors many features of Fedora, but some people seemed offended by the simple fact that Microsoft sent representatives to the conference.

What does Microsoft’s move mean for the future? I have no idea. Microsoft has a hit-and-miss record when entering new markets, and only time will tell if they manage to build something that customers want. Regardless, it means this whole idea of repositories is really starting to catch on, because the big kids are paying attention.

The conference ended with the official European launch of the OAI-ORE standard (called simply “ORE” by most people). So far, the greatest success of ORE is in getting attention from influential people. While there are a few demonstration systems, it is unclear just how useful the standard will be. In some ways, ORE is a dumbed-down version of METS. But the simplicity of the basic standard (assuming there is eventually a simple set of documentation) will appeal to many, and the use of arbitrary graphs rather than hierarchical structure means that ORE can handle a few types of information that are painful to represent in METS. However, note that ORE is an abstract model, while METS is a concrete data format, so it is theoretically possible to represent ORE information in METS format, though this may not be useful.
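To make the graph-versus-hierarchy point concrete, here is a minimal sketch in Python. It is not taken from the ORE documentation or from any real serialization: the triples are plain tuples, every `ex:` URI is made up, and `ex:derivedFrom` is a hypothetical predicate used only for illustration (`ore:aggregates` and `ore:describes` are terms from the ORE vocabulary). The sketch shows one dataset belonging to two aggregations at once, which is the kind of thing a strict single-parent hierarchy makes awkward.

```python
from collections import defaultdict

# Each triple is (subject, predicate, object). All ex: URIs are invented,
# and ex:derivedFrom is a hypothetical predicate, not part of ORE.
TRIPLES = [
    ("ex:rem-1", "ore:describes", "ex:article-agg"),
    ("ex:article-agg", "ore:aggregates", "ex:article.pdf"),
    ("ex:article-agg", "ore:aggregates", "ex:dataset.csv"),
    ("ex:dataset-agg", "ore:aggregates", "ex:dataset.csv"),
    ("ex:dataset-agg", "ore:aggregates", "ex:readme.txt"),
    ("ex:article.pdf", "ex:derivedFrom", "ex:dataset.csv"),
]


def parents_of(triples):
    """Map each aggregated resource to the set of aggregations that contain it."""
    parents = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == "ore:aggregates":
            parents[obj].add(subject)
    return dict(parents)


def shared_resources(triples):
    """Return resources aggregated more than once.

    In a strict tree every node has exactly one parent, so anything
    listed here would need duplication or cross-references.
    """
    return sorted(r for r, p in parents_of(triples).items() if len(p) > 1)


print(shared_resources(TRIPLES))  # ['ex:dataset.csv']
```

The dataset appears under both the article's aggregation and its own, and the article also points at it directly. In a graph, that is just three more edges.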

A few other notables:

  • All three of the major repository systems are coming out with new versions. DSpace 1.5 and EPrints 3.1 are available now, and Fedora 3.0 is in testing. I don’t know how much has changed in EPrints, but the DSpace and Fedora releases both represent major upgrades.
  • A large number of groups are building new systems for managing user accounts, most based on OpenID or Shibboleth.
  • Many are starting to view non-library systems like Flickr and Facebook as part of the repository ecosystem.
  • Quite a few Australian projects are using the Australian METS profile. I need to take a closer look at that.
  • I thought that I was working on a type of project that only a few other people in the world cared about — linking publications and the data used to create them. All of a sudden, everyone I meet is working on this exact problem!

One last result of the conference: I made a decision. Due to various events in my life, this blog has been on the back burner for a long time. No more. There are some pressing things that need to be said about repositories in general, and the Fedora vs. DSpace question in particular. During the past year, circumstances caused me to switch from the Fedora world to the DSpace world. Predictably, many of my conversations at the conference revolved around this switch. Now that my mind has started to truly process the problem, it is time to lay out the details. Fedora vs. DSpace. No holds barred. Coming soon.

Last week, the Encyclopedia of Life (EOL) released the first public version of their system. Although they had a huge amount of traffic, most critical reaction was neutral and/or negative. Notably, Rod Page posted a review which was categorized under the heading “suck”. And the comments from Slashdot were largely negative as well.

In general, it seems that no one out there understands what a difficult task the EOL is taking on. Yes, they have a lot of money and a lot of press/hype. Fine. Hold them to high standards. But, don’t hold them to an unrealistic timeline.

I’ve been involved in various digital library projects over the last 6 years, and I have never seen a serious project like this come together in less than two years. It takes time to find the right people for the job, time for those people to agree on the proper technical architecture, time to agree on policies, time to actually implement the system, time for testing, etc. This isn’t something that can be sped up by adding an extra $10 million or by using a “silver bullet” technology (*cough* Rails *cough*). If you want a high-quality system that will scale to large amounts of data and large amounts of use, while ensuring the longevity of the data, it’s gonna take two years. Or more.

You might argue that these technologies have been around for a while, and should be simple to put together into a new system. Unfortunately, as each new digital repository is built, there is a standard set of questions to be answered. A small sample of these questions:

  • How do I obtain content?
  • How do I massage content to fit my internal format?
  • What form of identifiers should I use?
  • How much metadata should I capture?
  • Can I get metadata automatically?
  • Can I build a system that encourages users to augment/correct the metadata?
  • Should I build the system from scratch, or build off an existing repository framework that doesn’t quite fit my needs?
  • What kinds of search/discovery make sense for this content, and what is the best way to implement them?
  • Am I violating any copyright or licensing restrictions by redistributing the content?
  • How can I present the content in a user-friendly way, while still accurately reflecting any copyright or licensing restrictions?

Answering each of these questions takes a certain amount of thinking, discussion, and implementation, all repeated until the solution is satisfactory. (If the answers to these questions are already known before the project starts, that means the project is just duplicating the functionality of another system, and there is no reason to build it in the first place.) Did I mention that this process takes about two years?

I’ve met a few people involved with the EOL. They are all smart people with good intentions, and a willingness to work with the biology community to build something truly worthwhile. They were forced to release an alpha version of their product in less than a year, primarily due to the schedule of the TED conferences. Yes, the current system is not overly impressive. I’m not worried about that. I’m much more concerned about where they go from here. Check the EOL site at this time next year, and then tell me what you think.

Why should you read this? Frankly, I have no clue. I’m not even quite sure why I’m writing it. I suppose I could say Stevey told me to. Or I could claim that the Great and Powerful Flying Zumwalt wasn’t tired of hearing my ramblings yet.

But the truth is, I’m writing this for me, not for you. Well, that’s not quite right. See, I’m writing it for me, but I want you to read it. Otherwise, why would I bother to publish it? I could easily keep it with my personal files (and I have a ton of those). But I have a desire to write something, and some of it just begs to be read. Yes, that’s it, I’m begging you to read this! Well, no, not quite like that. Perhaps I can explain by telling you about Dave.

Dave Sim used to write a comic book. Once in a while, he expressed controversial ideas. Readers would write in with statements like “I’m offended/bored, so I’m going to stop reading your book.” Dave would always reply that he didn’t care what the readers thought, he just had to write what was in his head. BUT, if the readers didn’t exist, Dave would have to find some other way to make money, and he wouldn’t have enough time to write all the things that were in his head.

So by publishing my ramblings, I suppose I am looking for an audience. Enough to justify the effort I put into writing them down. And if my audience happens to influence my ramblings, well, that’s just fine. We all win. Stick with me for a while, and you might like it. I don’t know quite what I’ll write yet. I have a few ideas to get me started. I’m certain to write something or other about technology, libraries, and business. I’ll probably write a bit about life in general. I haven’t decided how much to focus and how much to let my ramblings run wild. I’m sure that eventually things will go in a direction I never intended. But hey, we can all go down the rabbit hole together.