Harvard puts metadata for 12M library items into the public domain (hyperorg.com)
115 points by vgnet on April 25, 2012 | 24 comments


"The records consist of information describing works—including creator, title, publisher, date, language, and subject headings—as well as other descriptors usually invisible to end users, such as the equalization system used in a recording. "

I'm having a hard time thinking of what could be done with this data besides a library catalog.


I used to work in libraries for a while and there's a pretty wide range of things this is useful for.

Lots of interesting data analysis, similar to what people are doing with Google n-gram data for culturomics[0]. For example, since you have publication year and subject heading, you can look at the shift in popularity of certain subjects over time. I remember for fun I once plotted the life spans of various people by their area of research (art, math, sciences, etc.); it was interesting because there did seem to be some trends.
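As a rough illustration of the subject-popularity idea, here is a minimal Python sketch. It assumes the records have already been parsed into dictionaries with a `year` and a list of `subjects`; those field names, the sample records, and the function `subject_share_by_decade` are all hypothetical, and the real MARC21 data would need a parsing step first.

```python
from collections import Counter, defaultdict

# Hypothetical, simplified records (not real catalog data).
records = [
    {"year": 1905, "subjects": ["Mathematics"]},
    {"year": 1908, "subjects": ["Art", "Mathematics"]},
    {"year": 1962, "subjects": ["Computer science"]},
    {"year": 1967, "subjects": ["Computer science", "Mathematics"]},
]

def subject_share_by_decade(records):
    """Return {decade: {subject: fraction of that decade's records}}."""
    counts = defaultdict(Counter)  # decade -> subject -> number of records
    totals = Counter()             # decade -> number of records
    for rec in records:
        decade = rec["year"] // 10 * 10
        totals[decade] += 1
        # Use a set so a subject repeated within one record counts once.
        for subject in set(rec["subjects"]):
            counts[decade][subject] += 1
    return {
        decade: {s: n / totals[decade] for s, n in subj.items()}
        for decade, subj in counts.items()
    }

shares = subject_share_by_decade(records)
```

Using a share per decade rather than a raw count keeps decades with more catalogued items from dominating the trend.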

If you're doing any text classification research you now have a great way to label data if you just have title and author data. Or if you have texts with poor metadata you might be able to use this set to clean that up.

For libraries themselves I would love to see some machine learning approaches to cleaning up messy records, or just replacing bad records with good ones directly.

The big thing is that this is a very large set of curated bibliographic metadata from a reputable source. If you have any large project related to books (directly or indirectly) this could be a huge asset.

[0] http://en.wikipedia.org/wiki/Culturomics


You could merge this data with an online bookstore to get something that's like a library catalog in some ways but lets you buy the books.

If you're interested in "generic databases", like myself, books (and other creative works) are interesting because they are about various topics. Knowing what books have been written about what subjects is a bit like having preference data from millions of users, except there's more coherence, so you can do more with less physical data.


Here is an example of what you might do with this data:

http://www.worldcat.org/identities/


What is your point?


What is this data good for?


The first thing that comes to mind is that I should never have to type more than a couple of characters into a citation manager when adding a book.

Outside of the numerous uses for citations, reading lists, etc., I imagine that this is a very interesting dataset for researchers in publishing and library science. It is also a great resource for anyone developing library-related software.
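The citation-manager idea above could be sketched as a prefix lookup over a sorted list of titles. This is only an illustration in Python: the sample titles and the helper names `build_index` and `complete` are made up, and a real tool would index the full dataset and probably match on author as well.

```python
from bisect import bisect_left

# Hypothetical sample titles standing in for the catalog.
titles = ["Moby-Dick", "Middlemarch", "Walden", "Mrs Dalloway"]

def build_index(titles):
    """Sort (case-folded title, original title) pairs for binary search."""
    return sorted((t.casefold(), t) for t in titles)

def complete(index, prefix, limit=5):
    """Return up to `limit` titles that start with `prefix`, ignoring case."""
    key = prefix.casefold()
    # A 1-tuple sorts before any 2-tuple with the same first element,
    # so this finds the first entry whose folded title is >= key.
    start = bisect_left(index, (key,))
    matches = []
    for folded, original in index[start:]:
        if not folded.startswith(key) or len(matches) >= limit:
            break
        matches.append(original)
    return matches

index = build_index(titles)
suggestions = complete(index, "mo")
```

Because the index is sorted, all titles sharing a prefix sit next to each other, so the lookup is one binary search plus a short scan.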


Uh, for applications and services that store information about books?


Why not link directly to the official press release: http://isites.harvard.edu/icb/icb.do?keyword=k77982&page...


It's about time they did this. Harvard's was about the only major library that didn't allow Z39.50 access to their full MARC21 records. As a private individual with a large rare and antiquarian book collection, I welcome the news, since I've found that Harvard sometimes has the only other copy of a book I'm cataloging. A few other libraries require you to jump through some hoops to get to the data (British National Library, for example), but Harvard was shutting everyone other than faculty and alumni out.


I hate that the article title says "Big data for Books".

Here's a hint on how you can get a sense of whether you are dealing with "Big Data": IF I CAN FIT IT ON A THUMB DRIVE, IT ISN'T BIG DATA.


The thing is, bibliography cannot be "big data" by that definition. 12 million items is nearly 10% of the Google Books catalog.

http://booksearch.blogspot.com/2010/08/books-of-world-stand-...


The direct links for API access and download (3.16GB) are given in the DPLA Dev Blog: http://blogs.law.harvard.edu/dplatechdev/


It seems that BitTorrent would be the logical choice for distributing the dataset. I wonder if this is an oversight or if they are not expecting many people to download the dataset...


Almost certainly the former, but this was my first thought as well. I don't understand why more distributors don't at least provide this as an option; it saves them the bandwidth costs...


Burnbit.com provides a service that will turn any open URL into a BitTorrent seed.


"BitTorrent is illegal!!1111!"


I'm not sure who would need to actually fill out the submission form. But wouldn't this: http://aws.amazon.com/publicdatasets/#3 be convenient for working with a data set like this?


Universities are doing some pretty cool stuff with data. Every tech uni is now getting their students to work on social media data analysis. More exciting than entity relationship diagrams...


It's going to be interesting to see what people build and what analysis they do with this data.


>Finally, note that Harvard asks that you respect community norms, including attributing the source of the metadata as appropriate.

That's not what "public domain" means. If they wanted attribution, there are licenses for that. "Public domain" means that it belongs to all of us now. In that case, attribution is meaningless.


It is possible to live life at a higher standard than simply the minimum legal standard. In many communities, norms of attribution go well beyond what is required by copyright law.

For example, in academic writing, it is still considered plagiarism to copy from public domain texts without attribution, even if it is not copyright infringement to do so. Accordingly, academic writers attribute their quotes of public domain texts, even though there's no legal need to do so.

Here, Harvard is asking nicely that people attribute the data. In academic contexts, that's perfectly normal and reasonable. Anyway, it's just a request, feel free to ignore it -- you're right that you are legally free to do whatever you'd like.


There's a difference between attributing because you're required to and attributing because it's a community norm. Especially in academia it's common to give attribution to your sources even if you're not required to: it's considered courteous and gives your work more credibility if you mention a credible source.


While I'm not a huge fan of using the term for this sort of thing either, the license expresses it pretty clearly: http://creativecommons.org/about/cc0



