The Tenuous Nature of Online Archives

September 1, 2016

If historians are going to keep writing awesome political histories and other histories, we are going to need access to primary sources. Increasingly those primary sources are online. But as we know, anything online can disappear in a blink. There was that moment early in Google’s history when it hoped to digitize everything. But then it decided it had no interest in providing great public services it couldn’t monetize. So that died. In the case of the newspaper in Milwaukee, its newspaper archives simply became so expensive that the public library couldn’t afford the service.

Where had Milwaukee’s history gone?

The archive had initially been made available on Google around 2008 as part of the company’s effort to digitize historical newspapers. That project ended in 2011, but not before Google had scanned more than 60 million pages covering 250 years of history’s first drafts. Those newspapers have remained publicly accessible, and serve both professional historians and home genealogists.

When the Milwaukee project began, Google used microfilms from the papers that had already been uploaded to the ProQuest research database. Because some things were missing from ProQuest, the Journal-Sentinel asked the Milwaukee Public Library to help out. The library let the company digitize decades of microfilms to bulk out the digital archives.

But as Google discontinued support for the project, the paper decided to construct its own archive. “It takes a long time to scan and get the archives up,” said James Conigliaro, the paper’s vice-president of digital strategy. “So we’ve been working on that.”

The paper had an existing relationship with Newsbank, a digitization and archiving company based in Florida. In 2014, Newsbank approached the Milwaukee Public Library about buying the rights to the Journal-Sentinel archives. The MPL already subscribed to two Newsbank services (an obituary archive and a modern database of the Journal-Sentinel) and regularly purchases proprietary databases whose subscription fees are in the low five figures. But it couldn’t afford the Journal-Sentinel archives.

In May, Newsbank came to the MPL again, offering a menu of purchase options. The most expansive offer was almost $1.5 million, with an annual hosting fee. That nearly amounted to the library’s entire $1.7 million annual materials budget. “To be asked to purchase outright something for a million dollars is just out of our scope of possibility,” said Paula Kiely, the library director.

Then, in August, Newsbank let the other shoe drop: According to Urban Milwaukee, Gannett—which purchased the paper in April—asked the Journal-Sentinel to ask Google to remove the paper’s digital archives, which the company did. It’s harder to sell a product when it’s being given away for free, after all.

Someone figured it could make money on history. It can only do that by charging an absolutely exorbitant amount of money. Many cities can’t pay. So the historical sources disappear. If it is going to be a requirement that someone profit in order to make primary sources public, the future of the historical profession is grim indeed.

Comments (66)

  1. Warren Terra says:

    My memory is hazy: did Google lose interest in digitizing old printed content, because it costs money, or did they run into a blizzard of intellectual property claims and decide it wasn’t worth it?

    • Just_Dropping_By says:

      Based on the account in Wikipedia it sounds like several factors probably drove the decision to discontinue the Google News Archive project: https://en.wikipedia.org/wiki/Google_News_Archive

    • Brett says:

      It was both, although the copyright factor was much more prominent when they tried to digitize a bunch of books.

      Preservation in general is going to be hard in a situation like that, where you can only have digital archives if you negotiate through a blizzard of copyright claims. It’s not like with physical books, where you can just pile them up after purchase and have strong protected second sale rights.

      • Thom says:

        Meanwhile, they did digitize a lot of out-of-copyright books, which can be very helpful to researchers (though they did not do a perfect job of it–some pages are missing in some books).

  2. delazeur says:

    One of the many reasons I can’t stand the glorification of tech giants, and in particular the idea that they are benevolent guardians of the public interest. (I was honestly somewhat pleased to read about the SpaceX accident today.)

    • efgoldman says:

      I was honestly somewhat pleased to read about the SpaceX accident today.

      Are you old enough to remember the beginnings of the space program in the late 1950s? Because the number of failures getting early, primitive satellites into orbit became fodder for late-night comedy for a while.

  3. Stag Party Palin says:

    I used to get upset at things like this, but I have found peace. The truth is, it doesn’t really matter. (stands back to avoid flames) History is interesting, and it is instructive for some people – but very few. People who write history are biased, sometimes terribly. Most people who read history, like Otto Gershwitz, don’t understand it, don’t care, willfully misinterpret it, and so on. I’m only 70 years old and I have seen history forgotten several times, to our cost. In the cosmic scale of time it has been an eyewink since Mussolini and now we have a potential Mussolini version 2 and most people have no clue what the consequences might be.
    My apologies to professional historians, but you are losing to human nature.

    • (((Hogan))) says:

      It would probably be quite a bit worse without professional historians. Hard as that is to imagine.

    • cpinva says:

      “I’m only 70 years old and I have seen history forgotten several times, to our cost.”

      I’m 10 years your junior, and i must disagree with your description. history hasn’t been forgotten, it’s been intentionally ignored, because it was to someone’s profit to do so.

      a current example:

      the 8 year long train wreck that was the g.w. bush presidency. per the GOP, this never happened. what happened was that Bill Clinton, 22A notwithstanding, was President for 16 years, from Jan. 20, 1993, until Jan. 20, 2009. during this 16 year span, there were no republicans in Congress, having all been arrested and interned in FEMA concentration camps, those that hadn’t escaped to either Canada or Mexico. From Jan. 20, 2009 to the present, Pres. Obama and an all Democratic Congress took over.

      so, every bad thing that has befallen this country, since Jan. 20, 1993, every bad decision made, have all been the responsibility of the Democratic Party.

      don’t believe me? i challenge you to cite more than a very few mentions (and those only when it was impossible not to) of the Bush, Jr. presidency, in any news media tied to a major corporation. you will find very few, because he ceased to exist, the moment Barack Obama was inaugurated.

      this “disappearing” of recent history is absolutely intentional. to do otherwise would cause the few remaining sane republicans to commit group seppuku, when they realized what they’d done.

      • Stag Party Palin says:

        I’m 10 years your junior, and i must disagree with your description. history hasn’t been forgotten, it’s been intentionally ignored, because it was to someone’s profit to do so.

        We don’t disagree. ‘Ignore’ is one of many synonyms.

    • twbb says:

      “now we have a potential Mussolini version 2 and most people have no clue what the consequences might be.”

      It’s worse than Mussolini, because an incompetent narcissist in charge of pre-war Italy is not nearly as bad as an incompetent narcissist in charge of the most powerful country in the world.

  4. alercher says:

    The Slate article is misleading. The Journal-Sentinel is available from Newsbank. I’m a librarian and I checked. (7 hits on “trump” in the past 2 days.)

    The Milwaukee Public Library is subscribing (from Newsbank) to a digitized format of something it already owned in microfilm format.

    Probably somebody will be able to make money from the Newsbank database in the future, so probably it won’t be lost even if Newsbank changes hands or goes broke.

    Another question is how much of newspapers’ online content or print content will end up in Nexis or Newsbank archives in this format. The answer is not all, but that is a question for the database and newspaper publishers.

    • Richard Hershberger says:

      This is an important point. Scanning microfilm doesn’t harm it any, and they still have it. Not having access to the scans sucks, but frankly this mostly tells me that they should have been more careful when they agreed to let their microfilm be scanned, to have contractual language about the library’s access to said scans.

  5. Turkle says:

    Piketty was great on this point. Certain countries, I think India (?), stopped publishing economic statistics in book form, transitioning to digital only. Predictably, just a few years later, they are nowhere to be found.

    I am sorry, but the current reliance on digital archives is going to cause a huge black hole in the historical record.

    Just think: future generations will have no record of Trump’s tweets…

    Maybe it’s for the best…

    • delazeur says:

      Piketty was great on this point.

      Does he talk about this in Capital in the Twenty-First Century? I just got a copy, but haven’t started it yet.

    • skate says:

      Does anyone have a variable speed floppy drive that can read Mac floppies from the early 1990s and extract an electronic copy of a doctoral dissertation? AFAF.

      • The Lorax says:

        Ha! I still have Mac disks from then, too.

      • Caepan says:

        Why yes. Yes I do.

        Having been a Mac owner since the late 1990s, I have just about every peripheral, adapter, and doohickey that has graced a copy of a MacMall catalog. You need a SCSI Zip drive? (Internal or external?) A 1x external CD-ROM player with the tray? A Performa to Apple to RGB monitor adapter? Got ’em, and then some.

        I should really start an eBay store with all the obsolete computer stuff I have crammed in desk drawers in my home. Might start a little nest egg with the income.

        • Stag Party Palin says:

          What’ll you give me for an external IBM 8″ floppy drive? I’ve got lots of floppies with accounting data from 1977, written in RPGII, coded in EBCDIC, with a converter for copying to a DOS system.

          Only One Left!!!!!

          • efgoldman says:

            What’ll you give me for an external IBM 8″ floppy drive?

            Sumbitch. That’s what the radio station I worked for used in the 80s, only the hardware was Wang (local to Boston).
            We first had internal-only email on that beast ca. 1988, maybe? That’s when I first found out you better mean it when you hit “send”. Or as my engineer friend told me: emails are like ICBMs: once launched, they can’t be recalled.

        • I have a SCSI Zip drive. It did me a lot of good when my next Mac had no SCSI port as well as no floppy drive, the two formats to which I’d carefully archived everything.

      • (((Hogan))) says:

        Wait, we’re friends?

      • cpinva says:

        “Does anyone have a variable speed floppy drive that can read Mac floppies from the early 1990s and extract an electronic copy of a doctoral dissertation? AFAF.”

        funny you should raise this issue, i was going to. a problem i picked up on several years ago actually goes beyond even the medium issue, is the software issue. try opening a pdf file from 10 years ago, using the current version of Adobe Reader. you won’t be able to. the same goes for every Microsoft product. if you have a file in Word 97, it can’t be opened with Word 2010, unless you have constantly updated it, roughly every other new version or so, or you have a hard drive with the old versions of those apps on it, a motherboard that will operate it, and a monitor able to be used by them both.

        that’s the brilliance of Bill Gates. he forces you (along with the other makers of commonly used apps) to constantly update, or you will eventually lose everything. it’s also why most companies and gov’t agencies still maintain hard copies of all necessary documents, and probably always will. unless microfilm comes back into fashion.

        • Chuchundra says:

          Umm…what the hell are you talking about?

          I have PDFs from 1998 on my hard drive and Acrobat Reader opens them just fine. I also have Word docs from 1997 that MS Word is happy to open and display without issue.

          • efgoldman says:

            I also have Word docs from 1997 that MS Word is happy to open and display without issue.

            I don’t have any pdf files that are that old, but I’ve never had a problem with any MS-Word formatted docs from any era, in Word itself, Open Office/Apache or Libre Office. (Damned if I’ll pay for a MS office suite to use it a few times a year.)

            • skate says:

              Speaking of Libre Office, ISTR that it was the one piece of software I could use that would open my mom’s old AppleWorks (or was it Clarisworks?) files, because Pages certainly wouldn’t do so. Unfortunately, there was no way I or my brother was going to be able to get mom to actually use Libre Office.

            • Chuchundra says:

              You really only need the real MS Word® when you’re collaborating with other people; Libre and the other free options can’t deal with track changes and other MS specific nonsense.

              If you’re writing a letter or something, Libre works just fine.

        • skate says:

          I don’t recall having any trouble opening real old Word DOCs, aside from the ones stored on the aforementioned Mac floppies.

          Excel 1.0 files gave me some grief, though. Years after I first discovered that nothing newer than about the Excel version of c. 1995 would open them, I discovered I could use SheepShaver to emulate a MacOS 9 system, and then install an elderly copy of Excel that would open Excel 1.0 files and save them as something somewhat newer.

          But don’t get me started on Zip disks and drives. Last time I tried working with those was last year. Damn things would show me the disk contents but whenever I tried transferring files from the Zips onto the computer hard drive, I got the click of death and would have to reboot the machine.

          • Chuchundra says:

            The Zip drive was always a shit technology. They were unreliable and broke all the time even when they were brand new.

            But damn, a 100 megs on a re-writable disk just a bit bigger than a standard floppy was just too damn convenient to pass up.

      • Dr. Acula says:

        I have an Apple Disk ][ somewhere.

  6. bw says:

    Kiely expects the end result will be a kind of paid subscription service, and likely one of higher quality than Google’s offering, which was incomplete.

    Ugh, no it fucking will not be of higher quality. As a frequent user of Newsbank, I feel comfortable in saying that its product blows goats. Its UI looks like it was built in 1994 and I think hamsters on wheels power its datacenters. It’s also impossible for a user to tell just how complete it really is, as its archived newspapers are rarely if ever browseable, only searchable.

    For all of Google’s faults, at least for a while it made progress at freeing this material from the library equivalent of patent trolls. Ideally we’d have something like a British Library that would be the steward of this stuff, but since the Zombie-Eyed Grannystarver runs Congress we’re instead stuck with the minimal (though pretty well-executed) coverage provided by Chronicling America.

    • bw says:

      Oh, and while I’m ranting I might as well also mention the rank idiocy of having >94-year copyright law extend to newspaper content. I can at least understand how we might want to avoid having someone’s novel entering the public domain before, or soon after, they die. Who the hell is served by doing the same with newspapers, besides rentier incompetents like Newsbank and NewspaperARCHIVE? Newspaper publishers seem to be operating under the delusion that any meaningful number of people will pay $19.95/month for access to the Reno Gazette-Journal’s archives. It’s scrabbling over pennies.

      • Woodrowfan says:

        and some papers claim copyright status for even older material.

      • Warren Terra says:

        Newspaper publishers seem to be operating under the delusion that any meaningful number of people will pay $19.95/month for access to the Reno Gazette-Journal’s archives.

        Less absurdly, newspaper and magazine publishers often reserve access to the archives as a bonus for active subscribers to the new content.

        • bw says:

          How many of those subscribers ever actually bother using the archives? It seems to me that they’ve bundled their products in a way that looks superficially attractive, but satisfies nobody: the active subscribers who want the current newspaper get a throw-in they’ll never use, while the historians and genealogists are stuck paying either for the archives a la carte or worse, for a full active subscription.

    • cpinva says:

      “I feel comfortable in saying that its product blows goats.”

      ok, thanks so much for that disturbing mental image! i need to go bleach my brain now.

    • Dr. Acula says:

      As a frequent user of Newsbank, I feel comfortable in saying that its product blows goats. Its UI looks like it was built in 1994 and I think hamsters on wheels power its datacenters. It’s also impossible for a user to tell just how complete it really is, as its archived newspapers are rarely if ever browseable, only searchable.

      Sounds like the California EDD’s jobs database website. It appears to be a really crappy web covering on some sort of ancient MVS system.

  7. Brett says:

    Preservation is one of my biggest concerns when it comes to historiography going forward. So much primary source stuff is going to be things like digitally stored essays, or content created on social media – all of which can simply disappear if no one bothers to keep on migrating it over time. We could end up in a situation a few centuries hence where we have a wealth of statistical information and scientific papers, but a paucity of primary sources of anything in written form.

    • Lurker says:

      In the EU, copyright law has some important public good exemptions. For example, the Finnish National Library has the legal responsibility of crawling the net to regularly archive all Finnish or Finnish-language websites in existence. Similarly, all university libraries have an unlimited right to digitise their collections.

      However, such collections can, by law, only be accessed from computers physically in the said libraries, so their commercial value is nil.

    • wengler says:

      Preserving digital representations of newspapers is orders of magnitude easier than preserving physical copies or microfilm copies of said newspaper.

      This whole post is some weird bizarro world when it comes to historical preservation. Access is a problem now, because most people aren’t going to go travel hundreds or thousands of miles to look up the April 17, 1971, Bumfuck Times at the Bumfuck Library. Said library burns down or decides it doesn’t have the room to store all those moldy old newspapers and oops bye bye history.

      Buying access sucks but ragging on digitization is just plain stupid.

      • Brett says:

        I didn’t say it was harder, just that it’s more vulnerable. Books can burn or rot, but they can also last centuries. Digital content only survives as long as you migrate it to new storage as the old storage wears out, and the time frame for that is much shorter than with books.

        Just to give you an example, if they ever shut down MySpace completely, a whole ton of stuff – years worth of social interactions – will be gone forever. Now imagine that happens with Facebook in a few decades, which is far larger and more extensive in its reach. Or hell, if Facebook just purges years’ worth of accounts after inactivity. That’s a massive record of social interaction completely erased.

  8. drwormphd says:

    Nicholson Baker’s Double Fold looks more prescient every day.

  9. GeorgeBurnsWasRight says:

    Sort of related: model railroading is a hobby of mine. Just a few years ago the dominant hobby magazine offered CDs with all of their issues going back to 1934 when they began publishing. Hobbyists, except the ones who model the recent era, often value old magazine issues which are difficult to locate and bulky to store.

    The disks were encoded to protect the publisher from people copying the disks and getting the info for free. But the program which unlocked the disks ran only on Windows XP, so now the people who bought the disks can’t access them only a few years, or less, from the date they paid for the disks.

    The new business model is that you pay a monthly fee if you want to access the info via the publisher’s web site. This has some advantages, but at any time the publisher could decide the number of subscriptions doesn’t justify the cost of the servers, etc. and end access. And of course, the publisher might go out of business, too.

    • Woodrowfan says:

      The Breweriana hobby has much the same issue. An IT guy scanned in all the back issues of a couple of the major hobby magazines and created his own reader that works on pretty much every Windows system after Windows 95. But it took several years of arguing for the major collectors’ groups to agree to let members have it, due to copyright concerns. I have several shelves full of the old hard copies (over 40 years’ worth) just so I know I’ll still have access.

  10. Thom says:

    “The crisis consists precisely in the fact that the old is dying and the new cannot be born; in this interregnum many morbid symptoms appear.”

    Perhaps one of the morbid symptoms is the way that more material has become available, but with no certainty that it will remain available.

  11. […] reason I decided to write today was because I saw this blog post about this article, in which it was described how the on-line archives for a major metropolitan […]

  12. BethRich52 says:

    I have nothing to add on the technical front, but this post is making me nostalgic for my days spent scrolling through newspaper microfilm as I did research for college history papers.
