When Twitter began exploding last Sunday night in the run-up to Obama’s big speech, one of the first things I did was set the tweet-harvester my husband has invented to start collecting all tweets on the hashtags #OBL and #Bin Laden. My thought was to archive them for study, search or at least word cloudage later, much as I did with the Tahrir Square tweets or the ISA2011 conference hashtag Twitter feed.

For Sunday’s bin-Laden coverage, I only ran 100 iterations in 5-minute increments, but still ended up with 140,465 tweets at #OBL and 236,046 at #bin Laden. Stu ran a similar search of his own over a longer period using search terms instead of hashtags and is now sitting on over two million tweets, what he calls a “first draft of history.” (Indeed. Though in fact this is only a small sample of the total tweets occurring during the speech, the highest volume ever.)
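
A harvest like the one described above (a fixed number of fetches at a fixed interval) can be sketched roughly as follows. This is only an illustration: `fetch` is a hypothetical stand-in for whatever actually queries Twitter, and the `id` field on each tweet is an assumption, not a description of the real harvester or of the Twitter API.

```python
import time
from typing import Callable, Dict, Iterable, List


def harvest(
    fetch: Callable[[str], Iterable[dict]],  # hypothetical: returns tweets for a query
    query: str,
    iterations: int = 100,
    interval_seconds: float = 300.0,  # 5 minutes between fetches
    sleep: Callable[[float], None] = time.sleep,
) -> List[dict]:
    """Poll `fetch` repeatedly and keep each tweet once, keyed by its id."""
    seen: Dict[str, dict] = {}
    for i in range(iterations):
        for tweet in fetch(query):
            # Successive fetches can overlap, so de-duplicate on the (assumed) id field.
            seen.setdefault(str(tweet["id"]), tweet)
        if i < iterations - 1:
            sleep(interval_seconds)  # wait before the next fetch
    return list(seen.values())
```

With the defaults, a run makes 100 fetches with 99 five-minute waits between them, a little over eight hours in total; anything tweeted outside that window, or pushed out of the results between two fetches, is simply missed.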

Yesterday, Stu’s company began giving away the archive for free to researchers or journalists:

On the May 1, 2011 evening it was announced that Osama bin Laden had been killed, we started running repeated fetches against the Twitter API for the terms “osama” and “bin laden”. On May 3, we posted more than 1.2 million tweets in XML format. Since then, the live feed collection on DiscoverText keeps rolling along. The results are frankly more Tweets than anyone might ever need to understand this slice of the micro-blogging public sphere during a critical juncture in world history.

Now get this: Twitter has contacted DiscoverText and insisted that it stop giving away the data. Nick Judd at TechPresident has the story:

Shulman’s scholastic interests appear to be directly opposed to Twitter’s, which has an obligation to protect both its commercial viability (all it has to sell against is the content passing through its platform) and the privacy of its users. Facebook, which limits the use of its API to store user data for prolonged periods, has a similar stance.

I see a couple of interesting issues here.

First, it’s not entirely clear what Twitter’s concern is, but it’s probably not over “user privacy.” Judd is conflating Facebook, which is in theory a closed community where users get to choose who sees their updates, and Twitter, whose feeds are open to anyone who follows anyone, where users don’t get to choose who follows them, and where any tweet is essentially in the public domain (and users understand this). The kinds of FB privacy issues I’ve often blogged about (and which I’ve raised mightily with Stu with respect to his tools being used on FB data) just don’t make much sense in the Twitter case. Am I missing something?

Looking at Twitter’s actual communication to DiscoverText, it looks like they’re concerned with the terms of service, which do prohibit “redistributing… Twitter content… to any third party… without prior written approval from Twitter.” Oops. Guess I violated that rule when I cut and pasted jacksonjk’s tweet about the OBL story last Sunday night. So has any blogger who has ever taken a screenshot of a tweet and posted it, or any news article that has ever relied on a tweet as a news source or treated one as a news story. Maybe this incident will be one among many that cause Twitter to reconsider that rule in light of the way in which tweets are actually used.

For now, DiscoverText has taken the archive offline pending a resolution of the issue, but Twitter hasn’t said anything about disseminating visualizations of the data, so for what it’s worth here’s a word cloud of a sample:

What are readers’ thoughts? Is Twitter data proprietary? Is it private? How so? Will the world be better or worse if online tools make crowd-sourced coverage of political events easier to collect, sift, sort, mine and understand? Could this power be abused? How is this connected to the evolving relationship between social media, conventional media, science, politics and military operations?
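
On the word cloud: the frequencies behind a cloud like this are just word counts over the tweet text. Here is a minimal sketch, assuming the tweets are available as plain strings. The function name and the tokenizing pattern are illustrative choices, not the method used to produce the cloud above.

```python
import re
from collections import Counter
from typing import Iterable


def word_counts(tweets: Iterable[str]) -> Counter:
    """Count words across tweets, ignoring case."""
    counts: Counter = Counter()
    for text in tweets:
        # casefold() merges variants such as "Death" and "death";
        # the pattern keeps a leading # or @ so hashtags and mentions stay distinct.
        counts.update(re.findall(r"[#@]?\w+", text.casefold()))
    return counts
```

Calling `word_counts(tweets).most_common(50)` would give the fifty most frequent terms to feed into whatever draws the cloud. Folding case first is a design choice: it stops capitalized and lowercase forms of the same word from appearing as separate entries.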

  • NonyNony

    and where any tweet is essentially in the public domain

    I don’t think that this is actually correct. Tweets are public broadcasts, but that doesn’t mean they’re necessarily in the public domain. I’m not sure what copyright law has to say about messages of 140 characters, but I do know that things you post to the web are not automatically “public domain” just because you put them out in public. Blog posts are a classic example of this – you can’t just take someone else’s blog post and republish it without their permission, even if they’re “giving it away for free” on their own blog.

    Any copyright lawyers want to weigh in on whether tweets would be covered by copyright?

    • OK, well on the one hand I don’t disagree with you – “public domain” is the wrong term for what I’m describing, which is simply the publicly visible nature of tweets relative to (in theory) walled-off FB content.

      But I disagree with your comparison to blog posts. People can and do copy and paste blog posts or portions of posts and redistribute them all the time – in fact this is a fundamental part of how blogging works. The norm is to cite the original poster, of course, but it is not the case that anyone must seek the original poster’s permission before redistributing their words.

      I think what I’m arguing is that tweets are more like blog posts and less like private messages. This is true empirically – you don’t often see people redistributing FB status updates by their friends on blogs, but you do see tweets cited all the time.

      My argument though is not about citing or sourcing tweets but about the value of collecting and aggregating tweets around an event for study. I have not seen anyone object to people collecting blog posts and studying them as examples of political discourse. In fact scholars often do this – the problem is it’s currently hard to do systematically because of the nature of the data. Why are tweets different, other than because Twitter says they are?

      • NonyNony

        The incident with blog posts I was referring to was when a magazine was caught blatantly stealing blog posts from a blogger and republishing them in their magazine. The editor blindly thought that since it was published on the web it was “public domain” – which is not the case. You can’t republish other people’s work without permission, though there is the “fair use” exemption which allows you to quote snippets of other people’s work for criticism, discussion, etc. Bloggers make use of “fair use” with each others’ work and with the work of non-bloggers (like, say the NYT) all the time – it’s what makes blogging work. But that’s because there’s a specific fair-use exemption in there.

        If I collected all of your blog posts here and put them into a book, would you agree that I had violated your copyright? What if it was an epub book being distributed on Barnes and Noble’s web store? These are pretty clear violations of your copyright on your work – does it become murkier if I collect the work, package it up as a book and then give it away for free?

        As far as collecting corpora for research purposes – yeah, academics do it all the time. Check with your university lawyer on actually REDISTRIBUTING that work though – if your uni is anything like mine they will have a policy on what can and cannot be redistributed without permission of the original authors. (If I assemble a corpus of New York Times articles that the Times has copyright on, I can’t redistribute them without permission – why would I expect collections of blogs to be any different?)

        • Hogan

          “Using without permission” and “using without attribution” aren’t the same issue, are they?

          • rea

            One is pretty much a subset of the other, since those who use without attribution seldom have the permission of the copyright holder

        • Stu

          why would I expect collections of blogs to be any different?

          Perhaps as part of the scientific effort to improve human language tools and methods through replication and transparency? That is the reason the tools and sharing ideals, in this case, were born.

          I think a lot of what is taken to be a violation here is defined by the scale the technology enables, not the act itself. Sharing data (fair use) for scholarly purposes is well established. Though Twitter accused us of selling the data, we are really selling the tool and the methods behind it.

          I’m not certain Twitter is worthy of mining or that with all the time devoted to tool building I could even start to answer that question. Perhaps this is the Wernher von Braun defense?

  • chris

    the terms of service, which do prohibit “redistributing… Twitter content… to any third party… without prior written approval from Twitter.”

    Interpreted literally, this would include reading a tweet out loud to someone sitting where they can’t see the screen.

    ISTM that the question is not whether the contract says you have no rights and Twitter has them all; of course it says that, it was written by Twitter’s legal department. The question is how much of that the courts are willing to enforce.

  • Scott P.

    So the word cloud distinguishes “Death” and “death”? That’s silly.

  • Hogan

    Nice to see that even in these tumultuous times, Justin Bieber is still relevant.

    • Randy Owens

      No kidding. I was looking at it, and ‘Justin’ caught my eye, and I spent a few moments trying to think of relevant administration or military members named Justin. Then, the light bulb came on, and I looked for ‘Bieber’ to be in there somewhere, and found it rather quickly.

      But why?

      • SeanH

        Go to Twitter and click on one of the currently trending hashtags. In the resulting feed, you’ll see lots of spam tweets with a link and a random collection of currently trending words (so they show up on searches). I’ll bet that’s what’s happening – “Justin Bieber” was also trending, and spambots picked up on that.

  • I’d be willing to bet that bloggers are more aware of, and more sensitive to, intellectual property issues than Twitter users are (to the extent that the categories don’t overlap).

  • TwitterVirgin

    I guess I don’t understand twitter very well. I thought “tweets” could be searched by anyone. Couldn’t I get the same data by going to twitter myself and using the same search terms, with the specified time period? Or are tweets not archived?

    • That’s the point. Tweets are not archived, AFAIK. Stu’s tool (and possibly some others) allows for Tweet archival, an amazingly useful service for researchers.

      • Tweets are archived by all types of people through the API, and I’d be questioning Twitter’s sanity if they aren’t archiving all the material somewhere. This is typical boilerplate TOS. A better approach might be to open-source the actual tool for the collection of the data, rather than the data itself.

        • I think that part of the underlying software is open source (http://www.qdap.pitt.edu/). Stu can probably explain the difference between DT and QDAP more.

          I am a big fan of open source, but at the same time, I am well familiar with the fact that usually, the development speed of open source projects is glacial, and often, they are abandoned.

  • Ian

    The redistribution is for academic use. I know that copyright rules are looser for copying for academic use than they are ordinarily. Would this affect the legal status of the project?

    • Are they, really? Or is it that we don’t care, because few people actually sue academics?

      Fair use is overrated, and not international.

      • (the other) Davis

        I’m not sure what you mean when you suggest that fair use is “overrated” (do you think every use for any purpose should be licensed?), but the statutory fair use provision, 17 USC § 107, indicates that use of a copyrighted work for scholarship or research is presumptively fair.

        • (the other) Davis

          Which is to say that, yes, copyright rules are indeed looser for academics — at least in the US.

          • Till somebody actually threatens to sue you. In that case, even if you were to win, the costs (money, time) are beyond what most academics (or people, in general) are willing to stomach.

            • (the other) Davis

              For employed academics engaging in research as part of their employment, the employer (likely a university, in many cases) would actually be the one to defend the suit, so resources would be less of an issue than you suggest. And an institution with any foresight — which may be a generous assumption — would fight the lawsuit with the aim of obtaining favorable precedent, so as to limit long-term costs.

              As a real-life example, Texaco attempted to defend the photocopying of articles by its researchers under § 107, in American Geophysical Union v. Texaco. (Texaco ultimately lost, as the court characterized their copying as not for research purposes, but rather for archival purposes.)

  • Avelino

    Twitter, whose feeds are open to anyone who follows anyone, where users don’t get to choose who follows them, and where any tweet is essentially in the public domain (and users understand this)

    Minor quibble, but Twitter does provide these options: approving your followers, and private feeds that are not accessible to the search engine.

    Of course, those tweets wouldn’t appear in the searches you mention (unless, I believe, somebody does a retweet), but those types of accounts exist.

  • Could be that they fear that DiscoverText undermines the viability of Twitter selling their unthrottled “firehose” feed, which I understand is currently their biggest revenue generator (through sales to Google & Microsoft).

