##############################################################################
#
# Andrew G. West - Wikipedia Vandalism Corpus - Updated 2010/03/05
#
# This file describes the two corpora (rids_rb_vand.txt and rids_ml_vand.txt),
# which should be packaged with this file. It describes the method and
# rationale for labeling examples, as well as some basic statistics.
#
##############################################################################

######## GENERAL INFORMATION ##########

All examples in these corpora were pulled from a 2009/11/03 dump of the
English-language version of Wikipedia. In particular, examples are
"revision-IDs" (R-IDs), unique identifiers Wikipedia assigns to each edit.
To learn more about the edit associated with an R-ID there are two methods:

   (1) Wikipedia performs frequent dumps of its content in XML and SQL
       format. As of this writing, more information is available at
       [http://en.wikipedia.org/wiki/Wikipedia_database]. Be aware these
       files are large, unpacking to many gigabytes, or even terabytes
       in size.

   (2) Wikipedia has a networked-API [http://en.wikipedia.org/w/api.php]
       over which small batches of data can be obtained.

For casual experimentation, visit
[http://en.wikipedia.org/w/index.php?oldid=@@@&diff=prev] and replace `@@@'
with an R-ID -- the output will show the edit-diff.

Researchers are free to use these corpora in their own study of Wikipedia.
We simply ask that you cite our research when doing so:

   [*] West, A.G., Kannan, S., and Lee, I. (2010). Detecting Wikipedia
       Vandalism via Spatio-Temporal Analysis of Revision Metadata. In
       EUROSEC '10: Proceedings of the Third European Workshop on System
       Security. Paris, France. April 2010. (A preliminary version was
       published as Technical Report UPENN-MS-CIS-10-05).

Anyone with questions about the data should feel free to contact us at
westand@cis.upenn.edu. We would also love to hear from anyone using this
set and/or see research utilizing it.
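
The casual-experimentation lookup described above can be scripted. What
follows is only a minimal sketch in Python, not a tool shipped with the
corpora: the helper names `diff_url' and `parse_rids' are hypothetical, and
the parser assumes each corpus line holds a single decimal R-ID (blank lines
are skipped as a precaution). The URL template is the one quoted in this
file.

```python
from typing import Iterable, Iterator

# URL template quoted in this README; `@@@' is the R-ID placeholder.
DIFF_URL_TEMPLATE = "http://en.wikipedia.org/w/index.php?oldid=@@@&diff=prev"


def diff_url(rid: int) -> str:
    """Return the edit-diff URL for one R-ID (hypothetical helper)."""
    return DIFF_URL_TEMPLATE.replace("@@@", str(rid))


def parse_rids(lines: Iterable[str]) -> Iterator[int]:
    """Yield R-IDs from corpus lines, assuming one decimal R-ID per line."""
    for line in lines:
        line = line.strip()
        if line:  # skipping blank lines is an assumption about the files
            yield int(line)


# Made-up R-IDs standing in for the lines of a corpus file.
for rid in parse_rids(["12345\n", "67890\n"]):
    print(diff_url(rid))
```

To process a real corpus file, pass an open file handle (for example, one
for rids_ml_vand.txt) to `parse_rids' in place of the sample list. Because
a file handle is read lazily, this also suits the much larger
rids_rb_vand.txt.
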
Finally, for those interested in extending the set, we have large numbers of
R-IDs which our tool indicates are very likely vandalism -- but which have
not yet been manually confirmed.

###### ROLLBACKED REVISION-IDS ########

Contained in file [rids_rb_vand.txt] are 5,713,762 revision-IDs which were
flagged as `blatantly unproductive' by privileged Wikipedia users. Each line
in the file contains one R-ID. These R-IDs were obtained as follows:

   (1) Wikipedia, for a small set of privileged users, enables `rollback',
       which is essentially an expedited form of the `revert' function
       available to all users. When a rollback takes place, a revision
       comment of the form `Reverted edit by x to last version by y' is
       auto-written to the database. We searched revision comments for
       strings of this form, and several others (some of which are left by
       the `editing assistants', Huggle and Twinkle):

       Standard: "REVERTED EDIT% BY % TO LAST VERSION BY %"
       Linked:   "[[WP:RBK|REVERTED]] EDIT% BY % TO LAST VERSION BY %"
       Huggle:   "REVERTED EDIT% BY % TO LAST REVISION BY %"
       Twinkle:  "REVERTED % EDIT% BY % IDENTIFIED AS [[WP:VAND|VANDALISM]]%"

   (2) In total, we found 6.5 million revision comments of this form, which
       we term `flagging edits'. For each flagging edit, we proceeded
       backwards through the article history to find the guilty edit (i.e.,
       the one by `x'), which we call an `offending-edit' (OE). We were
       successful in 99.61% of cases.

   (3) Finally, we confirmed that the editor who initiated the rollback
       (i.e., the one who made the flagging edit) was privileged to do so
       (against a Wikipedia-provided permissions table). 88% of
       potential-OEs were rolled back by users who had proper permissions
       (on the dump date); these are the 5.7 million edits published in
       the packaged file.

Briefly, 91.15% of these OEs reside in Wikipedia's main namespace (NS0). The
first OE was committed at UNIX time-stamp 1075835039 (03 Feb 2004 19:03:59
GMT). We make no guarantees as to the quality of this labeled set.
However, given that rollbacks are initiated by privileged (i.e., trusted)
users, it is one we have used with high confidence in our own research.

###### MANUALLY LABELED VANDALISM #####

Contained in file [rids_ml_vand.txt] are 5,291 revision-IDs which have been
`manually labeled' as vandalism. These R-IDs are a completely disjoint set
from those found in [rids_rb_vand.txt]. R-IDs exhibiting potential-vandalism
were produced using the classifier described in the citation [*] above.
Actual vandalism was distinguished from false-positives as follows:

   (1) A set of potential-vandalism was produced (R-IDs). Then, using the
       `html2image' utility, the nicely contextualized and colored
       edit-diffs from Wikipedia
       [http://en.wikipedia.org/w/index.php?oldid=@@@&diff=prev] were batch
       downloaded in `html' format and locally converted to `jpg'.

   (2) Using a picture viewer with slide-show functionality (`qiv', in our
       case), we cycled through the diff-images, scoring them in a
       spreadsheet as `vandalism' or `not-vandalism' (see below). We found
       a rate of about 12 edits/minute was sufficient for both speed and
       accuracy.

All of the manually flagged vandalism resides in NS0, and was made within a
year of the dump date (i.e., 2008/11/03 to 2009/11/03). Note that the R-IDs
may not be well distributed throughout this time period (they should favor
more recent dates). Further, all the edits in this set were made by
anonymous users. Again, we make no guarantees regarding corpus accuracy.

########## VANDALISM CRITERIA #########

Of course, the line between vandalism and not-vandalism is a very blurry
one. For this very reason, we were conservative in our tagging, and
therefore chose not to publish the complementary `not-vandalism' set. The
following non-comprehensive guidelines were used to flag vandalism:

   (1) At the base-level, vandalism is any addition of content which is
       blatantly incorrect, offensive, nonsensical, or non-value adding.

   (2) For any edit where there was reasonable doubt about whether
       vandalism took place, we erred on the side of caution and considered
       it `not-vandalism'.

   (3) Grammatical considerations are not primary. If an addition was of
       any value, albeit in poor English or with poor spelling, this was
       NOT considered vandalism. However, if the lone change to an article
       was to turn grammatically correct text into incorrect text, then
       this was considered vandalism.

   (4) We examined the English-language edition of Wikipedia. Any text
       written in a foreign language (aside from the occasional
       pronunciation) was considered vandalism, with no attempt to parse
       its meaning.

   (5) Edits which grammatically corrected vandalism (e.g., hate speech)
       were NOT considered vandalism. However, edits which extended
       existing vandalism were considered vandalism. Finally, ignorance of
       vandalism (that is, editing around vandalism, but not repairing it)
       is NOT vandalism.

   (6) Ignorance of the rules/policies of Wikipedia is not vandalism.
       Attempting to add a telephone number to a business's article, for
       example, is against Wikipedia policy. However, we do not think such
       an act has malicious intent.

   (7) Any edit which is blatantly off-topic is considered vandalism,
       regardless of the factuality/merit of the content therein.

   (8) Mass deletion of content without justification is vandalism.

########### LABELING CAVEATS ##########

Small amounts of bias are introduced into our corpus via (1) our method of
finding potential vandalism, (2) the method by which potential vandalism is
inspected, (3) practical limitations, and (4) the human inspectors that make
the final determination. Each of these is now handled in turn:

   (1) The technique described in citation [*] above is used to generate
       the sets of potential-vandalism. As such, the tagged vandalism is
       not a random subset of the vandalism present on Wikipedia.
       Instead, the tagged vandalism is a subset that exhibits
       spatio-temporal properties indicative of its malicious nature.

   (2) Given that text-diffs are inspected, it is extremely difficult to
       detect vandalism which involves external hyperlinks and images.

   (3) Since `false' edits constitute vandalism, a completely thorough
       tagging can only be completed by an omniscient individual who can
       establish `truth' for every detail, for every article on Wikipedia
       -- so there are practical limitations. Consider subtle vandalism,
       such as the altering of numeric data, or vanity-edits where one
       places their own name in a narrative. Without intimate knowledge of
       the topic, such incidents can be extremely hard to definitively
       label as vandalism.

   (4) Building on the above point, the humans who made the final
       determination (i.e., this document's authors) are biased in the
       topics they have an intimate knowledge about. While we would be
       able to detect subtle vandalism on some topics (e.g., computer
       science), the same cannot be said for others (e.g., Japanese Manga
       and Comics).

#################################### END #####################################