0. The following contains information regarding the POS tagging for the English-Chinese and English-Arabic Treebanks. The first section deals with changes in policy regarding hyphenated items; the second is a list of decisions that were made in the course of POS-tagging the corpora.

1. Changes In Tokenization of Hyphenated Items

1.1 Introduction

Significant changes in tokenization policy have been implemented in the English-Chinese and English-Arabic Treebanks. In previous Treebank incarnations, all hyphenated strings were tokenized as a unit. This necessitates a trade-off: the annotation task stays simple, but problematic structures are introduced. A standard example of a problematic structure is "New York-based". The old Treebank policy split this into two tokens: 'New' and 'York-based'. Since 'York-based' clearly does not form a single lexical (or semantic) unit, whereas 'New York' does, it is counterintuitive and misrepresentative of the actual structure to leave such items as a single token. There are, in addition, many other uses of the hyphen whose annotation could stand to be revisited and improved.

One possible solution would be to separate every occurrence of a hyphenated item. This, however, would overgenerate tokens, thus needlessly complicating the annotation task. This is due to the variety of uses of the hyphen in standard written English and, more importantly, the fact that not all hyphenated items form lexical/semantic/morphological units.

The goal then is to determine a policy for the annotation of hyphenated items that reasonably reflects morphological constituency, is felicitous to other levels of annotation and will be intuitive to annotators.

It should be noted that many other uses of the 'hyphen' either fall under established policy, or can easily be accounted for in a way that is harmonious with pre-existing policy.
The two most common of such uses are simple punctuation (in which case it is technically an em- or en-dash) and 'symbolic' usage (as in 40-50%). In cases where a 'hyphen' is acting as punctuation, it should be treated as whichever form of punctuation it is functioning as. In practice it is usually a comma or a colon. Symbolic usage (keeping in mind that the main determining factor is whether it can be read out as a word) is tagged as SYM.

The biomed project has introduced two major innovations relevant to this question. First is the use of the POS tag HYPH to label hyphens. Second is the label AFX for subword morphological units. These two tags allow great flexibility in treating hyphenated items and have thus been adopted for the English-Chinese and English-Arabic treebanks.

1.2 The following guidelines assume an input similar to the Treebank II guidelines, i.e., that all hyphenated items are single tokens.

1.2.1 Do not break anything into units smaller than a word.

    cross-strait/JJ relations
    pro-Beijing/JJ position
    conduct a meta-search/NN
    high-tech/JJ solution
    vis-a-vis/IN
    wishy-washy/JJ

1.2.1.1 This is suspended where a bound morpheme takes scope over a multi-token item as in:

    non/AFX -/HYPH ethnic/JJ Chinese/NNPS
    a mini/AFX -/HYPH Marshall/NNP Plan/NNP
    was self/AFX -/HYPH designed/VBN and -/HYPH implemented/VBN

1.2.1.2 It is also suspended in (the relatively rare) instances of coordinated bound morphemes:

    pre/AFX -/HYPH and/CC post/AFX -/HYPH natal/JJ care
    Indo/AFX -/HYPH European/JJ and/CC -/HYPH Iranian/JJ linguistics

1.2.2 Break collocations involving participles.

    Hong/NNP Kong/NNP -/HYPH based/VBN companies
    wrinkle/NN -/HYPH removing/VBG cream

1.2.2.1 This policy is suspended if breaking the item would result in a subword unit:

    self-governing/JJ island
    non-leg-related/JJ problems
    non-threatening/JJ stance

1.2.2.2 The putative participle does not correspond to an actual verb (or not in the usual sense of an existing verb):
    half-assed/JJ job
    is very hard-nosed/JJ
    life-sized/JJ portrait

1.2.2.3 It is a combination of participle and particle:

    chopped-off/JJ finger
    grown-ups/NNS

1.2.2.4 It occurs as part of a proper noun:

    "A Number of Regulations Concerning Establishing Foreign-Invested/NNP Enterprises"

1.2.3 In general, break combinations with simple verbs as the final element:

    must/MD -/HYPH see/VB television
    the bag/NN -/HYPH wrap/VB method

1.2.3.1 However, do not break when simple verbs are the first element.

    committed a break-in/NN
    a bad break-up/NN
    the know-how/NN
    got the go-ahead/NN

1.2.4 Break all collapsed phrases.

    a/DT -/HYPH man/NN 's/POS -/HYPH life/NN -/HYPH is/VBZ -/HYPH hard/JJ generation
    the editor/NN -/HYPH in/IN -/HYPH chief/NN

1.2.5 Break all noun-noun (including proper noun) combinations.

    have beer/NN -/HYPH guts/NNS
    Taiwan/NNP -/HYPH Palau/NNP trade

1.2.5.1 This does not apply to nouns which are tagged as NNP because they appear in a proper noun.

    the '98 East-West/NNP China Cooperation, Investment and Trade Negotiation Conference

1.2.6 Do not break adjective-adjective combinations.

    white-hot/JJ poker

1.2.7 Break noun-adjective combinations.

    visa/NN -/HYPH free/JJ entry
    Rumsfeld/NNP -/HYPH free/JJ administration
    secretary/NN -/HYPH general/JJ
    world/NN -/HYPH famous/JJ annotator

1.2.8 Do not break adjective-noun combinations.

    are all pretty-boys/NNS
    large-scale/JJ assault
    works full-time/NN
    left-wing/JJ leanings
    Western-style/JJ houses

1.2.9 Break any combination with a cardinal number.

    16/CD -/HYPH year/NN -/HYPH old/JJ gymnast
    20/CD -/HYPH odd/JJ years

1.2.9.1 Numbers which are spelled out are excepted.

    twenty-five/CD years
    three-hundred/CD and forty-five/CD

1.2.10 Do not break combinations of adjective and adverb (in either order).

    remain ever-alert/JJ
    first-ever/JJ

1.2.12 If a hyphenated item is to be treated as a single token, follow the guidelines in the old POS manual.

1.2.13 Complex collocations.
In instances where hyphenated items are themselves attached via a hyphen, the decision on how to break them should follow the bracketing of the unit. For instance:

    (elementary-school)-age children

should be:

    elementary-school/NN -/HYPH age/NN children

Conversely:

    non-leg-related

should be:

    non-leg-related/JJ

1.2.14 /t,d/ Deletion.

Items such as the following were originally combinations of participle and noun. As part of the more general phenomenon of deleting coda /t,d/ in modern English, they appear as:

    life-size
    ice-cream
    skim-milk

Note that the hyphenated forms are not always the prescriptively preferred ones, but given how much variation there is, we can easily imagine them and, more importantly, should be ready to encounter them. The default in such instances should be to break them as noun-noun combinations. If, however, there is no corresponding noun (as in skim-milk), the item should be left unbroken.

1.2.15 Participle vs. Adjective.

In instances where it is ambiguous whether an item is an adjective or a participle, apply the diagnostics listed in the POS manual. Despite this, there will unavoidably be some inter-annotator (and possibly intra-annotator) variation. So it is possible to imagine the following competing annotations:

    clean/RB -/HYPH shaven/VBN men    or    clean-shaven/JJ men
    easy/RB -/HYPH going/VBG people   or    easy-going/JJ people

1.2.16 Issues of Scope.

Hyphenated items are also broken to ensure proper scope of adverbial modifiers. In practice, this has only occurred in items like the following:

    then/RB -/HYPH provincial/JJ governor/NN

2. Decisions:

2.1 CC or IN

    versus/IN

2.2 IN or JJ

    a field near/IN here

2.3 IN or RB

Putative prepositions that cannot occur by themselves are labeled RB.
    over/IN 10
    along/IN with
    it is up/IN to/IN you
    as/IN though/IN
    about/RB 10
    around/RB 10
    aside/RB from
    away/RB from
    was about/RB to
    came in/RB second/RB
    move around/RB
    inside/IN Namibia
    get close/RB to
    live up/RB to its reputation

2.4 IN or VBG

    including/VBG
    according/VBG to/IN

2.5 IN or WDT

    it wasn't until 1979 that/IN it struck a chord with someone

2.6 JJ or NN

    express/JJ mail
    the outside/JJ world
    anti-trust/JJ authorities
    the latter/JJ
    chief/JJ executive
    human/JJ rights
    mini/JJ disc
    the half/JJ year since X
    authentic as the original/JJ
    middle/JJ age/NN
    at present/JJ
    in brief/JJ
    eastern/JJ Jerusalem
    east/NN Jerusalem
    military/NN action
    October 30th/NN
    executive/NN director
    associate/NN professor
    midlife/NN crisis
    chemical/NN and mechanical/JJ industries

2.7 JJ or NNP

    the American/JJ player Joe
    the Spanish/JJ player Joe
    the Democratic/JJ convention
    Cretaceous/JJ period
    the American/NNP Joe
    the Spaniard/NNP Joe
    pleased Democrat/NNP delegates

2.8 JJ or RB

    stand firm/JJ
    broadcast the matches live/JJ
    released of late/JJ
    late/JJ January
    hold a meeting late/RB this month
    cut short/RB a visit
    the surveillance cameras were on/RB
    get right/RB
    die young/RB
    came in second/RB
    came first/RB

2.9 JJR or RBR

    more/JJR than 30%

2.10 JJ or VBG

    well-meaning/JJ advice
    a losing/JJ score
    fighting/JJ spirit
    the missing/JJ include
    award/NN -/HYPH winning/VBG
    top/RB -/HYPH grossing/VBG
    ruling/VBG party
    best/RBS -/HYPH selling/VBG
    the finishing/VBG touches
    disputing/VBG parties
    opening/VBG arguments
    the stimulating/VBG performance

2.11 JJ or VBN

    refined/JJ rice
    the mixed/JJ economy
    mixed/JJ ownership economy
    employed/JJ people
    extroverted/JJ agriculture
    specialized/JJ accessory factories
    developed/JJ countries
    advanced/JJ technology
    are concerned/JJ about
    handheld/JJ device
    known/JJ information
    involved/JJ parties
    sunken/JJ submarine
    wasted/JJ attacks
    the injured/JJ
    crisis-ridden/JJ
    torch-shaped/JJ stripes
    self-motivated/JJ
    well-known/JJ
    so-called/JJ
    middle-aged/JJ people
    feeling depressed/JJ
    world-renowned/JJ
    well-attended/JJ ceremony
    ready-made/JJ tools
    human-faced/JJ jar
    square-patterned/JJ pottery jar
    finished/JJ reproductions
    round-faced/JJ
    three-legged/JJ
    a blown/VBN glass factory
    bonded/VBN area
    the proposed/VBN project
    processed/VBN products
    government run/VBN hotel
    preferred/VBN shares
    booby-trapped/VBN
    revered/VBN religious establishments
    heavily/RB -/HYPH policed/VBN
    liquified/VBN natural gas
    televised/VBN debate
    failed/VBN attempt to assassinate
    to stop Libyan/JJ -/HYPH backed/VBN
    martyred/VBN Palestinian people
    repeated/VBN attacks
    fast/RB -/HYPH paced/VBN
    man/NN -/HYPH made/VBN
    above/RB -/HYPH mentioned/VBN

2.12 NN or NNP

    RMB/NN
    GDP/NN
    Chapter/NN 4 of the US Commercial Code
    more than 120,000 State/NN owned and collective enterprises
    person A/NN
    MP3/NN
    CD/NN
    CD-ROM/NN
    CD-R/NN
    beiguan/NN and nanguan/NN music
    PR/NN manager
    works in R&D/NN
    PhD/NN
    CEO/NN (NNP if part of a title)
    Formula/NNP One/NNP
    Shahab/NNP 5/NNP rocket
    King Mohammed IV/NNP
    Islamic/NNP Revolution/NNP
    the Israeli/NNP Archaeology/NNP department/NN
    Syrian/JJ Army/NNP team
    the Sixth/NNP Arab Premiership Cup in Soccer game hosted by
    NT/NNP $/SYM 100/CD
    Hakka/NNP eight tone, Fujian/NNP ten tone
    Chinese yew (Taxus/NNP mairei/NNP)

2.13 NN or NNS

For items which are ambiguous between NN(P) and NN(P)S, the default is to rely on the surface forms.
    a 2.45 m/NN jump
    economics/NN
    500 yuan/NNS
    sports/NNS channels
    personnel/NNS
    police/NNS said that
    the outskirts/NNS of the town
    headquarters/NNS
    cattle/NNS ranch
    2.45 m/NNS
    foreign affairs/NNS
    savings/NNS and loans/NNS
    communications/NNS
    sales/NNS income

2.14 NNP or NNPS

Proper nouns with plural morphology which take singular verbal agreement are labeled NNP (pace previous policy):

    United States/NNP
    United Nations/NNP
    Emirates/NNP
    Texas Instruments/NNP
    the Philippines/NNP
    Tianjin Customs/NNP has decided
    Foreign Affairs/NNPS Bureau
    since the Middle/NNP Ages/NNPS
    the Olympic Games/NNPS
    Palestinian/NNP Territories/NNPS
    Alshabura Martyrs/NNPS Camp

2.15 NNS or NNPS

    the Seventies/NNS

2.16 NN or RB

    at home/NN
    home/RB game
    away/RB game

2.17 NN or VBN

    the accused/VBN

2.18 NN or VBG

    the shooting/NN
    a firefighting/NN plane
    broadcasting/NN rights
    satellite broadcasting/NN
    the opening/NN ceremony
    operating/NN system
    training/NN times
    fierce fighting/NN
    killed during fighting/NN with the enemy
    a symbolic handing/NN over/RP ceremony
    car racing/NN
    the hijacking/NN operation
    the negotiating/NN table
    international campaigning/NN
    fell to X on closing/NN
    disturbing/VBG statistics
    return to playing/NN
    competing/VBG teams
    rampaging/VBG camel
    a cheering/VBG crowd
    in spite of increasing/VBG acts of violence
    the qualifying/VBG matches
    working/VBG out (exercising)
    bird/NN -/HYPH watching/VBG

2.19 PRP or PRP$

    a factory across the street from his/PRP

2.20 RB or RBR

    no longer/RB

2.21 RB or RP

    carry on/RB
    the storm died down/RB
    to go on/RB until then
    calm down/RB
    live up/RB to its reputation
    play about/RB
    let their hair down/RB
    spring up/RB
    went on/RB to say
    brought about/RP
    give in/RP to pressure
    set up/RP
    sign up/RP for
    come up/RP
    start out/RP
    the plane takes off/RP
    take clothes off/RP
    play down/RP
    a sea mine left over/RP from WWII
    oxygen will run out/RP completely
    flare up/RP
    violence broke out/RP
    find out/RP
    watch out/RP for/IN X
    working out/RP (succeeding)
    mellow out/RP
    turned out/RP to be
    pull out/RP

2.22 VB or VBP

    how come/VB

2.23 Other Decisions:

    real/JJ estate
    pursuant/JJ to law
    grassroots/JJ organisation
    in the lower/JJ picture
    Post Code: 423700/CD
    none/NN other/JJ than/IN
    to air/NN strike
    30/CD -/HYPH some/DT
    prime/JJ minister
    degrees centigrade/JJ
    anthraquinone/NN hydrogen peroxide solution
    let/VB alone do X
    regardless/RB of the outcome of the election
    must comply whatever/WDT the circumstances
    an investigation was underway/JJ
    in/IN absentia/FW
    per/IN capita/FW
    per/IN annum/FW
    etc./FW
    a yuhua/NN stone
    bid farewell/NN

2.24 Decisions On Splitting Tokens:

    twenty-five/CD
    double-digit/JJ unemployment rate
    the post-industrialization/JJ mid-development/JJ stage
    a large, high-speed/JJ V-shaped/JJ rise
    post-processing/JJ polished rice
    start-up/JJ projects
    vice-Premier/NNP
    to be multi-share/JJ holding companies
    self-run/JJ business import and export rights
    help laid-off/JJ workers be re-employed
    re-employment/NN projects/NNS
    mid-September/NNP
    anti-trust/JJ authorities
    anti-sanctions/JJ
    booby-trapped/VBN
    runners-up/NNS
    insofar/RB as
    Palestinian-Israeli/JJ
    top-class/JJ forward
    over-emphasize/VB
    self-confidence/NN
    mini-disks/NNS
    audio-visual/JJ storage medium
    all-out/JJ effort
    the go-ahead/NN
    hands-on/JJ work
    round-robin/JJ system
    most/RBS -/HYPH favored/VBN -/HYPH nation
    exclusively -/HYPH owned
    world/NN -/HYPH class/NN
    Oil/NNP - for/IN - Food/NNP program
    three/CD -/HYPH and/CC -/HYPH a/DT -/HYPH half/NN hours
    in/IN -/HYPH depth/NN
    quarter/NN -/HYPH finals/NNS
    called for a mini/AFX -/HYPH Marshall/NNP Plan
    anti/AFX -/HYPH Sharem/NNP el-Sheikh demonstrator
    mass/NN -/HYPH production/NN
    two/CD -/HYPH thirds/NNS
    general-purpose/JJ clay
    bust/NN -/HYPH length/NN portrait
    then/RB -/HYPH provincial/JJ governor
    aroma/NN -/HYPH therapy/NN
    him/PRP -/HYPH or herself/PRP
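
For readers who want to process the examples mechanically, the word/TAG notation used throughout and the cardinal-number rule (1.2.9, with the exception in 1.2.9.1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of the Treebank tooling: the function names (parse_tagged, unknown_tags, break_cardinal) and the TAGGED pattern are hypothetical, the tag inventory is the standard Penn Treebank word-level set plus HYPH and AFX, and "cardinal number" is approximated as "a hyphen-separated part made up only of digits".

```python
import re

# Penn Treebank word-level tags plus HYPH and AFX (punctuation tags omitted).
KNOWN_TAGS = {
    "CC", "CD", "DT", "EX", "FW", "IN", "JJ", "JJR", "JJS", "LS", "MD",
    "NN", "NNS", "NNP", "NNPS", "PDT", "POS", "PRP", "PRP$", "RB", "RBR",
    "RBS", "RP", "SYM", "TO", "UH", "VB", "VBD", "VBG", "VBN", "VBP",
    "VBZ", "WDT", "WP", "WP$", "WRB", "HYPH", "AFX",
}

# Assumed shape of an annotated item: some text, a slash, an upper-case tag.
TAGGED = re.compile(r"(.+)/([A-Z$]+)")


def parse_tagged(text):
    """Turn 'word/TAG' notation into (word, tag) pairs.

    Untagged context words (e.g. 'years' in '20/CD -/HYPH odd/JJ years')
    are returned with tag None.
    """
    pairs = []
    for item in text.split():
        match = TAGGED.fullmatch(item)
        pairs.append((match.group(1), match.group(2)) if match else (item, None))
    return pairs


def unknown_tags(text):
    """Return tags that are not in KNOWN_TAGS (likely typos in an example)."""
    return [tag for _, tag in parse_tagged(text)
            if tag is not None and tag not in KNOWN_TAGS]


def break_cardinal(token):
    """Approximate rule 1.2.9: break a hyphenated token at every hyphen if
    one of its parts is a number written in digits. Spelled-out numbers
    (1.2.9.1) and all other tokens are returned unbroken."""
    parts = [p for p in re.split(r"(-)", token) if p]  # keep hyphens as parts
    if any(p.isdigit() for p in parts):
        return parts
    return [token]


if __name__ == "__main__":
    print(parse_tagged("20/CD -/HYPH odd/JJ years"))
    print(break_cardinal("16-year-old"))
    print(break_cardinal("twenty-five"))
```

The other rules in section 1.2 depend on part of speech and bracketing, which cannot be read off the surface string, so a sketch like this covers only the one rule that is decidable from the characters alone. A check along the lines of unknown_tags may still be useful for catching mistyped tags in annotated files.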