Changes between Initial Version and Version 1 of SkE/PrepareText

Timestamp: 01/31/08 06:58:59
Author: adam

= Preparing Corpus Text for the Sketch Engine =

The input format is "vertical" or "word-per-line (WPL)" text, as defined at the University of Stuttgart in the 1990s.  Each line contains exactly one token: a word, a number or a punctuation mark.  The file is plain text in a selected character encoding, without any formatting.  For example, the sentence
{{{
Suddenly, however, their posture changed.
}}}
is written in vertical text as
{{{
Suddenly
,
however
,
their
posture
changed
.
}}}
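
As a rough illustration of how running text might be turned into vertical text, here is a minimal Python sketch. The function name `to_vertical` and the regular expression are assumptions made for this example only; a real tokeniser would also need to handle abbreviations, clitics, hyphenation and language-specific rules.

```python
import re


def to_vertical(text):
    """Return the text as one token per line (naive tokenisation)."""
    # \w+ matches a run of letters or digits (a word or a number);
    # [^\w\s] matches a single punctuation mark.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return "\n".join(tokens)


if __name__ == "__main__":
    print(to_vertical("Suddenly, however, their posture changed."))
```

Run on the sentence above, this prints the eight lines shown in the vertical example.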
If the input text is part-of-speech-tagged and lemmatised, we provide two additional tab-separated columns, for tag and lemma, as here (showing tags from the Penn tagset):
{{{
Suddenly	RB	suddenly
<g/>
,	,	,
however	RB	however
<g/>
,	,	,
their	PP$	their
posture	NN	posture
changed	VVD	change
<g/>
.	SENT	.
}}}

The "glue" tag <g/> specifies that there should be no space between two tokens, for example between a word and the punctuation that follows it (in Latin and other Western scripts).
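
To make the effect of the glue tag concrete, the following Python sketch rebuilds running text from vertical lines. The function name `detokenize` is hypothetical, and the sketch assumes the input contains only token lines and <g/> lines (no other structural tags).

```python
def detokenize(vertical):
    """Rebuild running text from vertical lines, honouring <g/> glue tags."""
    pieces = []
    glue = True  # no space is wanted before the very first token
    for line in vertical.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == "<g/>":
            glue = True  # suppress the space before the next token
            continue
        word = line.split("\t")[0].strip()  # first column is the word form
        if not glue:
            pieces.append(" ")
        pieces.append(word)
        glue = False
    return "".join(pieces)


sample = (
    "Suddenly\tRB\tsuddenly\n<g/>\n,\t,\t,\n"
    "however\tRB\thowever\n<g/>\n,\t,\t,\n"
    "their\tPP$\ttheir\nposture\tNN\tposture\n"
    "changed\tVVD\tchange\n<g/>\n.\tSENT\t."
)
print(detokenize(sample))  # Suddenly, however, their posture changed.
```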

Sometimes an attribute has multiple or disjunctive values, for example if the POS-tagger could not decide between classifying a word as a noun (NN) or a lexical verb (VV), or if a word is associated with two grammatical relations.  This can be encoded using a separator character, as specified in the [SkE/CorpusConfig Corpus Configuration] file (attributes MULTIVALUE and MULTISEP), here ";":
{{{
brush	NN;VV	brush
}}}
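
A script that reads such a line could split the multi-valued column on the separator, as in the sketch below. The function name `parse_token_line` and the three-column layout (word, tag, lemma) are assumptions for this example; the separator must match whatever MULTISEP is set to in the corpus configuration.

```python
MULTISEP = ";"  # assumed to match MULTISEP in the corpus configuration


def parse_token_line(line, multisep=MULTISEP):
    """Split a word/tag/lemma line; the tag column may hold several values."""
    word, tag, lemma = (field.strip() for field in line.rstrip("\n").split("\t"))
    return {"word": word, "tag": tag.split(multisep), "lemma": lemma}


print(parse_token_line("brush\tNN;VV\tbrush"))
# {'word': 'brush', 'tag': ['NN', 'VV'], 'lemma': 'brush'}
```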

XML tags are used for structural annotation, including document, sentence or paragraph boundaries, headlines etc., and can have associated attribute-value pairs.  For example:
{{{
<doc id="G10" n=32>
<head type=min>
FEDERAL
CONSTITUTION
<g/>
,
1789
</head>
<p n=1>
"
<g/>
we
the
People
}}}
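
When checking a vertical file before loading it, it can help to separate structural lines from token lines. The Python sketch below is one possible way to do this; the function name `classify` and both regular expressions are assumptions of this example and do not describe how the Sketch Engine itself parses its input. Attribute values are accepted with or without quotes, as in the example above.

```python
import re

# A whole line that looks like <name ...>, </name> or <name/>.
TAG_RE = re.compile(r"^<(/?)([A-Za-z_][\w.-]*)(\s[^<>]*?)?\s*(/?)>$")
# name="value" or name=value
ATTR_RE = re.compile(r'([\w.-]+)=(?:"([^"]*)"|([^\s"]+))')


def classify(line):
    """Return (kind, name, attrs); kind is open, close, empty or token."""
    line = line.strip()
    m = TAG_RE.match(line)
    if m is None:
        # Not a tag: treat the first tab-separated column as the word form.
        return ("token", line.split("\t")[0], {})
    closing, name, attr_text, empty = m.groups()
    if closing:
        return ("close", name, {})
    attrs = {key: quoted or bare
             for key, quoted, bare in ATTR_RE.findall(attr_text or "")}
    return ("empty" if empty else "open", name, attrs)


for example in ['<doc id="G10" n=32>', "<head type=min>", "FEDERAL", "<g/>", "</head>"]:
    print(classify(example))
```

Note that all attribute values are returned as strings, so `n=32` yields `"32"`.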

Any number of attributes can be associated with words.  While the 'standard' ones are lemma and POS-tag, the framework can also be used for thesaurus category, grammatical function, and a number of other varieties of markup.  Sometimes this markup is most suitably associated with a word, and sometimes with a structure such as phrase, sentence or paragraph.  (The choice of encoding affects which searches can easily be made.)  For the special case of text type or 'header' information, see SkE/PrepareHeaders.

=== Navigation ===
 * Up to SkE/PreparingCorpusOverview
 * Headers: SkE/PrepareHeaders
 * [SkE/CorpusConfig Corpus Configuration file]