Creating a Corpus using the Interface, and Compiling it for Sketch Engine

To create a corpus in the interface, log in and go to the home page (if you are already logged in, you can reach this page by clicking Home at the top right of the screen).

The four sections in this page describe:

  • 1) the process of creating a corpus in the Sketch Engine interface
  • 2) adding data to the corpus by uploading a file
  • 3) adding data to the corpus using WebBootCat
  • 4) the functions that are available on a user (your) corpus, including compiling it for Sketch Engine

Sections 1 and 3 are also demonstrated in a video tutorial.

1) Creating a Corpus

Near the top of the left-hand margin, click Create corpus. You are then taken through three steps.

  • Step 1: Fill in the following fields:
    • Corpus id: Give the corpus an id that will be used by Sketch Engine. This must not be the same as any id already in use. The id cannot contain spaces and may consist only of letters, numbers, underscores and hyphens.
    • Corpus name: (optional) the full name that you wish to give your corpus. This will be displayed in the interface.
    • Info: (optional) any useful information pertaining to the corpus.
    • Language: Choose the language for your corpus. This is required to determine how Sketch Engine should process the data.
  • Step 2: Specify the configuration of the corpus by either:
  • Step 3: Select from the available Sketch Grammars or upload a new one.
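
The corpus id rule from Step 1 can be checked before you fill in the form. The following is a minimal sketch, not part of Sketch Engine: the function name is hypothetical, and it assumes that "letters" means ASCII letters and that ids are compared case-sensitively, neither of which this page states.

```python
import re

# Assumption: only ASCII letters count as "letters"; the help page does not
# say whether accented or non-Latin letters are accepted in a corpus id.
CORPUS_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_corpus_id(corpus_id, existing_ids=()):
    """Return True if corpus_id uses only allowed characters and is unused.

    existing_ids is a hypothetical collection of ids that are already taken.
    Assumption: the comparison is case-sensitive.
    """
    if CORPUS_ID_PATTERN.fullmatch(corpus_id) is None:
        # Empty string, spaces, or any other disallowed character
        return False
    return corpus_id not in existing_ids


if __name__ == "__main__":
    taken = {"bnc", "ukwac"}
    for candidate in ("my_corpus-2", "my corpus", "bnc"):
        print(candidate, is_valid_corpus_id(candidate, taken))
```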

The corpus now exists but does not yet contain any data. You can select the corpus (you will already be on its page after the last step) and add data using either:

  • Your own files (see section 2)
  • Data from the Web using WebBootCat (paper published: Proc EAMT 2006, Oslo, Norway) (see section 3)

2) Add a File

If you select Add new file, you have the following options:

  • upload a file from your computer
  • download from a URL
  • upload from somewhere on the Sketch Engine server (you need to FTP the files there first)

Supported file formats include txt, html, pdf, doc and vert. An XML file can also be used if you upload it as plain text, but it should contain only text with structural mark-up (such as document or paragraph boundaries and document metadata). More complex XML will not be processed correctly. Here is a sample of XML text that would be processed correctly:

<xml>
<doc author="Jan" title="Example doc 1">
<p>This is a paragraph.</p>
<p>This is another paragraph.</p>
</doc>
<doc author="Jan" title="Example doc 2">
<p>I will add some more text here.</p>
</doc>
</xml>
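
If your texts are held in another form, a file with this shape can be generated programmatically. The sketch below is one possible way to do so with the Python standard library; the function name and the dictionary keys ("author", "title", "paragraphs") are hypothetical and were chosen only to mirror the sample above.

```python
from xml.sax.saxutils import escape, quoteattr


def build_corpus_xml(documents):
    """Render documents as <doc> elements containing <p> paragraphs.

    Each document is a dict with "author", "title" and "paragraphs" keys
    (hypothetical field names that mirror the sample on this page).
    """
    lines = ["<xml>"]
    for doc in documents:
        # quoteattr() adds the surrounding quotes and escapes the value
        lines.append(
            "<doc author=%s title=%s>"
            % (quoteattr(doc["author"]), quoteattr(doc["title"]))
        )
        for paragraph in doc["paragraphs"]:
            # escape() replaces &, < and > so text cannot break the mark-up
            lines.append("<p>%s</p>" % escape(paragraph))
        lines.append("</doc>")
    lines.append("</xml>")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [
        {
            "author": "Jan",
            "title": "Example doc 1",
            "paragraphs": ["This is a paragraph.", "This is another paragraph."],
        }
    ]
    print(build_corpus_xml(sample))
```

Escaping the paragraph text matters because a stray & or < would otherwise turn simple structural mark-up into XML that may not be processed correctly.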

Note that you can also add multiple files in an archive using formats: .zip, .tar, .tar.gz, and .tar.bz2.
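
As an illustration, many files can be packed into one .tar.gz archive before uploading. This is a minimal sketch using Python's standard tarfile module; the function name and the example file names are hypothetical, and the page does not say how directory structure inside an archive is treated, so the sketch simply stores paths relative to the source directory.

```python
import tarfile
import tempfile
from pathlib import Path


def make_upload_archive(source_dir, archive_path):
    """Pack every regular file under source_dir into a .tar.gz archive.

    Returns the list of archive member names. The archive should be written
    outside source_dir so that it does not try to include itself.
    """
    source = Path(source_dir)
    added = []
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in sorted(source.rglob("*")):
            if path.is_file():
                # Store paths relative to source_dir, with forward slashes
                relative = path.relative_to(source).as_posix()
                archive.add(path, arcname=relative)
                added.append(relative)
    return added


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        texts = Path(tmp) / "texts"
        texts.mkdir()
        (texts / "first.txt").write_text("Some text.", encoding="utf-8")
        (texts / "second.txt").write_text("More text.", encoding="utf-8")
        print(make_upload_archive(texts, Path(tmp) / "upload.tar.gz"))
```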

You need to (re-)compile the corpus after adding one or more files (see below).

3) Add data from the Web with WebBootCat

Provide a name for the collection (this portion of the corpus) and select either seed words or URLs.

There are various advanced options that you can specify:

  • 1) for the search engine that will deliver the results:
    • File type: you can restrict the type of file that can be included
    • CC only: restrict your search to those documents available under the Creative Commons license.
    • Tuple size: the number of your seed words to be combined together for each web query
    • Max tuples: the maximum number of tuples (queries) to be sent to the search engine for finding the list of URLs. There is a limit of 50 on this field.
    • Max URLs per query: the maximum number of URLs to be retrieved per query (tuple). There is a limit of 100 on this field.
    • Sites list: a list of sites which can be used to restrict the search, e.g. co.uk will match all URLs ending in .co.uk
  • 2) Size restrictions: Here you can specify a minimum and maximum for the size of each file to be included (in Kb) and/or for the number of words in each file to be included (punctuation is included in the word count).
  • 3) White list keywords: In this section you provide a list of words which files should contain to be included. Matching is case-sensitive and phrases can be enclosed in quotes e.g. "bread and butter". There are options to specify:
    • the minimum number of instances of any of these keywords that a web page (file) must contain after processing
    • the minimum number of types which a web page (file) must contain. For example, if cherry, banana and apple are white list keywords and cherry appears in a page three times, it counts as only one type.
    • Min keywords ratio: the minimum ratio of keyword instances to non-keyword instances that must occur on a web page for it to be included
  • 4) Black list keywords: Here you can provide a list of words which a web page must not contain if it is to be included. The same options are available as for white list keywords.
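
The three white list thresholds (instances, types and ratio) can be illustrated with a small calculation. This is only a sketch of the arithmetic described above, not WebBootCat's implementation: the function names are hypothetical, tokens are obtained by splitting on whitespace (an assumption), and quoted phrases are not handled.

```python
def white_list_stats(text, keywords):
    """Return (instances, types, ratio) of white list keywords in text.

    Assumption: tokens are whitespace-separated; matching is case-sensitive,
    as stated on this page. Multi-word phrases are not supported here.
    """
    tokens = text.split()
    keyword_set = set(keywords)
    hits = [token for token in tokens if token in keyword_set]
    instances = len(hits)          # every occurrence counts
    types = len(set(hits))         # each distinct keyword counts once
    non_keyword = len(tokens) - instances
    if non_keyword == 0:
        ratio = float("inf") if instances else 0.0
    else:
        ratio = instances / non_keyword
    return instances, types, ratio


def passes_white_list(text, keywords, min_instances=1, min_types=1, min_ratio=0.0):
    """Apply the three thresholds; the default values are arbitrary."""
    instances, types, ratio = white_list_stats(text, keywords)
    return instances >= min_instances and types >= min_types and ratio >= min_ratio


if __name__ == "__main__":
    page = "cherry pie with cherry and cherry jam"
    print(white_list_stats(page, ["cherry", "banana", "apple"]))
```

In the example, cherry occurs three times among seven tokens, giving three instances, one type and a ratio of 3 keyword instances to 4 non-keyword instances.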

When you click OK you are taken to a screen which shows the URLs that are retrieved from each tuple (query) made by combining the set number (tuple size) of the seed words. You can deselect any of these URLs at this point.

WebBootCaT: Found URLs
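
To make the relationship between seed words, tuple size and max tuples concrete, the sketch below builds queries by combining seed words. It assumes that tuples are unordered combinations drawn at random; this page does not describe how WebBootCat actually selects them, and the function name and default values are hypothetical.

```python
import itertools
import random

MAX_TUPLES_LIMIT = 50  # limit on the Max tuples field stated on this page


def build_queries(seed_words, tuple_size=3, max_tuples=10, rng=None):
    """Return up to max_tuples queries, each joining tuple_size seed words.

    Assumption: tuples are random unordered combinations of distinct seeds.
    """
    if max_tuples > MAX_TUPLES_LIMIT:
        raise ValueError("max_tuples cannot exceed %d" % MAX_TUPLES_LIMIT)
    rng = rng or random.Random()
    unique_seeds = list(dict.fromkeys(seed_words))  # drop repeated seeds
    # Listing all combinations is fine for short seed lists only
    combos = list(itertools.combinations(unique_seeds, tuple_size))
    rng.shuffle(combos)
    return [" ".join(combo) for combo in combos[:max_tuples]]


if __name__ == "__main__":
    seeds = ["corpus", "lemma", "concordance", "collocation", "tagger"]
    for query in build_queries(seeds, tuple_size=3, max_tuples=4):
        print(query)
```

With five seed words and a tuple size of 3 there are only ten possible combinations, so asking for more tuples than that returns all ten.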

When you click OK you are then taken to a screen where you see the output of WebBootCat as it retrieves the documents and processes the files. The output shows you when there are problems retrieving files, for example because the URL has blocked automatic retrieval or where the file is not included because it does not match any of the options that you specified.

WebBootCaT: Progress bar

Having retrieved the corpus you should then compile the corpus (see below).

Questions and Answers on Using WebBootCat

4) Corpus options when the corpus has been made

When you have created a corpus there are many tools available to you in the left-hand side panel. Select the corpus by clicking on its name on the home page. Under the Corpus heading in the left-hand side menu you can:

  • Add new file: See section 2 above.
  • Add web data (BootCaT): See section 3 above.
  • Compile corpus
    • Necessary when you have added new data or changed the sketch grammar.
    • At the compile stage, you need to select the xml tags from your files that should be used as structures in your corpus. You also need to specify the structure used for references which will be used to enclose the data from each file that you uploaded. This must be different to any of the other structure names that you have already used in your file. By default this is doc.
    • You also have the check box option to use the program "onion" which will automatically remove duplicate content from your corpus. If you opt to use onion then you can specify which structure the program will consider when removing duplicates (for example, at the document, paragraph or sentence level).
  • Open in SkE: Open your corpus in Sketch Engine. This is an alternative to the magnifying glass icon on the main screen. You need to compile a corpus before you can open it (see above).
  • Extract keywords: This utility can extract keywords from your corpus by comparing it with a reference corpus, such as UkWaC. There are various parameters you can set as well as the reference corpus:
    • the attribute to be used, for example word (form) or lemma
    • you can choose to exclude stop words, provided there is a list for the language (e.g. for English there is a list of closed-class 'function' words such as the, about and of, which you might wish to exclude)
    • you can select words which only include alphanumeric characters and/or those which contain at least one alphabetic character
    • you can specify a minimum word length
    • you can specify a minimum frequency in your corpus for any extracted keyword
    • you can specify a limit on the number of keywords that can be extracted
  • Configure corpus: Configure the corpus either through the interface, which offers a few options, or by selecting Expert mode and editing the corpus configuration file manually (see SkE/CorpusConfig and SkE/Config/FullDoc).
  • Download corpus: Download the corpus as text or in vertical format. Vertical format is useful if you want to retain any of the structures for uploading back into Sketch Engine.
  • Access privileges: Here you can specify access for users or groups (you can define groups of users using the User groups function in the left-hand side menu above the Corpus and Admin options). Access can be granted for:
    • read only (they can view but not change),
    • upload (they can view and add new data) or
    • full (they will have full access and can change the configuration or recompile the corpus as well as add data to it)
  • View logs: You can view the output from compiling the corpus or the WebBootCat program used in creation of the corpus.
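
The Extract keywords parameters listed above can be read as a chain of filters applied to candidate words. The sketch below illustrates only those filters; it does not compare against a reference corpus, because this page does not describe how that scoring is done, so candidates are simply ordered by their frequency in your corpus. The function and parameter names are hypothetical.

```python
from collections import Counter


def candidate_keywords(tokens, stop_words=(), alphanumeric_only=True,
                       require_letter=True, min_length=1, min_frequency=1,
                       limit=100):
    """Filter word candidates using the parameters listed on this page.

    Assumption: ranking is by raw frequency only; the real utility ranks by
    comparison with a reference corpus, which is not reproduced here.
    """
    counts = Counter(tokens)
    stop = set(stop_words)
    kept = []
    for word, freq in counts.items():
        if word in stop:
            continue  # excluded stop word
        if alphanumeric_only and not word.isalnum():
            continue  # contains a non-alphanumeric character
        if require_letter and not any(ch.isalpha() for ch in word):
            continue  # no alphabetic character at all, e.g. "42"
        if len(word) < min_length or freq < min_frequency:
            continue
        kept.append((word, freq))
    # Highest frequency first, ties broken alphabetically
    kept.sort(key=lambda item: (-item[1], item[0]))
    return kept[:limit]


if __name__ == "__main__":
    words = "the cat sat on the mat , the cat 42 e-mail".split()
    print(candidate_keywords(words, stop_words={"the", "on"}, min_length=3))
```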
