Setting up parallel corpora in the Sketch Engine
Data preparation
In the Sketch Engine, parallel corpora work as two (or more) independent corpora. To mark two corpora as aligned (i.e. as forming a parallel corpus), a special structure, align, is required in each of them. The alignment has to be strictly 1:1, i.e. each of the aligned corpora must contain the same number of <align> tags, so that the n-th <align> structure in one corpus corresponds to the n-th <align> structure in the other.
A small example of two source vertical files that are suitable for processing as parallel corpora:
Corpus 1:
<s>
<align>
This
is
the
first
sentence
.
</align>
</s>
<s>
<align>
This
is
the
second
sentence
.
</align>
</s>
Corpus 2:
<s>
<align>
This
is
the
first
sentence
in
corpus
2
.
</align>
</s>
<s>
<align>
This
is
the
second
sentence
in
corpus
2
.
</align>
</s>
Note that the sentence <s> tags are not necessary -- alignment uses only the <align> structure.
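Because the alignment must be strictly 1:1, it can be useful to compare the number of <align> tags in both vertical files before compiling. The following is a minimal sketch in Python, not part of the Sketch Engine; the function names are hypothetical, and it simply counts opening <align> tags in the text of each file.

```python
import re

# Matches an opening <align> tag, with or without attributes.
# It does not match the closing </align> tag or tags such as <aligned>.
ALIGN_OPEN = re.compile(r"<align(?:\s[^>]*)?>")


def count_align_tags(vertical_text):
    """Return the number of opening <align> tags in a vertical file's text."""
    return len(ALIGN_OPEN.findall(vertical_text))


def is_alignable(vertical_text_1, vertical_text_2):
    """Return True if both texts contain the same number of <align> tags."""
    return count_align_tags(vertical_text_1) == count_align_tags(vertical_text_2)


if __name__ == "__main__":
    corpus_1 = "<s>\n<align>\nThis\nis\n.\n</align>\n</s>\n"
    corpus_2 = "<s>\n<align>\nThis\nis\nin\ncorpus\n2\n.\n</align>\n</s>\n"
    print(count_align_tags(corpus_1), count_align_tags(corpus_2))
    print(is_alignable(corpus_1, corpus_2))
```

An equal count is only a necessary condition: the check cannot tell whether the n-th segments really are translations of each other.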
If your corpora are already aligned 1:1 on sentences (or other structures), you do not need to add the align structure to the vertical file and recompile the corpus: the align structure can be derived from an existing structure. Users cannot do this themselves, but feel free to contact us and we will add the align structure to your corpora.
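If you would rather add the align structure to your vertical files yourself before compiling, the idea of deriving it from another structure can be sketched as follows. This is only an illustration in Python under stated assumptions, not the procedure the Sketch Engine uses internally: the function name is hypothetical, and it assumes each structure tag stands on its own line and that the source structure (here <s>) is already aligned 1:1 across the corpora.

```python
def add_align_from_structure(lines, structure="s"):
    """Wrap the content of every <structure>...</structure> in <align>...</align>.

    `lines` is a list of vertical-file lines without trailing newlines.
    """
    open_bare = "<" + structure + ">"
    open_with_attrs = "<" + structure + " "
    close_tag = "</" + structure + ">"
    out = []
    for line in lines:
        stripped = line.strip()
        if stripped == open_bare or stripped.startswith(open_with_attrs):
            # Opening tag of the source structure: open <align> right after it.
            out.append(line)
            out.append("<align>")
        elif stripped == close_tag:
            # Closing tag of the source structure: close </align> right before it.
            out.append("</align>")
            out.append(line)
        else:
            out.append(line)
    return out


if __name__ == "__main__":
    source = ["<s>", "This", "is", "the", "first", "sentence", ".", "</s>"]
    print("\n".join(add_align_from_structure(source)))
```

The corpus still has to be recompiled after the vertical file changes, and the align structure has to be declared in the corpus configuration.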
Changes in corpus configuration
Two new lines need to be added to the corpus configuration file of each of the aligned corpora. The first one is the declaration of the align structure:
STRUCTURE align
The second line is the list of IDs of all corpora that are aligned with the corpus:
ALIGNED "aligned_corpus_id_1,aligned_corpus_id_2"
With these settings, the Sketch Engine will find the aligned corpora and will be able to display parallel results.
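When more than two corpora are aligned, each configuration file has to list all the other corpora, which is easy to get wrong by hand. The following minimal Python sketch generates the two lines for every corpus in a group. The function names and the corpus IDs (corp_en, corp_de, corp_fr) are hypothetical; only the STRUCTURE and ALIGNED lines themselves come from the description above.

```python
def alignment_config_lines(aligned_ids):
    """Return the two configuration lines for a corpus aligned with `aligned_ids`."""
    ids = [corpus_id.strip() for corpus_id in aligned_ids if corpus_id.strip()]
    if not ids:
        raise ValueError("at least one aligned corpus ID is required")
    return ["STRUCTURE align", 'ALIGNED "' + ",".join(ids) + '"']


def config_for_group(corpus_ids):
    """Map each corpus ID to its configuration lines, listing all the other corpora."""
    return {
        corpus_id: alignment_config_lines(
            [other for other in corpus_ids if other != corpus_id]
        )
        for corpus_id in corpus_ids
    }


if __name__ == "__main__":
    # Hypothetical corpus IDs, used for illustration only.
    for corpus_id, lines in config_for_group(["corp_en", "corp_de", "corp_fr"]).items():
        print("# add to the configuration file of " + corpus_id)
        print("\n".join(lines))
```

The generated lines still need to be added to each configuration file manually.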
Defining parallel corpora using the web interface
Parallel corpora can also be set up using the web interface without manually editing the corpus configuration file. This can be done as follows:
- After logging into the Sketch Engine, open the desired corpus from the "My corpora" list.
- Click on "Configure corpus" in the sidebar.
- Use the ALIGNED field to select the parallel corpus.
- Save the form.
You may then need to do the same for the corpus selected as ALIGNED. Note that configuring more than two aligned corpora is also possible, but this can only be done in the expert mode by manually editing the configuration files of the parallel corpora (see above).
Note
Please note that working with parallel corpora is one of the new features in the Sketch Engine and not many users use it yet. Unexpected problems may therefore occur. If anything does not work properly, feel free to contact us and we will solve the problem.
