Questions and Answers on Using WebBootCat
(See also the help page "Add data from the Web with WebBootCat". A video on creating a corpus and using WebBootCat is also available in AVI and OGV formats.)
Q: Roughly how many seeds should be used to reach the 10-million level?
A: The more seeds the better, as this will generate more varied queries. I think you should aim for 70-100 seeds if that is possible in your domain. Please note that you can use multiword expressions such as "kick the bucket" by enclosing them in quotes, and also proper names of different kinds.
Q: When I do my test searches using WebBootCat, I find it difficult to decide what are the optimal settings for my specific objective. Are there any specific gains to be made from manipulating the advanced options settings, i.e. tuple size, URLs, keywords, etc.?
A: Any restrictions on the file size will of course affect how much data is retrieved.
Max tuples is limited to 100; this is the number of queries (tuples, i.e. seed combinations) sent to the search engine. At most 50 URLs are returned per query (tuple). I would use these maximum values to get as much data from your seeds as possible. You can also rely on an iterative process to enlarge the dataset (see below).
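To make the idea of tuples concrete, here is a minimal Python sketch of how seeds might be combined into queries. It is only an illustration: the function name make_queries, its parameters and the random sampling strategy are assumptions, not WebBootCat's actual code. The limits of 100 tuples and 50 URLs per query are the ones mentioned above.

```python
import random
from math import comb

def make_queries(seeds, tuple_size=3, max_tuples=100, rng=None):
    """Build up to max_tuples distinct search queries from random seed combinations.

    Hypothetical helper for illustration; not part of WebBootCat.
    """
    rng = rng or random.Random(0)      # fixed seed so the sketch is repeatable
    seeds = sorted(set(seeds))         # drop duplicate seeds
    # Never ask for more tuples than there are distinct combinations.
    limit = min(max_tuples, comb(len(seeds), tuple_size))
    chosen = set()
    while len(chosen) < limit:
        chosen.add(tuple(sorted(rng.sample(seeds, tuple_size))))
    # Multiword seeds are wrapped in quotes so they are searched as phrases.
    return [" ".join(f'"{s}"' if " " in s else s for s in t) for t in sorted(chosen)]

# Upper bound on URLs from one run at the maximum settings.
MAX_TUPLES, URLS_PER_QUERY = 100, 50
max_urls = MAX_TUPLES * URLS_PER_QUERY   # 5000
```

With 80 seeds and a tuple size of 3 there are far more than 100 possible combinations, so the cap of 100 queries is what limits a single run; this is why more seeds give more varied queries rather than more queries.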
The whitelist keywords can be useful to alleviate ambiguity of the seeds, i.e. you can use some of the unambiguous seed words to make sure there is a good proportion of domain words in each document. Blacklist keywords can also be used to reduce ambiguity (e.g. you might blacklist "party" when collecting a corpus on the environment using seeds which include "green"). I would use the whitelist and blacklist only if you find you are getting irrelevant documents (otherwise it is probably not necessary). If you do want to use them, you can play with the parameters to see what helps, but probably start with the default settings.
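The effect of the whitelist and blacklist can be sketched roughly as follows. This is an assumption about the general idea, not the tool's real filter: the function keep_document and the thresholds min_white_ratio and max_black_count are hypothetical names, and only single-word keywords are handled.

```python
import re

def keep_document(text, whitelist, blacklist,
                  min_white_ratio=0.01, max_black_count=0):
    """Return True if a document looks on-topic (hypothetical filter).

    A document is kept when enough of its tokens are whitelist words
    and it contains no more than max_black_count blacklist words.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return False
    white = sum(t in whitelist for t in tokens)
    black = sum(t in blacklist for t in tokens)
    return white / len(tokens) >= min_white_ratio and black <= max_black_count

whitelist = {"carbon", "emissions", "pollution"}
blacklist = {"party"}
keep_document("Green energy cuts carbon emissions", whitelist, blacklist)   # kept
keep_document("The Green party discussed carbon", whitelist, blacklist)     # rejected
```

Loosening or tightening the two thresholds is the kind of experimenting with parameters suggested above.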
Q: What is the best strategy in order to confine the data to British English only -- just using .uk as my sites list?
A: Yes. To restrict the crawl to .uk domains, add .uk to the site list.
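As a rough illustration of what such a site restriction amounts to, the check below tests whether a URL's host name ends in a listed suffix. The helper in_site_list is hypothetical and says nothing about how WebBootCat applies the site list internally.

```python
from urllib.parse import urlparse

def in_site_list(url, site_list=(".uk",)):
    """Return True if the URL's host ends with one of the listed suffixes."""
    host = urlparse(url).hostname or ""
    return any(host.endswith(suffix) for suffix in site_list)

in_site_list("https://www.bbc.co.uk/news")    # inside the .uk restriction
in_site_list("https://example.com/uk/page")   # outside: only the path mentions uk
```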
Q: How do you extract new seeds for the recrawling process -- and how do you apply them in order to keep the collected data at the 10-million level?
A: To run a new iteration with freshly extracted seed words: on the home page, select the corpus; in the left-hand menu you should see "extract keywords". I would use the default options to begin with and then change them if that makes intuitive sense given your data. You can then select from the extracted keywords as new seed words for WebBootCat. You give each crawl a (collection) name, which becomes a partition within your corpus. You can repeat the process as often as you like, and you can see how much data you have at each stage.
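For readers curious about what keyword extraction does in principle, here is a small sketch that ranks words by how much more frequent they are in your corpus than in a reference corpus. It assumes a "simple maths" style score, (frequency per million in the focus corpus + n) / (frequency per million in the reference corpus + n); the function name, the value of n and the toy data are illustrative assumptions, and the tool's own scoring and options may differ.

```python
from collections import Counter

def extract_keywords(focus_tokens, ref_tokens, n=1.0, top=10):
    """Rank focus-corpus words by a smoothed frequency ratio (illustrative only)."""
    focus, ref = Counter(focus_tokens), Counter(ref_tokens)
    f_total, r_total = len(focus_tokens), len(ref_tokens)

    def per_million(count, total):
        return count * 1_000_000 / total if total else 0.0

    scores = {
        word: (per_million(count, f_total) + n) / (per_million(ref[word], r_total) + n)
        for word, count in focus.items()
    }
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top]

focus = "carbon emissions the the carbon".split()
reference = "the the the party of the".split()
new_seeds = extract_keywords(focus, reference, top=2)
```

The top-ranked words are candidates for the next round of seeds; you would still pick among them by hand, as described above.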
