Sketch Engine

Ticket #80 (new bug report)

Opened 8 months ago

Last modified 2 weeks ago

Bootcat

Reported by: PA Fauconnier-Bank <pf29@st-andrews.ac.uk>
Assigned to: jan
Priority: high
Component: WebBootCaT
Version: stable
Keywords:
Cc:

Description

Hi there,

I have a list of websites which has been checked as working, but bootcat always stalls (bizarrely at different points) and never completes a corpus.

Thanks for any tips/fixes,

P.

Attachments

Change History

03/28/08 21:52:49 changed by jan

Hello, could you please indicate which URLs cause problems and how the list of URLs is obtained, i.e. which seed words and BootCaT options you use?

Thank you, Jan

03/28/08 22:34:35 changed by PA Fauconnier-Bank <pf29@st-andrews.ac.uk>

Hi there.

The URLs are from a Top 100 blog type list. I am trying to set bootcat up to build as big a corpus as possible from the list. Ideally I'd like to just grab them all. Maybe I'm not doing this efficiently.

I appreciate your help and advice.

Philippe.

The URLs have all been run through a link checker to make sure they resolve.

Site list: www.libdemvoice.org www.alexfoster.me.uk www.liberalreview.com liberalengland.blogspot.com peterblack.blogspot.com www.libdemblogs.co.uk www.lynnefeatherstone.org/blog.htm loveandliberty.blogspot.com jockcoats.blogspot.com www.pickledpolitics.com www.northumbrian.org.uk www.willhowells.org.uk/blog www.liberalreview.com/blogs/apollo ericavebury.blogspot.com www.theliberati.net/quaequamblog susannelamido.blogspot.com oxfordliberal.blogspot.com johnhemming.blogspot.com cicerossongs.blogspot.com ballotsballsandbikes.blogspot.com andershanson.wordpress.com alexcolehamilton.iblogs.com andymayer.blogspot.com bullseye-liberaldissenter.blogspot.com paulwalter.blogspot.com www.nickbarlow.com/blog leftleaningpolitics.blogspot.com forcefulandmoderate.blogspot.com liberalbureaucracy.blogspot.com kenowen.blogspot.com www.martintod.org.uk/blog random-incident.journalspace.com eatenbymissionaries.blogspot.com oberon2001.blogspot.com gingeranddynamite.blogspot.com innerwestcentral.blogspot.com pinkdogster.blogspot.com linlithgow-libdems.blogspot.com essexmoonlight.blogspot.com arwenfolkes.blogspot.com www.richardallan.org.uk anngarner.blogspot.com paswonky.blogspot.com cllrrthomas.wordpress.com chrisandglynisabbott.blogspot.com www.eridu.org.uk/blog blog.artesea.co.uk jonathanwallace.blogspot.com paulakeaveney.blogspot.com radders73.blogspot.com www.colin-ross.org.uk woyce.blogspot.com guyburton.blogspot.com crumblehall.blogspot.com richardbaum.blogspot.com blog.biscit.me.uk www.sguy.net liberallegend.blogspot.com www.maryreid.org.uk www.stockton.gov.uk liberalneil.blogspot.com www.paulcrossley.me.uk www.alanmuhammed.co.uk/blog www.simonwright.org.uk/news romseyredhead.blogspot.com frasermacpherson.blogspot.com revsimonwilson.blogspot.com markjohnyoung.spaces.live.com neilwoollcott.blogspot.com joeotten.blogspot.com blog.stodge.org politicalicecream.blogspot.com millenniumelephant.blogspot.com obbfcouncillor.blogspot.com pigeon-post.blogspot.com 
homepage.mac.com/tgarden/iblog/B2067696994/index.html www.wildbard.com/lunartalks.html alan-beddow.blogspot.com republicofhydepark.blogspot.com www.mingcampbell.org.uk www.wycombelibdems.org.uk www.flocktogether.org.uk/blog onlibertyonline.blogspot.com cllrdavidwalker.org/wordp tizzielizzie.blogspot.com andrewjgarner.blogspot.com mindrobber.blogspot.com theliberati.net/drink politsmk.blogspot.com walcot.blogspot.com

Key words (these are entered into the text box as I have been unable to load keyword files)

globalisation globalization individual market agency World bank IMF Economy Liberalism economics neoliberalism state community capitalist free bank investment stock shares trade aid exchange rate percent barter border country government chancellor federal reserve point points FTSE DOW NASDAQ AIM Contract negotiation sweatshop exploitation outsourcing deficit borrows borrowing debt public private partnership stakeholder socio-economics dollar euro pound sterliing monetary kenesyian priming pump policy quarter results report annual profit loss city wallstreet street wall lse commodity oil barrel opec wto g8 davos summit treaty merger acquisition takeover indemnify bonds debt credit crisis black hedgefund sellshort shorting short daytrader daytrade shortsell long futures options leverage vehicle market "market metaphor" "market system" "free market" "laissez faire" marketplace capitalism globalisation globalization economics "financial news" "economics news" "economic analysis" society "current affairs" newspaper

Options: Tag Corpus

Tuple Terms: 2
Max tuples: 50
Max urls: 50

The rest corpus.
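
To make the tuple settings above concrete, here is a minimal sketch of how seed words might be combined into query tuples. It assumes that tuples are random, non-repeating combinations of distinct seed terms; the function name `make_tuples` and its parameters are hypothetical and are not taken from WebBootCaT itself.

```python
import itertools
import random


def make_tuples(seeds, tuple_size=2, max_tuples=50, rng=None):
    """Return up to max_tuples random combinations of tuple_size seed terms.

    Hypothetical helper: an illustration of the "Tuple Terms" and
    "Max tuples" settings, not WebBootCaT's actual implementation.
    """
    rng = rng or random.Random(0)  # fixed seed so the sketch is repeatable
    unique = list(dict.fromkeys(seeds))  # drop repeated seeds, keep order
    combos = list(itertools.combinations(unique, tuple_size))
    rng.shuffle(combos)
    return combos[:max_tuples]


seeds = ["globalisation", "market", "economy", "trade", "debt", "market"]
for query in make_tuples(seeds, tuple_size=2, max_tuples=5):
    print(" ".join(query))
```

Under this assumption, repeated seed words (several appear twice in the keyword list above) add no new tuples, and each tuple would be sent to a search engine as one query.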

04/03/08 11:20:07 changed by jan

Dear Philippe,

please note that you can only build a corpus of 500,000 words with the WebBootCaT unless you have the quota expanded. With 50 tuples and 50 URLs per tuple you aim at a corpus of 2500 documents. That will definitely exceed 500,000 words. Therefore I believe your problem is that you're simply running out of free space in the WebBootCaT.

Best, Jan
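
The arithmetic behind this estimate can be sketched as follows. The figures (50 tuples, 50 URLs per tuple, 500,000-word quota) come from the ticket; the function names are hypothetical, and the sketch assumes every tuple returns its full number of URLs.

```python
def max_documents(max_tuples, max_urls_per_tuple):
    """Upper bound on downloaded documents if every tuple yields all its URLs."""
    return max_tuples * max_urls_per_tuple


def words_per_document_budget(quota_words, documents):
    """Average words each document may contribute before the quota is reached."""
    return quota_words / documents


docs = max_documents(50, 50)
budget = words_per_document_budget(500_000, docs)
print(docs, budget)  # 2500 200.0
```

In other words, the corpus stays within the quota only if the pages average 200 words or fewer, which is unlikely for blog pages.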

04/03/08 11:26:02 changed by jan

As for grabbing content from a list of URLs, that's not currently possible with the WebBootCaT. However, you might have success with the BootCaT toolkit. Jan

04/07/08 14:12:09 changed by vojta

  • owner set to jan.
  • component changed from not specified to WebBootCaT.

Sketch Engine
Bringing Corpora to the Masses

Lexical Computing Ltd

Brought to you by
Lexical Computing Ltd