Crawling for Corpora

Are you looking for a large, plain-text English corpus for training language models, but unwilling to cough up $6,000 for the English Gigaword corpus? One option is to build your own with web crawlers. In this post I describe my process for building a quick, free crawler of dubious quality and almost no reliability. Note that this crawler's purpose is to extract plain text from web pages; it follows links only to find new pages to crawl. It is not intended for building a search index.

Prepare the server

I use a Linode 512 virtual server running Ubuntu. On it I installed and configured a MySQL server, along with MySQL-Python (MySQLdb) for connecting to the DB from a Python script.
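
The crawler below writes each page's URL and extracted text into a documents table with url and html columns. Here is a minimal one-time setup sketch, assuming MySQLdb is installed; the id column and the types and sizes are my assumptions, not something the crawler requires beyond those two column names:

# -*- coding: utf-8 -*-
# One-time setup: create the documents table the crawler writes to.
# The url and html column names match the crawler's INSERT statement;
# the types and sizes are assumptions.
import MySQLdb as mdb

con = mdb.connect('localhost', 'yourUserName', 'yourPassword', 'yourDB')
cur = con.cursor()
with con:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS documents (
            id INT AUTO_INCREMENT PRIMARY KEY,
            url VARCHAR(2048) NOT NULL,
            html MEDIUMTEXT
        ) DEFAULT CHARACTER SET utf8
    """)
con.close()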

Write the crawler

The crawler should connect to the MySQL DB, request and open the initial page specified, extract links and plain-text from the page, save the URL and the plain-text to the DB if the URL does not already exist, and add the extracted links to a list of pages to crawl.

Some notes:

  • In the request’s headers, I set the User-Agent to Mozilla/5 (Solaris 10) Gecko. (I am not running Solaris 10.) Some sites, including Project Gutenberg, reject requests from robots; setting the User-Agent can trick some of these sites into serving your request. (This is a feeble trick, equivalent to the robot wearing a name tag that says ‘Hello, my name is I am Not a Robot’. Sites that use JS to check for mouse movement, or to check whether a small image file has been downloaded, will not be fooled.)
  • BeautifulSoup is used to extract links and plain-text (here I am only extracting plain-text from paragraphs; this does not capture all of the page’s visible text, but I am not interested in text from navigation, etc.).
# -*- coding: utf-8 -*-

import urllib2, time
from BeautifulSoup import *
from urlparse import urljoin
import MySQLdb as mdb

class Timer:

    def __enter__(self):
        self.start = time.clock()
        return self

    def __exit__(self, *args):
        self.end = time.clock()
        self.interval = self.end - self.start

class Crawler():

    def crawl(self, pages, depth=2):
        # Breadth-first crawl: index each page in pages, collect its
        # outbound links, and repeat for depth passes
        with Timer() as t:
            startSize = self.getIndexSize()
            for i in range(depth):
                newPages = set()
                for address in pages:
                    try:
                        request = urllib2.Request(address)
                        request.add_header('User-Agent', 'Mozilla/5 (Solaris 10) Gecko')
                        opener = urllib2.build_opener()
                        feeddata = opener.open(request).read()
                    except Exception:
                        # Skip pages that cannot be fetched or read
                        #print 'Could not open %s' % address
                        continue
                    soup = BeautifulSoup(feeddata)
                    self.addToIndex(address, soup)    
                    links = soup('a')
                    for link in links:
                        if('href' in dict(link.attrs)):
                            url = urljoin(address, link['href'])
                            if url.find(' ') != -1:
                                continue
                            url = url.split('#')[0]
                            if url[0:4] == 'http' and not self.isIndexed(url):
                                newPages.add(url)
                # Crawl the newly discovered links on the next pass
                pages = newPages
                #print 'New pages: %s' % pages
        currentSize = self.getIndexSize()
        print('Added %s documents in %.03f seconds. Index size is: %s\n' % (currentSize-startSize, t.interval, currentSize))

    def getText(self, html):
        # Unused helper: collects visible text nodes, skipping non-visible tags
        def visible(element):
            return element.parent.name not in ('style', 'script', 'head', 'title')
        text = html.findAll(text=True)
        visible_texts = filter(visible, text)
        return visible_texts

    def addToIndex(self, address, soup):
        # Join the text of every <p> tag and save it with the page's URL
        cur = con.cursor()
        lines = [x.findAll(text=True) for x in soup.findAll("p")]
        lines = [u''.join(line) for line in lines]
        text = u'\n'.join(lines).strip().encode('utf-8')
        param = {
            'url': address,
            'html': text
        }     
        #print 'html: %s' % text
        #print 'type: %s' % type(text)
        with con:
            cur.execute("""
            insert into documents(url, html) values(%(url)s, %(html)s)""",
            param)

    def isIndexed(self, url):
        # Fetches every indexed URL to check for one; fine for a small
        # index, but slow as the table grows
        cur = con.cursor(mdb.cursors.DictCursor)
        with con:
            cur.execute("SELECT url FROM documents")
            rows = cur.fetchall()
            indexedUrls = [row['url'] for row in rows]
            return url in indexedUrls

    def getIndexSize(self):
        cur = con.cursor()
        with con:
            cur.execute("select count(*) from documents")
            # fetchone() returns a 1-tuple; return the count itself
            documentsCount = cur.fetchone()[0]
            return documentsCount

    def __del__(self):
        con.close()

    def __init__(self, dbname):
        global con
        con = mdb.connect('localhost', 'yourUserName', 'yourPassword', dbname);
        print 'Starting crawler. Connecting to %s' % dbname

Run the crawler

I also wrote several scripts to run the crawler:

# -*- coding: utf-8 -*-
import crawler

pages = ['http://www.gavinmhackeling.com/blog']
crawler = crawler.Crawler('yourDB')
crawler.crawl(pages)

This specifies the page to start crawling from and the DB to connect to. On my Linode server, I run about three of these, each with a different list of pages to start from (and to draw from if the crawler reads all of the pages in a clique). I execute them with the following command, which runs the script in the background, keeps it running after I log out, and discards the output:

nohup python script.py > /dev/null &
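
Once the crawlers have run for a while, the corpus itself is just the html column of the documents table. Below is a minimal sketch for dumping it to a single plain-text file; the output file name is my own choice, and it assumes the same documents table and credentials used above:

# -*- coding: utf-8 -*-
# Export the collected text to one plain-text corpus file.
# Assumes the same documents(url, html) table the crawler fills.
import MySQLdb as mdb

con = mdb.connect('localhost', 'yourUserName', 'yourPassword', 'yourDB')
# For a very large table, mdb.cursors.SSCursor streams rows instead of
# loading them all into memory at once
cur = con.cursor()
cur.execute("SELECT html FROM documents")

with open('corpus.txt', 'w') as corpus:
    row = cur.fetchone()
    while row is not None:
        text = row[0]
        if text:
            corpus.write(text)
            corpus.write('\n\n')
        row = cur.fetchone()

con.close()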

 
