Extending DICE

Adding your own web scraper

Adding your own web scraper can range from trivial to quite difficult, depending on the complexity of the resource from which you are acquiring documents. This guide is meant to serve a broad range of purposes, so please make sure that you are not breaking any laws while scraping. Many sites prohibit scraping without exception, and others may require express permission. The authors of DICE encourage you to use discretion and good judgement.

Wikipedia is provided as an example. Example code can be found in <DICE>/database-specific/wikipedia.

To get access to DICE library functions, add an import statement like the following:

import os
import sys
sys.path.insert(0, "%s/../../lib" % os.path.dirname(sys.argv[0]))
import dice

DICE module ~ common functions

dice.associate_document_with_concept(concept, term, document_identifier)

Write the given concept, term, and document_identifier to a CSV file (the pairing document) so that the extract step can keep track of which concepts and terms introduced a document into the dataset.

A document_identifier can appear multiple times in the pairing document.
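The following is a minimal sketch of the idea, not the DICE implementation. The function name associate_sketch, the column order, the file name, and the "wikipedia-Apple" identifier format are all assumptions made for illustration, and the code is written for Python 3. It shows how the same document_identifier can end up on several rows, one per term that led to it.

```python
import csv
import os
import tempfile


def associate_sketch(pairing_path, concept, term, document_identifier):
    """Append one (concept, term, document_identifier) row to a CSV file.

    Hypothetical stand-in for dice.associate_document_with_concept; the
    real function chooses its own file location and column layout.
    """
    with open(pairing_path, "a", newline="") as handle:
        csv.writer(handle).writerow([concept, term, document_identifier])


# Demonstration against a throwaway file (path and values are made up).
pairing_path = os.path.join(tempfile.mkdtemp(), "pairings.csv")
associate_sketch(pairing_path, "fruit", "apple", "wikipedia-Apple")
associate_sketch(pairing_path, "fruit", "pome", "wikipedia-Apple")

# Read the rows back: the same document appears once per term.
with open(pairing_path, newline="") as handle:
    rows = list(csv.reader(handle))
```

Appending rather than rewriting the file means repeated calls for the same document simply accumulate rows.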

dice.check_pairing_doc_exists()

Provides a way to error out if the pairing document is not present.

Returns: True if present, False otherwise

dice.gen_document_terms_map(database_prefix='-')

Generate a map of terms associated with each document.

Params: database_prefix: e.g. "wikipedia-"

This function will look through _DICE_CONCEPT_TERM_DOCUMENT_PAIRINGS to find entries relevant to the database given by database_prefix.

Returns: document_terms: a dictionary
key: document_id
value: list of terms associated with the document
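As a rough illustration of the returned structure, the sketch below builds such a dictionary from in-memory rows. It is not the DICE code: the name document_terms_sketch, the row layout (concept, term, document_id), and the use of a startswith match on the document identifier are assumptions.

```python
from collections import defaultdict


def document_terms_sketch(pairing_rows, database_prefix):
    """Map each matching document_id to the list of terms that found it.

    Hypothetical stand-in for dice.gen_document_terms_map. Assumes a row
    belongs to a database when its document_id starts with the prefix.
    """
    document_terms = defaultdict(list)
    for _concept, term, document_id in pairing_rows:
        if document_id.startswith(database_prefix):
            document_terms[document_id].append(term)
    return dict(document_terms)


# Made-up pairing rows: (concept, term, document_id).
example_rows = [
    ("fruit", "apple", "wikipedia-Apple"),
    ("fruit", "pome", "wikipedia-Apple"),
    ("fruit", "apple", "otherdb-12345"),
]
terms_by_document = document_terms_sketch(example_rows, "wikipedia-")
```

Rows from other databases are skipped, so the result only contains documents for the scraper being written.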

dice.get_concept()

Return the concept for the current run

dice.get_default_delay()

Return how long to wait in between downloading articles

Returns: 1
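One way a scraper might use this value is to pause between downloads. The sketch below is only an illustration: download_all_sketch and its fetch and sleep parameters are hypothetical, and treating the delay as a number of seconds is an assumption. A scraper would pass dice.get_default_delay() as delay_seconds; here the sleep function is replaced with a recorder so the example runs without waiting.

```python
import time


def download_all_sketch(document_ids, fetch, delay_seconds=1, sleep=time.sleep):
    """Fetch each document, pausing between consecutive downloads.

    Hypothetical helper, not part of DICE. `fetch` is any callable that
    takes a document id and returns its content.
    """
    results = {}
    for index, document_id in enumerate(document_ids):
        if index > 0:
            # No pause is needed before the very first request.
            sleep(delay_seconds)
        results[document_id] = fetch(document_id)
    return results


# Demonstration with a fake fetcher and a recorded (not real) sleep.
pauses = []
pages = download_all_sketch(
    ["a", "b", "c"],
    fetch=lambda document_id: "body of " + document_id,
    delay_seconds=1,
    sleep=pauses.append,
)
```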

dice.get_term()

Return the term for the current run

dice.start_NER_server(port=8080)

Calls a wrapper script to start up Stanford NER in the background

dice.url_escape_query(term)

If the given term contains spaces, wrap it in double quotes. Next, run the term through urllib2.quote.

The value returned is useful for submitting an escaped value to a form.
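The two steps described above can be sketched as follows. This is a re-creation from the description, not the DICE source; the name url_escape_query_sketch is hypothetical. The import falls back to urllib2.quote on Python 2 and uses the equivalent urllib.parse.quote on Python 3.

```python
try:
    from urllib.parse import quote  # Python 3 location of quote
except ImportError:
    from urllib2 import quote  # Python 2, as named in the description


def url_escape_query_sketch(term):
    """Quote multi-word terms as a phrase, then percent-encode the result."""
    if " " in term:
        term = '"%s"' % term
    return quote(term)


single = url_escape_query_sketch("python")
phrase = url_escape_query_sketch("new york")
print(single)  # python
print(phrase)  # %22new%20york%22
```

Wrapping the term in double quotes before encoding means a search form receives the words as a single phrase rather than as separate keywords.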