How to Use DICE

This page outlines a basic procedure for one complete run of the DICE tool, from start to finish. It assumes that you have already followed the installation instructions.

To begin, create a file in the config directory called term-concept-mapping.txt. This should be a comma-separated file in which each line pairs a concept with one of its terms. Fill it with the concepts and terms that pertain to the innovation community you are trying to observe. An example might be the following:

MOOC,MOOC
MOOC,Massive Open Online Course
MOOC,Distance learning
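
To check the mapping file before a run, it can be loaded with a few lines of Python. The following is a minimal sketch, not DICE's own loader: the function name load_term_concept_mapping is hypothetical, and it assumes that each line holds exactly two fields, with the concept first and the term second, as the example above suggests.

```python
import csv
from collections import defaultdict


def load_term_concept_mapping(path):
    """Return a dict mapping each concept to the list of terms listed for it.

    Hypothetical helper for illustration; assumes "concept,term" on each line.
    """
    concept_terms = defaultdict(list)
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            # Skip blank or malformed lines rather than failing the whole load.
            if len(row) != 2:
                continue
            concept, term = (field.strip() for field in row)
            if concept and term:
                concept_terms[concept].append(term)
    return dict(concept_terms)
```

Calling load_term_concept_mapping("config/term-concept-mapping.txt") on the example above would group all three terms under the single concept MOOC, which makes it easy to spot a misplaced comma or a term filed under the wrong concept.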

Next, check the configuration options in bin/dice.py. In particular, make sure that the selected databases and steps match the run you want to perform. A parser for Wikipedia is provided as an example of what a DICE parser should look like. Wikipedia is the default database selected in dice.py, so no additional configuration is necessary to use it. The input and output directories for each step can be left untouched if you wish; they will be created if they are not present.

To run DICE, execute the following from within the program’s root directory:

$ ./bin/dice.py

If a document fails to process during program execution, an error is displayed on-screen and should be investigated. This can happen if, for example, DICE encounters a document that looks unlike any it knows how to handle, or if there are network connectivity issues.

Outline of Processing Steps

The steps, in order of execution, are download, extract, tag, converge-terms, and crowdflower. Additional steps can be added as needed by modifying dice.py. An overview of what each step accomplishes follows.

  • Download: Downloads the raw source material (generally unmodified HTML).
  • Extract: Formats raw source into structured XML. This allows disparate databases and sources to be presented to later steps in a uniform manner.
  • Tag: Runs each document through Stanford NER to tag named entities such as organizations.
  • Converge terms: Detects terms that are synonymous and replaces them with a canonical term. The list of synonyms must be maintained in config/convergence.txt. A single convergence file is shared across all of DICE, although the convergence file to be used can be overridden in dice.py.
  • Crowdflower: Creates a CSV file for upload to Crowdflower. The file contains sample sentences that are presented to workers, who determine which of the detected organizations are legitimate and which are not.
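
The converge-terms step can be pictured as a synonym-to-canonical replacement over each document's text. The sketch below is only an illustration under stated assumptions and is not DICE's implementation: the function name converge_terms is hypothetical, the convergences are passed as an in-memory dict of synonym to canonical term (the on-disk format of the convergence file is not described here), and matching is assumed to be case-insensitive on whole words.

```python
import re


def converge_terms(text, convergences):
    """Replace every known synonym in text with its canonical term.

    convergences maps synonym -> canonical term (hypothetical in-memory form).
    """
    if not convergences:
        return text
    # Try longer synonyms first so a long phrase wins over a phrase it contains.
    synonyms = sorted(convergences, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(s) for s in synonyms) + r")\b",
        re.IGNORECASE,
    )
    # Lower-cased lookup so matches found case-insensitively resolve correctly.
    lookup = {s.lower(): canonical for s, canonical in convergences.items()}
    return pattern.sub(lambda m: lookup[m.group(0).lower()], text)
```

Sorting the synonyms by length keeps a phrase such as "massive open online course" from being partly rewritten by a shorter entry such as "online course" that it happens to contain.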