Data Collection and Processing

Raw Data Collection

Two types of raw data were collected from four sources: the ACM Digital Library, IEEE Xplore, ProQuest, and LexisNexis. Article counts were retrieved automatically from the ACM Digital Library, IEEE Xplore, and ProQuest, while full-text articles were downloaded manually from LexisNexis. The collection process follows five steps:

Step 1: Identify the new concepts, e.g. "tree map", "cloud computing".

Step 2: Formulate and expand the query, e.g. "tree map" or "tree maps" or "treemap" or "treemaps" or "tree-map" or "tree-maps".

Step 3: Understand the search system.

Step 4: Create the script.

Understand and parse the URL -> generate a new URL -> parse the web page -> output the result.

Step 5: Use the script to collect the article counts (trend data), or manually download the full-text articles.
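Steps 2 and 4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the script used in the study: the endpoint `https://search.example.org/results`, the parameter names `q` and `year`, and the result-count wording matched by `COUNT_PATTERN` are hypothetical placeholders, because each library's real URL scheme and page layout must be worked out in Step 3. Plurals are formed naively by appending "s".

```python
import re
from urllib.parse import urlencode

# Hypothetical search endpoint; every real library has its own URL scheme
# that must be inspected by hand (Step 3).
BASE_URL = "https://search.example.org/results"

# Hypothetical wording of the hit count on a results page.
COUNT_PATTERN = re.compile(r"(\d[\d,]*)\s+results?", re.IGNORECASE)


def expand_query(words):
    """Step 2: build spaced, joined, and hyphenated variants, singular and plural."""
    variants = []
    for sep in (" ", "", "-"):
        stem = sep.join(words)
        variants.extend([stem, stem + "s"])  # naive plural: append "s"
    return variants


def build_url(variants, year):
    """Step 4: generate a search URL for one concept and one year (assumed parameters)."""
    query = " OR ".join(f'"{v}"' for v in variants)
    return BASE_URL + "?" + urlencode({"q": query, "year": year})


def parse_count(html):
    """Step 4: parse a results page and return the article count, or None if absent."""
    match = COUNT_PATTERN.search(html)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


if __name__ == "__main__":
    variants = expand_query(["tree", "map"])
    print(variants)
    print(build_url(variants, 2008))
    # Stand-in for a downloaded page; no network request is made here.
    print(parse_count("<p>Showing 1,234 results for your query</p>"))
```

Calling `build_url` once per year and recording the value returned by `parse_count` would yield one trend series per concept, which is the kind of output Step 5 describes.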

Information Extraction

This study explored Natural Language Processing and Social Computing approaches to automatically extract information from the data collected in the first phase. Two types of information are extracted:

  1. Entities that are related to, and influence, the advancement of innovations. These include academic institutions, companies and other industry organizations, and people such as an organization's decision makers.
  2. Relations between entities and innovations, e.g. the adoption relation.

Two types of Natural Language Processing methods are used to identify entities and relationships:

  1. Named Entity Recognition (NER) methods that automatically identify and extract the names of institutions and people from the text corpus.
  2. Adoption Relation Extraction (ARE) methods that automatically identify adoption relations by detecting adoption announcements in the corpus. To improve on the accuracy of current state-of-the-art approaches, we used crowdsourcing through Social Computing platforms such as Mechanical Turk to identify entity and relation instances that are hard to capture with state-of-the-art methods but easy for humans to identify.
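
The two extraction tasks can be illustrated with a deliberately simple rule-based sketch. It is not the NER or ARE method used in the study, which is not specified here in detail: a small hand-made gazetteer stands in for an NER model, and a list of cue verbs stands in for adoption-announcement detection. The names `ORGANIZATIONS`, `INNOVATIONS`, `ADOPTION_VERBS`, and the example sentence are all hypothetical.

```python
import re

# Hypothetical gazetteers standing in for a trained NER model.
ORGANIZATIONS = ["Acme Corp", "Example University"]
INNOVATIONS = ["cloud computing", "tree map"]

# Hypothetical cue verbs that may signal an adoption announcement.
ADOPTION_VERBS = ["adopts", "adopted", "deploys", "deployed",
                  "implements", "implemented"]


def find_mentions(sentence, names):
    """Return the gazetteer names that occur in the sentence as whole words."""
    found = []
    for name in names:
        pattern = r"\b" + re.escape(name) + r"\b"
        if re.search(pattern, sentence, re.IGNORECASE):
            found.append(name)
    return found


def extract_adoptions(sentence):
    """Return (organization, "adopts", innovation) triples for one sentence."""
    has_cue = any(
        re.search(r"\b" + re.escape(verb) + r"\b", sentence, re.IGNORECASE)
        for verb in ADOPTION_VERBS
    )
    if not has_cue:
        return []
    orgs = find_mentions(sentence, ORGANIZATIONS)
    innovations = find_mentions(sentence, INNOVATIONS)
    # Pair every organization with every innovation in the same sentence.
    return [(org, "adopts", inn) for org in orgs for inn in innovations]


if __name__ == "__main__":
    text = "Acme Corp announced that it has adopted cloud computing."
    print(extract_adoptions(text))
```

Rules of this kind miss paraphrases and mislabel negated or hypothetical statements; such instances are plausible candidates for the crowd-sourced review described above.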

Two types of Social Computing systems are used to gather new entities and relationships and to validate the automatically extracted information:

  1. Mechanical Turk, which helps validate the automatically extracted information.
  2. A STICK wiki system, which helps gather entities, relationships, and their descriptions.
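
One common way to use crowd judgments for validation is majority voting over several workers' answers per extracted item. The text does not say how Mechanical Turk responses were aggregated, so the sketch below is an assumption rather than the study's procedure; the function name `validate_by_majority`, the `min_agreement` threshold, and the example votes are hypothetical.

```python
from collections import Counter


def validate_by_majority(judgments, min_agreement=0.5):
    """Decide whether each extracted item is valid from crowd votes.

    judgments maps an item to a list of boolean worker votes.
    The result maps each item to True, False, or None; None means no label
    won more than min_agreement of the votes, so more votes are needed.
    """
    decisions = {}
    for item, votes in judgments.items():
        if not votes:
            decisions[item] = None
            continue
        label, count = Counter(votes).most_common(1)[0]
        share = count / len(votes)
        decisions[item] = label if share > min_agreement else None
    return decisions


if __name__ == "__main__":
    # Hypothetical worker votes on three automatically extracted relations.
    votes = {
        ("Acme Corp", "adopts", "cloud computing"): [True, True, False],
        ("Example University", "adopts", "tree map"): [True, False],
        ("Acme Corp", "adopts", "tree map"): [False, False, True],
    }
    for item, decision in validate_by_majority(votes).items():
        print(item, decision)
```

Items that come back as `None` are ties; they could be sent out for additional votes or left for manual review.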

Figure 4. Detailed Steps in Data Collection and Processing