Case 4: CRM and Web 2.0
This data set of CRM and Web 2.0 clusters from Proquest has four entity types: Products, Concepts, Organizations, and Papers. The edges between Products, Concepts, and Organizations are weighted by the co-occurrence frequency of those entities, calculated from the edges linking each entity to papers. Please note that the organization information is incomplete because of the identifier problem, and that metadata is not available for 3543 papers.
This visualization shows the network of these relationships, with nodes arranged using a force-directed layout. The node for each entity is sized by the number of papers connected to it (how frequently it occurs in Proquest) and colored by entity type (Product, Concept, Organization, or Paper). Papers are hidden because there are too many to show legibly and they are usually poorly connected. The width, color, and transparency of each edge are based on how frequently the two entities co-occur.
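As a rough illustration of how the node sizes and edge weights described above could be derived from the entity-to-paper edges, here is a minimal Python sketch. The function name `build_network`, the `papers` mapping, and the toy data are all hypothetical; the actual pipeline behind the visualization is not described in this case.

```python
from collections import Counter
from itertools import combinations


def build_network(paper_entities):
    """Derive node sizes and co-occurrence edge weights from paper links.

    paper_entities maps a paper id to the entities linked to that paper.
    """
    node_size = Counter()    # entity -> number of papers it appears in
    edge_weight = Counter()  # (entity_a, entity_b) -> papers containing both
    for entities in paper_entities.values():
        unique = sorted(set(entities))  # drop duplicates, fix pair order
        node_size.update(unique)
        edge_weight.update(combinations(unique, 2))
    return node_size, edge_weight


# Hypothetical toy data, not taken from the Proquest set.
papers = {
    "p1": ["Google", "Microsoft", "Yahoo"],
    "p2": ["Google", "Microsoft"],
    "p3": ["Call Center", "Customer Satisfaction"],
}
node_size, edge_weight = build_network(papers)
print(node_size["Google"])                   # 2
print(edge_weight[("Google", "Microsoft")])  # 2
```

Sorting each paper's entities gives every pair a single canonical key, so (Google, Microsoft) and (Microsoft, Google) are counted as the same undirected edge.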
A few key insights stand out. Call Center is the most frequently appearing entity (2649 times). The blue product cluster in the top-middle (Wikipedia, Facebook, YouTube, etc.) shows that these entities co-occur frequently, which makes sense because they are all Web 2.0 sites with user-generated content. The dense Google-Microsoft-Yahoo triad in the top-middle reflects co-occurrence in search engine articles, though the reason for the General Motors-Ford-Customer Satisfaction triad in the middle-right is less obvious. By filtering out all edges below some co-occurrence frequency threshold, say 50, we get the visualization below.
Here we can see the core relationships and insights more clearly, along with some data cleaning problems such as the SEC-Securities and Exchange Commission barbell in the middle-left, where one organization appears as two nodes under different names.
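The edge filtering described above can be sketched in a few lines of Python. The function name `filter_edges` and the weights below are hypothetical, and dropping nodes that lose all their edges is an assumption about how the filtered view is drawn, not something stated in this case.

```python
def filter_edges(edge_weight, threshold=50):
    """Keep edges whose co-occurrence count is at least `threshold`."""
    kept = {pair: w for pair, w in edge_weight.items() if w >= threshold}
    # Assumption: nodes with no surviving edge drop out of the filtered view.
    nodes = {entity for pair in kept for entity in pair}
    return kept, nodes


# Hypothetical weights chosen only to exercise the threshold.
edge_weight = {
    ("Google", "Microsoft"): 120,
    ("Google", "Yahoo"): 75,
    ("Ford", "General Motors"): 50,
    ("Blog", "Wikipedia"): 12,
}
kept, nodes = filter_edges(edge_weight, threshold=50)
print(sorted(kept))
print(sorted(nodes))
```

An edge exactly at the threshold is kept here, since only edges below it are filtered out.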
By automatically clustering the network based on its topology, we can see large clusters of related entities (above). We used the Clauset-Newman-Moore (CNM) algorithm and colored the nodes based on the results. Some groupings make sense at first glance, while others are less clear. This could be because CNM doesn't take edge strength (co-occurrence frequency) into account, and also because the groupings it generates are so large.
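CNM greedily merges groups to maximize modularity, so one way to see how much ignoring edge strength can matter is to score the same partition with and without weights. The sketch below is not the CNM algorithm itself and is not the tool used for the figure; it only computes the modularity score of a given partition. The function name, the toy graph, and its weights are hypothetical.

```python
from collections import defaultdict


def modularity(edges, community_of, use_weights=True):
    """Newman modularity Q of a partition of an undirected graph.

    edges: dict mapping (u, v) pairs (u != v) to a positive weight.
    community_of: dict mapping each node to a community label.
    """
    total = 0.0                    # m: total edge weight
    internal = defaultdict(float)  # weight of edges inside each community
    strength = defaultdict(float)  # summed node strength per community
    for (u, v), w in edges.items():
        w = float(w) if use_weights else 1.0
        total += w
        strength[community_of[u]] += w
        strength[community_of[v]] += w
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += w
    return sum(
        internal[c] / total - (strength[c] / (2 * total)) ** 2
        for c in strength
    )


# Hypothetical graph: two triangles joined by one very strong edge.
edges = {
    ("a", "b"): 1, ("b", "c"): 1, ("a", "c"): 1,
    ("d", "e"): 1, ("e", "f"): 1, ("d", "f"): 1,
    ("c", "d"): 10,
}
partition = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
q_unweighted = modularity(edges, partition, use_weights=False)
q_weighted = modularity(edges, partition, use_weights=True)
print(round(q_unweighted, 3))  # 0.357
print(round(q_weighted, 3))    # -0.125
```

In this toy case the two-triangle split looks good when every edge counts equally, but scores below zero once the strong c-d link is weighted, which is the kind of mismatch that could make topology-only groupings hard to interpret.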
In this image we can see the overall distribution of the entities by node type. The bottom-right figure shows that there are vastly more papers than any other entity type, but the top-right figure shows that papers have very few edges on average. The box-and-whisker plot in the bottom-left shows the high-degree outliers well. For concepts, the outliers (from the top down) are Call Center, Customer Satisfaction, then Blog and Customer Loyalty together. The top organizations are Microsoft, Google, then ATT and Ford together. The only product outlier is Wikipedia.
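For readers who want to reproduce the outlier reading from a box-and-whisker plot, here is a minimal sketch that flags high-degree entities. It assumes the common Tukey convention (points above Q3 + 1.5 × IQR are outliers); the rule used for the plot in this case is not stated. The function name and the degree values are hypothetical.

```python
from statistics import quantiles


def high_outliers(degrees, k=1.5):
    """Return entities whose degree exceeds Q3 + k * IQR, highest first."""
    q1, _, q3 = quantiles(degrees.values(), n=4)
    fence = q3 + k * (q3 - q1)  # upper whisker limit (assumed Tukey rule)
    flagged = [name for name, d in degrees.items() if d > fence]
    return sorted(flagged, key=degrees.get, reverse=True)


# Hypothetical degrees for one node type; not values from the data set.
degrees = {
    "concept_a": 3,
    "concept_b": 4,
    "concept_c": 5,
    "concept_d": 6,
    "concept_e": 7,
    "concept_f": 50,
}
print(high_outliers(degrees))  # ['concept_f']
```

Running this once per node type mirrors the per-type boxes in the plot, since a degree that is extreme for Products may be ordinary for Concepts.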
We can look at the frequency of each entity over time (its trajectory) using line charts for each node type. The Call Center, Blog, Customer Satisfaction, and Customer Loyalty outliers show up clearly in the top-left Concept chart, as does a sharp increase in Social Media in 2008-2009 that may make it an outlier in the future. For Organizations (top-right), we see the Microsoft and Google outliers, but not ATT and Ford. Moreover, while Microsoft has been frequently mentioned for a long time, Google is the new top dog. For Products, we see the Wikipedia outlier clearly, as well as a sharp increase in almost every product considered here after 2006.
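The per-entity trajectories plotted above amount to counting mentions by year. A minimal sketch, assuming each paper mention is available as an (entity, year) pair; the function names and the toy records are hypothetical and do not reflect actual counts from the data set.

```python
from collections import Counter, defaultdict


def trajectories(mentions):
    """Count mentions per entity per year.

    mentions: iterable of (entity, year) pairs, one per paper mention.
    """
    series = defaultdict(Counter)
    for entity, year in mentions:
        series[entity][year] += 1
    return series


def yearly_change(counts, year):
    """Change in mention count from the previous year to `year`."""
    return counts[year] - counts[year - 1]


# Hypothetical mention records, made up for illustration.
mentions = (
    [("Social Media", 2007)] * 1
    + [("Social Media", 2008)] * 3
    + [("Social Media", 2009)] * 6
    + [("Wikipedia", 2006)] * 2
)
series = trajectories(mentions)
print(series["Social Media"][2009])                 # 6
print(yearly_change(series["Social Media"], 2009))  # 3
```

A large positive `yearly_change` is one simple way to spot the kind of sharp rise noted for Social Media before it shows up as a degree outlier.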