Thursday, December 3, 2015

Hairy Visualization Problems

Modern data visualizations are often tours de force of artistic and graphic design. They can transcend the data and transform into artwork worthy of MoMA. These visualizations often derive their visual impact from the sheer quantity of data they display. And while anyone would rather have this adorning their wall than a delimited data file, the visualization is scarcely more useful for drawing conclusions than the underlying data file.

Mapping the Internet, by Barrett Lyon. Part of MoMA's Architecture and Design collection.
Force-directed networks, one of the D3-iest of D3 layouts, often suffer this fate. As Martin Krzywinski, who describes these visualizations as "hairballs," puts it:
The central drawback of hairball-based visualization is that they cannot be tuned to address a user's specific questions. Implicit in the hairball approach is the assumption that all questions that the user wishes to answer are addressable by the layout algorithm. When this assumption is wrong (as it usually is), the user is left to construct another hairball, based on another layout algorithm, to attempt to answer the unanswered questions. Unfortunately, the set of questions answerable by a hairball is very difficult to determine — no such list exists because of the complex interplay of data and layout.
So how do we reappropriate layout decisions to better answer questions relevant to our data?

  1. Develop a hypothesis
  2. Map the parameters of the hypothesis onto parameters governing the layout
  3. Run the layout algorithm
  4. Determine whether the hypothesis is correct
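
As a rough sketch of steps 1 through 3, suppose the hypothesis is "nodes in the same group cluster together." One way to map that onto the layout is to give same-group links a short, stiff spring and cross-group links a long, loose one. Everything named here is an assumption for illustration: the `nodes` and `links` data, the `group` attribute, and the particular distance and strength values. The functions are plain JavaScript; the commented lines at the end show how they might be handed to d3.layout.force in D3 v3, whose linkDistance and linkStrength setters accept per-link functions.

```javascript
// Step 1 (hypothesis): nodes that share a `group` belong close together.
// Hypothetical data; `group` is the attribute the hypothesis is about.
const nodes = [
  { id: "a", group: "east" },
  { id: "b", group: "east" },
  { id: "c", group: "west" }
];

// Links refer to nodes by array index, the form d3.layout.force accepts.
const links = [
  { source: 0, target: 1 },
  { source: 1, target: 2 }
];

// A link endpoint may still be an index, or already a node object
// (the force layout swaps indices for node references when it starts).
function endpoint(nodes, end) {
  return typeof end === "number" ? nodes[end] : end;
}

function sameGroup(nodes, link) {
  return endpoint(nodes, link.source).group === endpoint(nodes, link.target).group;
}

// Step 2 (mapping): turn the hypothesis into layout parameters.
// The numbers are arbitrary illustrative choices, not recommended defaults.
function linkDistance(nodes, link) {
  return sameGroup(nodes, link) ? 30 : 120; // target link length
}

function linkStrength(nodes, link) {
  return sameGroup(nodes, link) ? 1 : 0.2; // rigidity, kept within [0, 1]
}

// Step 3 (run the layout), assuming D3 v3 is loaded on the page:
// d3.layout.force()
//     .nodes(nodes)
//     .links(links)
//     .linkDistance(function (link) { return linkDistance(nodes, link); })
//     .linkStrength(function (link) { return linkStrength(nodes, link); })
//     .start();
```

Step 4 is then a matter of looking at the result: if the groups still fail to separate even with the layout nudged in their favor, that says something about the hypothesis.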

You'll notice that the common plug-and-chug method of implementing force-directed network graphs skips steps 1 and 2, which makes step 4 exceedingly hard. Remember, absent intervention, d3.layout.force knows only about connectivity, and there's usually more to the story. Look at what the force-directed algorithm will let you do with the United States:
Force-directed layout of the United States, with initial positions seeded (source).
Force-directed layout of the United States, with alternate initial positions. Because the algorithm knows only about connectivity between the states, their relative arrangement is nonsensical. Don't let this happen to your data.

In the next post, we'll discuss how to bring some intelligence to these visualizations so that they can be used to inform, rather than simply to decorate walls.