Dataviz book

My book Data Visualization: charts, maps and interactive graphics was published in 2018 as part of the series "Statistical Reasoning in Science and Society," a collaboration between CRC Press and the American Statistical Association.

There are several good books available on dataviz, so I wanted to make sure this one offers readers something different. Most contemporary dataviz books come from a design or journalism background, and there hasn't been an all-round statistical one since William Cleveland's in 1993. Mine covers many statistical techniques and many visualisation formats. I argue that, to make great visual communications of analytical work, one has to consider the graphical and statistical elements together. The analysis informs good visualisation, and the visualisation informs good analysis.

It is short, affordable and accessible to all: no algebra. There is a series of short chapters, each focused on one specific task, such as time trends, maps or interactivity. You can read a chapter in a coffee break, so you can easily start to build your awareness of dataviz.

Click here to view the Table Of Contents.

I have three audiences in mind:

  • The data analyst who was never taught about good visualisation and wants to expand their skills
  • The budding dataviz designer, whether you are at high school and considering careers, or coming to the end of studies in statistics, design or web development
  • The boss who has to hire, or commission, someone to make dataviz for his/her organisation, monitor their progress and be assured of good results

I wrote an article called Calculate And Communicate about why dataviz is a great field to work in, and why I needed to write yet another book on the subject. This is in Significance magazine's December 2018 issue.

The making of...

Here's where you can find all the code I used to generate my own images in the book. There are also some blog posts expanding on various images from the book. The chapters' code is being added right now and should all be available by the end of 2018. There is a corresponding GitHub repository for those of you who like that kind of thing.


Table of contents

Click here for a detailed table of contents in PDF

    Section I: The basics
    1. Why visualize?
    2. Translating numbers to images
    Section II: Statistical building blocks
    3. Continuous and discrete numbers
    4. Percentages and risks
    5. Showing data or statistics
    6. Differences, ratios, correlations
    Section III: Specific tasks
    7. Visual perception and the brain
    8. Showing uncertainty
    9. Time trends
    10. Statistical predictive models
    11. Machine learning techniques
    12. Many variables
    13. Maps and networks
    14. Interactivity
    15. Big data
    16. Wrapping up the package in a report, dashboard or presentation
    Section IV: Closing remarks
    17. Some overarching ideas

Chapter 1: Why visualize?

Fig 1.1 was manually annotated onto a scatter plot of train delays (but not the same dataset that we use in the rest of the book) in Inkscape. I only saved the raster image, like an idiot (this is the only such instance in the whole book! always keep an SVG of everything!). Still, you can look at the unannotated version (generated from Stata) here.

Fig 1.2 was a screenshot from DrawMyData, a little interactive webpage I made a while back, taken by Alberto Cairo after he drew the Datasaurus, and kindly shared with me. The Datasaurus became his most popular blog post ever and got incorporated into a lot of other people's work, a small minority of whom acknowledged poor old Alberto. The most ingenious of these was Autodesk Research, who got my methodviz of 2017 award for it.

Fig 1.3 was made in R with some pretty standard and ugly base code. See Ch1.R; the SVG file is here.

Figs 1.6-1.9 were made, as you can see in the book, with paper, tracing paper, acetate transparency, coloured pencils, sticky tape and fine nib whiteboard marker pens. The map is by Google and the data are based on (but not accurate to) that from the book "Statistics: unlocking the power of data" by Lock, Lock, Lock, Lock and Lock. I recommend this book for beginners in statistics, by the way. You can also find the dataset on their website StatKey.

Fig 1.10 was made by creating the map in Mapbox, saving it as a raster, importing it into Inkscape so it became the background to an SVG file, and then adding the other stuff (rings, bars, text, dotted lines) on top. I got the basic layout right in Inkscape then added the bars, circles and lines in the text editor because it was easy to get everything lined up, and to keep the code simple. The SVG file is here.

Figs 1.4 & 1.5 were made by other people.

Chapter 2: Translating numbers to images

Figs 2.1-2.6 are generated from base R graphics (see traindelays-fourweeks.R). Because they all show variations on the same plot, there is one function defined at the beginning, with some options, such as whether to show markers or lines, and this is then called repeatedly inside png() and svglite::svglite() graphics devices. This ensures consistency in things like margins and labeling, because they can be changed in just one place and affect all the graphs generated from that function.
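A minimal sketch of that one-function pattern, not the code from traindelays-fourweeks.R: the function name draw_delays(), its arguments and the simulated data are all hypothetical. It uses pdf() only because that device is in every R installation; png() or svglite::svglite() would slot in the same way.

```r
# Hypothetical stand-in data: 28 days of made-up delay percentages
set.seed(1)
day <- 1:28
delay <- round(runif(28, min = 2, max = 15), 1)

# One plotting function, so margins and labels are set in a single place
draw_delays <- function(x, y, markers = TRUE, lines = FALSE) {
  plot_type <- if (markers && lines) "b" else if (lines) "l" else "p"
  old_par <- par(mar = c(4, 4, 1, 1))
  on.exit(par(old_par))
  plot(x, y, type = plot_type, pch = 19, las = 1,
       xlab = "Day", ylab = "Trains delayed (%)")
  invisible(plot_type)
}

# Call the same function inside each graphics device
out_dir <- tempdir()

pdf(file.path(out_dir, "delays-scatter.pdf"), width = 7, height = 4)
scatter_type <- draw_delays(day, delay)
dev.off()

pdf(file.path(out_dir, "delays-line.pdf"), width = 7, height = 4)
line_type <- draw_delays(day, delay, markers = FALSE, lines = TRUE)
dev.off()
```

Changing the margins or axis labels inside draw_delays() then changes every file produced from it.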

The data was assembled from a few different Excel spreadsheets published by the Office of the Rail Regulator (now, the Office of Rail and Road). You can get my cleaned up version in trains.dta, which is Stata format and can be converted as you please via R with haven::read_dta().
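One possible way to do that conversion, as a sketch: it assumes the haven package is installed and that trains.dta has been downloaded to the working directory. The helper name dta_to_csv() and the output file name are hypothetical.

```r
# Convert a Stata .dta file to CSV (needs the haven package)
dta_to_csv <- function(dta_path, csv_path) {
  stopifnot(file.exists(dta_path))
  trains <- haven::read_dta(dta_path)
  # Drop Stata value labels so every column is a plain R vector
  trains <- as.data.frame(haven::zap_labels(trains))
  write.csv(trains, csv_path, row.names = FALSE)
  invisible(csv_path)
}

# Only runs if the file is present and haven is available
if (file.exists("trains.dta") && requireNamespace("haven", quietly = TRUE)) {
  dta_to_csv("trains.dta", "trains.csv")
}
```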

Figure 2.1 is intended to be a basic scatter plot (the SVG file is here), and Figure 2.2 a basic line chart of the same data (the SVG file is here). I like this time series and how it has served several different purposes in the book. Here, the occasional spikes in the line chart show that a line chart emphasises outliers by putting more ink on the page for them. Sometimes, that is what you want to do, but sometimes not.

Figure 2.3 is about highlighting (and overloading the reader), so it is the same as 2.2 but with extra markers or rectangles added in R. The SVG file is here. Figure 2.4 changes the encoding of variables to axes (the SVG file is here), and Figure 2.5 adds color coding to Figure 2.4 (the SVG file is here).

Figs 2.3-2.5 are pairs of images put together into one in Inkscape. Nothing else was changed. Since I did that, the R package multipanelfigure came along. That could have saved me a lot of time.

The images in Table 2.1 and Figure 2.7 were just typed into a text editor. If you look at the SVG files 2-different-lines.svg and the various files beginning 2-table1... on GitHub, you'll see how simple the SVG can be to read and write, once you understand a few basics.

Figure 2.8 is made in Inkscape. I downloaded an SVG train icon from Wikimedia Commons (I chose a particularly stylized one), and repeated it in rows, cropping one, and then adding text. The SVG file is here.

Figure 2.12 was just SVG typed into a text editor; the SVG file is here.

Figure 2.13 was generated as repeated lines of text in R (see Ch2.R) that made up an SVG waffle. The SVG file is here, and if you look at it in a text editor, you'll see how each line is written out, changing only the x and y coordinates as it goes along.

Figures 2.6 and 2.9-2.11 are made by other people.

Chapter 3: Continuous and discrete numbers

All the R code is in Ch3.R.

Figure 3.1's strip chart or dot plot was generated with the R function stripchart(). I wrapped that in a function drawstripchart() so that parameters would only need to be specified in one place, making sure that the SVG and PNG versions match. The SVG file is here.

The histograms in Figure 3.2 were generated separately for 10 bins, 20 bins, and 50 bins, combined in Inkscape (the SVG file is here), and the common titles were lined up there.
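A sketch of drawing the same data at those three bin counts in base R, using made-up data and a hypothetical helper hist_with_bins(). Here par(mfrow) puts the panels side by side, whereas the book's figure was assembled in Inkscape. Note that hist() treats a single number given to breaks as a suggestion only, so the helper passes explicit break points.

```r
set.seed(42)
delays <- rgamma(500, shape = 2, rate = 0.4)   # made-up skewed data

# Explicit, equally spaced break points give exactly n_bins bins
hist_with_bins <- function(x, n_bins, ...) {
  breaks <- seq(min(x), max(x), length.out = n_bins + 1)
  hist(x, breaks = breaks, ...)
}

pdf(NULL)                      # null device: draws without writing a file
par(mfrow = c(1, 3))           # three panels side by side
bin_counts <- c(10, 20, 50)
hists <- lapply(bin_counts, function(n) {
  hist_with_bins(delays, n, main = paste(n, "bins"), xlab = "Delay")
})
dev.off()
```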

The kernel density plots in Figures 3.3 and 3.4 use standard base R functions, with default Gaussian kernels and bandwidth set to the all-round recommended SJ option. The SVG files are here and here.
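In base R, those settings correspond to a call like the one below. This is a sketch on made-up data, not the code from Ch3.R.

```r
set.seed(42)
delays <- rgamma(500, shape = 2, rate = 0.4)   # made-up skewed data

# Gaussian kernel (the default) with the Sheather-Jones bandwidth
dens <- density(delays, bw = "SJ", kernel = "gaussian")

pdf(NULL)                      # null device: draws without writing a file
plot(dens, main = "", xlab = "Delay", las = 1)
rug(delays)                    # tick marks showing the raw observations
dev.off()
```

The chosen bandwidth is stored in dens$bw, which is handy when you want to report it or reuse it for another group.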

Figure 3.5 was combined in Inkscape from three plots generated by ggplot2::ggplot(). The plot with two histograms used geom_histogram and facets by city. The plot with superimposed kernel densities used geom_density with color set by city. The heatmap used geom_tile and encoded city to the vertical axis. These reflect three different ways of practically comparing two groups in your data within ggplot2. There are SVG files for the histogram, the kernel density, and the heatmap, as well as the combined final version.
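The three approaches can be sketched as follows, assuming ggplot2 is installed. The data frame, the city names and the one-unit bin width for the heatmap are all made up for illustration and are not taken from Ch3.R.

```r
library(ggplot2)

set.seed(1)
# Made-up delays for two hypothetical cities
d <- data.frame(
  city  = rep(c("City A", "City B"), each = 300),
  delay = c(rgamma(300, shape = 2, rate = 0.5),
            rgamma(300, shape = 3, rate = 0.5))
)

# 1. Histograms in facets, one panel per city
p_hist <- ggplot(d, aes(x = delay)) +
  geom_histogram(bins = 20) +
  facet_wrap(~ city, ncol = 1)

# 2. Superimposed kernel densities, colored by city
p_dens <- ggplot(d, aes(x = delay, color = city)) +
  geom_density()

# 3. Heatmap: bin the delays first, then draw one row of tiles per city
d$bin <- cut(d$delay, breaks = seq(0, ceiling(max(d$delay)), by = 1),
             include.lowest = TRUE)
counts <- as.data.frame(table(city = d$city, bin = d$bin))
p_heat <- ggplot(counts, aes(x = bin, y = city, fill = Freq)) +
  geom_tile()
```

Printing p_hist, p_dens or p_heat draws each plot in turn.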

Figure 3.6 is a straightforward violin plot; the SVG file is here.

Figure 3.8 is composed of four individual line charts with a particular quartile highlighted each time. The vertical bars in grey and red are added manually in Inkscape and the plots put together into one image; the SVG file is here, and the individual plots are named 3-small-change-... This is an example of the sort of composite dataviz that you could either create in one pass from complex code, or put together from simpler plots with some manual editing. Often, the choice is driven by whether you are likely to use the code again with new data -- which would argue for complex code without manual intervention.

Figure 3.9 is the sort of composite plot where functions have already been written to make their creation simple for you (I used ggscatterhist()). The SVG file is here.

Similarly, Figure 3.10 comprises a hexagonal bin plot and a contour plot, both of which can be obtained quite simply in R. There are SVG files for the hexbin, the contour, and the combined image.

Figures 3.11 and 3.12 are simple base R plots, except that the right part of Figure 3.12 is actually faked (R does not, except perhaps in foolish user-written packages, permit axis-breaking) by plotting the last point at y=26, then adding the pair of diagonal break lines and changing the label values below them on the y-axis manually in Inkscape. There are SVG files for Figure 3.11 and Figure 3.12.
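The same fake can also be attempted entirely in code. The sketch below is an alternative to the Inkscape route described above, not how the book's figure was made: the data, the cut-off of 22 and the positions of the break marks are all made up.

```r
# Made-up values: nine ordinary points and one far-off outlier
y_true <- c(12, 15, 14, 18, 16, 20, 17, 19, 21, 95)
x <- seq_along(y_true)

# Squash the outlier: anything above the cut-off is drawn at y = 26
break_at <- 22
y_drawn <- ifelse(y_true > break_at, 26, y_true)

pdf(NULL)                      # null device: draws without writing a file
plot(x, y_drawn, ylim = c(10, 28), pch = 19, yaxt = "n",
     xlab = "Observation", ylab = "Value")

# Ordinary tick labels below the break, the true value above it
tick_at <- c(10, 15, 20, 26)
tick_labels <- c(10, 15, 20, max(y_true))
axis(2, at = tick_at, labels = tick_labels, las = 1)

# Pair of short diagonal lines across the y-axis to mark the break
left <- par("usr")[1]
dx <- 0.02 * diff(par("usr")[1:2])
segments(left - dx, c(22.6, 23.4), left + dx, c(23.0, 23.8), xpd = TRUE)
dev.off()
```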

Figure 3.7 was not made by me, but by Henrik Lindberg, who does a lot of other cool dataviz stuff. There have been many charts along these lines, most famously the album cover mentioned in the caption, but this one got a lot of attention when I was writing the book, is nicely designed, revived the format, and is quite accessible and interesting in itself.

Chapter 4: Percentages and risks

All the R code is in Ch4.R and the Stata code in

Figure 4.1 started as a bar chart in Stata. I opened this in Inkscape, copied everything to appear again alongside, then replaced the bar with a series of rectangles and added annotation down the side. The three colors were generated as ternary colors that stand out clearly together, though I then chose to brighten up the red somewhat. That red originally came from a leaf in my local park, sampled with the ColorGrab app on my phone.

Figure 4.2 is a very simple waffle plot. I felt that none of the R waffling options were simple and clean enough, so I just wrote a loop that writes out SVG code into a text file. Each pass through the loop is a square, and pulls in the relevant color (hex code). This got labels added in Inkscape.
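A loop of that kind might look like the sketch below. The group sizes, hex codes, square size and file name are all made up for illustration; this is not the code from Ch4.R.

```r
# Hypothetical inputs: 100 squares in three groups, with made-up hex colors
group_sizes <- c(62, 27, 11)
group_colors <- c("#1b6ca8", "#e8a838", "#c8372d")
square_colors <- rep(group_colors, times = group_sizes)

n_cols <- 10   # squares per row
size <- 18     # side of each square, in SVG user units
gap <- 2       # space between squares

# Each pass through the loop writes one <rect>, changing only x, y and fill
svg_lines <- character(length(square_colors))
for (i in seq_along(square_colors)) {
  col_index <- (i - 1) %% n_cols
  row_index <- (i - 1) %/% n_cols
  svg_lines[i] <- sprintf(
    '<rect x="%d" y="%d" width="%d" height="%d" fill="%s"/>',
    col_index * (size + gap), row_index * (size + gap),
    size, size, square_colors[i]
  )
}

# Wrap the squares in an <svg> element and write a plain text file
n_rows <- ceiling(length(square_colors) / n_cols)
header <- sprintf(
  '<svg xmlns="http://www.w3.org/2000/svg" width="%d" height="%d">',
  n_cols * (size + gap) - gap, n_rows * (size + gap) - gap
)
svg_file <- file.path(tempdir(), "waffle.svg")
writeLines(c(header, svg_lines, "</svg>"), svg_file)
```

The resulting file opens in a browser or in Inkscape, where labels can be added.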

Figure 4.3, a somewhat tongue-in-cheek illustration of a ternary plot, with a tip of the hat to an old Mars bar advert, was made entirely in Inkscape.

Figure 4.4, a simple bar chart with two bars, is entirely made in Stata.

Figures 4.5 (a stacked bar chart) and 4.6 (two clustered bar charts) were generated from Stata. The two parts of Figure 4.6 were combined in Inkscape and the font enlarged accordingly.

Figure 4.7 (two clustered bar charts) was made by removing the bars from the SVG of Figure 4.6 and adding new bar SVG code generated in R. I wanted to keep the rest of the graph identical in dimensions to Figure 4.6, although I lived to regret it. This is not an easy way to make a graph, and you'd only consider this kind of approach if you are writing a book!

Figure 4.8, a parallel sets plot, a.k.a. Sankey diagram, was generated from R; there are many options for this but I went with ggalluvial. You might like to compare it with the (to my mind) inferior version.

Figure 4.9, a treemap, was created in R and then had its labels edited in the SVG code.

Figure 4.10 is a tree containing three waffle plots. The waffles were made in R, then combined in Inkscape. The tree part was created there.

Chapter 5: Showing data or statistics

Check back in October for The Making Of...

Chapter 6: Differences, ratios, correlations

Check back in October for The Making Of...

Chapter 7: Visual perception and the brain

Check back in October for The Making Of...

Chapter 8: Showing uncertainty

Check back in October for The Making Of...

Chapter 9: Time trends

Check back in October for The Making Of...

Chapter 10: Statistical predictive models

Check back in October for The Making Of...

Chapter 11: Machine learning techniques

Check back in October for The Making Of...

Chapter 12: Many variables

This R script file accompanies my blog post on the Saturn images. The rest of the chapter will appear here after publication on 19 September 2018.

Chapter 13: Maps and networks

Check back in October for The Making Of...

Chapter 14: Interactivity

Check back in October for The Making Of...

Chapter 15: Big data

Check back in October for The Making Of...

Chapter 16: Wrapping up the package in a report, dashboard or presentation

Check back in October for The Making Of...