Dataviz book


My book Data Visualization: charts, maps and interactive graphics was published in 2018 as part of the series "Statistical Reasoning in Science and Society," a collaboration between CRC Press and the American Statistical Association.

There are several good books available on dataviz, so I wanted to make sure this offers readers something different. Most contemporary dataviz books come from a design or journalism background, and there hasn't been an all-round statistical one since William Cleveland's in 1993. Mine covers many statistical techniques and many visualisation formats. I argue that, to make great visual communications of analytical work, one has to consider the graphical and statistical elements together. The analysis informs good visualisation, and the visualisation informs good analysis.

It is short, affordable and accessible to all: no algebra. There is a series of short chapters, each focused on one specific task, such as time trends, maps or interactivity. You can read a chapter in a coffee break, so you can easily start to build your awareness of dataviz.

The cover art is by Jill Pelto.

Click here to view the Table Of Contents.

I have three audiences in mind:

  • The data analyst who was never taught about good visualisation and wants to expand their skills
  • The budding dataviz designer, whether you are at high school and considering careers, or coming to the end of studies in statistics, design or web development
  • The boss who has to hire, or commission, someone to make dataviz for their organisation, monitor their progress and be assured of good results

I wrote an article called Calculate And Communicate about why dataviz is a great field to work in, and why I needed to write yet another book on the subject. This is in Significance magazine's December 2018 issue.

The making of...

Here's where you can find all the code I used to generate my own images in the book. Each chapter has its own tab below. I hope you find it useful.

Chapters:

Table of contents

Click here for a detailed table of contents in PDF

    Section I: The basics
    1. Why visualize?
    2. Translating numbers to images
    Section II: Statistical building blocks
    3. Continuous and discrete numbers
    4. Percentages and risks
    5. Showing data or statistics
    6. Differences, ratios, correlations
    Section III: Specific tasks
    7. Visual perception and the brain
    8. Showing uncertainty
    9. Time trends
    10. Statistical predictive models
    11. Machine learning techniques
    12. Many variables
    13. Maps and networks
    14. Interactivity
    15. Big data
    16. Wrapping up the package in a report, dashboard or presentation
    Section IV: Closing remarks
    17. Some overarching ideas

Chapter 1: Why visualize?

Fig 1.1 was manually annotated onto a scatter plot of train delays (but not the same dataset that we use in the rest of the book) in Inkscape. I only saved the raster image, like an idiot (this is the only such instance in the whole book! always keep an SVG of everything!). Still, you can look at the unannotated version (generated from Stata) here.

Fig 1.2 was a screenshot from DrawMyData, a little interactive webpage I made a while back, taken by Alberto Cairo after he drew the Datasaurus, and kindly shared with me. The Datasaurus became his most popular blog post ever and got incorporated into a lot of other people's work, a small minority of whom acknowledged poor old Alberto. The most ingenious of these was Autodesk Research, who got my methodviz of 2017 award for it.

Fig 1.3 was made in R with some pretty standard and ugly base code. See Ch1.R; the SVG file is here.

Figs 1.6-1.9 were made, as you can see in the book, with paper, tracing paper, acetate transparency, coloured pencils, sticky tape and fine nib whiteboard marker pens. The map is by Google and the data are based on (but not accurate to) those in the book "Statistics: unlocking the power of data" by Lock, Lock, Lock, Lock and Lock. I recommend this book for beginners in statistics, by the way. You can also find the dataset on their website StatKey.

Fig 1.10 was made by creating the map in Mapbox, saving it as a raster, importing it into Inkscape so it became the background to an SVG file, and then adding the other stuff (rings, bars, text, dotted lines) on top. I got the basic layout right in Inkscape then added the bars, circles and lines in the text editor because it was easy to get everything lined up, and to keep the code simple. The SVG file is here.

Figs 1.4 & 1.5 were made by other people.

Chapter 2: Translating numbers to images

Figs 2.1-2.6 are generated from base R graphics (see traindelays-4weeks.R). Because they are all showing variations on the same plot, there is one function defined at the beginning, with some options, such as to show markers or lines, and then this is repeatedly called inside png() and svglite::svglite() graphics devices. This ensures that there is consistency in things like margins and labeling, because they can be changed in just one place and affect all the graphs generated from that function.
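If you haven't used that pattern before, here is a minimal sketch of the idea (the data frame, column and file names are hypothetical stand-ins, not the ones in the book):

drawdelays <- function(type = "l") {
  # margins, limits and labels are all defined once, in here
  par(mar = c(4, 4, 1, 1))
  plot(trains$week, trains$delays, type = type,
       xlab = "Week", ylab = "Trains delayed")
}
png("fig2-1.png", width = 800, height = 600)
drawdelays(type = "p")
dev.off()
svglite::svglite("fig2-1.svg", width = 8, height = 6)
drawdelays(type = "p")
dev.off()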

The data were assembled from a few different Excel spreadsheets published by the Office of the Rail Regulator (now the Office of Rail and Road). You can get my cleaned up version in traindelays.csv.

Figure 2.1 is intended to be a basic scatter plot (the SVG file is here), and Figure 2.2 a basic line chart of the same data (the SVG file is here). I like this time series and how it has served several different purposes in the book. Here, the occasional spikes in the line chart show that a line chart emphasises outliers by putting more ink on the page for them. Sometimes, that is what you want to do, but sometimes not.

Figure 2.3 is about highlighting (and overloading the reader), so it is the same as 2.2 but with extra markers or rectangles added in R. The SVG file is here. Figure 2.4 changes the encoding of variables to axes (the SVG file is here), and Figure 2.5 adds color coding to Figure 2.4 (the SVG file is here).

Figs 2.3-2.5 are pairs of images that I later decided to put together into one in Inkscape (to save page space). Nothing else was changed. Since I did that, the multipanelfigure package came along. That could have saved me a lot of time.

The images in Table 2.1 and figure 2.7 were just typed into a text editor. If you look at the SVG files 2-different-lines.svg and the various files beginning 2-table1... on Github, you'll see how simple the SVG can be to read and write, once you understand a few basics.

Figure 2.8 is made in Inkscape. I downloaded an SVG train icon from Wikimedia Commons (I chose a particularly stylized one), and repeated it in rows, cropping one, and then adding text. The SVG file is here.

Figure 2.12 was just SVG typed into a text editor; the SVG file is here.

Figure 2.13 was generated as repeated lines of text in R (see Ch2.R) that made up an SVG waffle. The SVG file is here, and if you look at it in the text editor, you'll see how each line is written out, changing only the x and y coordinates as it goes along.
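If you want to try the same trick yourself, a sketch like this (sizes and fill color are arbitrary choices, not the book's) produces a ten-by-ten waffle:

svg <- '<svg xmlns="http://www.w3.org/2000/svg" width="200" height="200">'
for (i in 0:9) {
  for (j in 0:9) {
    # each square is one line of SVG; only x and y change as we loop
    svg <- c(svg, paste0('<rect x="', j * 20, '" y="', i * 20,
                         '" width="18" height="18" fill="#cd4c4c"/>'))
  }
}
svg <- c(svg, '</svg>')
writeLines(svg, "waffle.svg")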

Figures 2.6 and 2.9-2.11 are made by other people.

Chapter 3: Continuous and discrete numbers

All the R code is in Ch3.R.

Figure 3.1's strip chart or dot plot was generated with the R function stripchart(). I wrapped that in a function drawstripchart() so that parameters would only need to be specified in one place, making sure that the SVG and PNG versions match. The SVG file is here.

The histograms in Figure 3.2 were generated separately for 10 bins, 20 bins, and 50 bins, combined in Inkscape (the SVG file is here), and the common titles were lined up there.

The kernel density plots in Figures 3.3 and 3.4 use standard base R functions, with default Gaussian kernels and bandwidth set to the all-round recommended SJ option. The SVG files are here and here.
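In case you want to apply that to your own data, the relevant call is just something like this (delays being a hypothetical numeric vector):

plot(density(delays, bw = "SJ"), main = "", xlab = "Minutes of delay")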

Figure 3.5 was combined in Inkscape from three plots generated by ggplot2::ggplot(). The plot with two histograms used geom_histogram and facets by city. The plot with superimposed kernel densities used geom_density with color set by city. The heatmap used geom_tile and encoded city to the vertical axis. These reflect three different ways of practically comparing two groups in your data within ggplot2. There are SVG files for the histogram, the kernel density, and the heatmap, as well as the combined final version.
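As a rough sketch of those three approaches (not the book's actual code, which is in Ch3.R; the data frame commute with columns time and city is made up here):

library(ggplot2)
# two histograms, faceted by city
ggplot(commute, aes(x = time)) +
  geom_histogram(bins = 20) +
  facet_wrap(~ city, ncol = 1)
# superimposed kernel densities, colored by city
ggplot(commute, aes(x = time, colour = city)) +
  geom_density()
# heatmap: bin the times first, then encode city to the vertical axis
binned <- aggregate(list(n = commute$time),
                    by = list(bin = cut(commute$time, 20), city = commute$city),
                    FUN = length)
ggplot(binned, aes(x = bin, y = city, fill = n)) +
  geom_tile()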

Figure 3.6 is a straightforward violin plot; the SVG file is here.

Figure 3.8 is composed of four individual line charts with a particular quartile highlighted each time. The vertical bars in grey and red are added manually in Inkscape and the plots put together into one image; the SVG file is here, and the individual plots are named 3-small-change-... This is an example of the sort of composite dataviz that you could either create in one pass from complex code, or put together from simpler plots with some manual editing. Often, the choice is driven by whether you are likely to use the code again with new data -- which would argue for complex code without manual intervention.

Figure 3.9 is the sort of composite plot where functions have already been written to make their creation simple for you (I used ggscatterhist()). The SVG file is here.
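I believe ggscatterhist() lives in the ggpubr package; a minimal call, on the built-in iris data rather than the book's, looks like:

library(ggpubr)
ggscatterhist(iris, x = "Sepal.Length", y = "Sepal.Width",
              margin.plot = "histogram")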

Similarly, 3.10 comprises a hexagonal bin plot and a contour plot, both of which can be obtained quite simply in R. There are SVG files for the hexbin, the contour, and the combined image.
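For example, with ggplot2 and the built-in faithful data (the book used its own data, of course):

library(ggplot2)
p <- ggplot(faithful, aes(x = eruptions, y = waiting))
p + geom_hex(bins = 20)   # hexagonal bin plot (needs the hexbin package installed)
p + geom_density_2d()     # contour plot of a 2-D kernel density estimate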

Figures 3.11 and 3.12 are simple base R plots, except that the right part of Figure 3.12 is actually faked (R does not, except perhaps in foolish user-written packages, permit axis-breaking) by plotting the last point at y=26, then adding the pair of diagonal break lines and changing the label values below them on the y-axis manually in Inkscape. There are SVG files for Figure 3.11 and Figure 3.12.

Figure 3.7 was not made by me, but by Henrik Lindberg, who does a lot of other cool dataviz stuff. There have been many charts along these lines, most famously the album cover mentioned in the caption, but this one got a lot of attention when I was writing the book, is nicely designed, revived the format, and is quite accessible and interesting in itself.

Chapter 4: Percentages and risks

All the R code is in Ch4.R and the Stata code in Ch4.do.

Figure 4.1 started as a bar chart in Stata. I opened this in Inkscape, copied everything to appear again alongside, then replaced the bar with a series of rectangles and added annotation down the side. The three colors are generated from colorhexa.com as ternary colors that stand out clearly together, though I then chose to brighten up the red somewhat. That red originally came from a leaf in my local park, sampled with the ColorGrab app on my phone.

Figure 4.2 is a very simple waffle plot. I felt that none of the R waffling options were simple and clean enough, so I just wrote a loop that writes out SVG code into a text file. Each pass through the loop is a square, and pulls in the relevant color (hex code). Labels were then added in Inkscape.

Figure 4.3, a somewhat tongue-in-cheek illustration of a ternary plot, with a tip of the hat to an old Mars bar advert, was made entirely in Inkscape.

Figure 4.4, a simple bar chart with two bars, is entirely made in Stata.

Figures 4.5 (a stacked bar chart) and 4.6 (two clustered bar charts) were generated from Stata. The two parts of Figure 4.6 were combined in Inkscape and the font enlarged accordingly.

Figure 4.7 (two clustered bar charts) was made by removing the bars from the SVG of Figure 4.6 and adding new bar SVG code generated in R. I wanted to keep the rest of the graph identical in dimensions to Figure 4.6, although I lived to regret it. This is not an easy way to make a graph, and you'd only consider this kind of approach if you are writing a book!

Figure 4.8, a parallel sets plot, a.k.a. Sankey diagram, was generated from R; there are many options for this but I went with ggalluvial. You might like to compare it with the (to my mind) inferior version. Sorry, but the actual number of people moving, given in the caption, is wrong, referring to the old (inferior) version. You get the idea, though, I hope. If not, email me!

Figure 4.9, a treemap, was created in R and then had its labels edited in the SVG code.

Figure 4.10 is a tree containing three waffle plots. The waffles were made in R, then combined in Inkscape. The tree part was created there.

Chapter 5: Showing data or statistics

All figures except 5.4 were made in R. The code for 5.1 and 5.3 is in Ch5.R, and the code for 5.5, 5.6 and 5.7 is in traindelays-4weeks.R.

Figure 5.1 is simply a scatter plot where the data supplied are actually summary statistics. In this case, I just made up some stats to go alongside the previous example of Atlanta commuters.

Figure 5.2 is a base R histogram with more invented data. Systolic blood pressure really does have a very normal distribution in healthy populations, because of the many small factors bumping your SBP up or down at any given time. Because these are pretty much independent of each other, they act according to the Central Limit Theorem and provide a normal distribution with mean about 120 and SD about 12. I didn't save this rudimentary R code, but it would be something like:
hist(rnorm(1000,mean=120,sd=12),xlab="Systolic Blood Pressure (mmHg)",ylab="Number of people",breaks=20,main='')

Figure 5.3 is a scatterplot with lines added. This sort of plot is easy to construct in base R graphics. You might feel it's cool to use ggplot2, but if you need to make charts quickly for internal consumption / user-testing and then polish them up in SVG, it doesn't matter what you use for basics like this. If you prefer ggplot2 and find it easier to code, use it. If you prefer base, do likewise.

Figure 5.4 was drawn from a scientific paper, "A multi-professional educational intervention to improve and sustain respondents' confidence to deliver palliative care: A mixed-methods study", which I contributed to. This tracked health professionals as they took a course in end-of-life care. In this chart, their confidence is shown on a set of nine topics, across three time points: baseline, three months later and six months later. It was done in Stata with graph box. The colors are generated as shades of the logo of Princess Alice Hospice, which provided the training.

Figures 5.5, 5.6 and 5.7 return to the train delays data and show summary statistics from that dataset. 5.5 shows the boxplot (quartiles and extreme values) for each of the twenty years, alongside the mean (dotplot) for each year. 5.6 shows the quartiles as a line chart with a bolder central line, and the same design using the Winsorised mean and median absolute deviation. 5.7 shows two smoothed lines, splines and LOESS, and in this case new values are generated on a finer grid from the smoothing algorithm, and those new values are drawn.
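The finer-grid idea is simple enough in base R. A sketch, with hypothetical names (trains, week, delays) and an arbitrary span:

fit <- loess(delays ~ week, data = trains, span = 0.5)
grid <- data.frame(week = seq(min(trains$week), max(trains$week), length.out = 200))
plot(trains$week, trains$delays, col = "grey")
lines(grid$week, predict(fit, newdata = grid), lwd = 2)                         # LOESS
lines(predict(smooth.spline(trains$week, trains$delays), grid$week), lty = 2)   # spline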

Chapter 6: Differences, ratios, correlations

Figures 6.1, 6.2, 6.5 and 6.7 were made in R and the code is here. Figure 6.6 was made in Stata and the code is here. Figures 6.3 and 6.4 were made by other people.

Figure 6.1 is simply a bar chart, and Figure 6.2 is a scatterplot with lines added to make the lollipop look.

Figure 6.5 comprises semi-transparent scatterplots of random data with correlations of 0.2, 0.6 and 0.9. Figure 6.6 does much the same thing but with the iris dataset, in a matrix layout. Figure 6.7 is the same thing, but replaces the scatterplots with blocks of colour that show the statistic of correlation rather than the data.

Chapter 7: Visual perception and the brain

Figures 7.2 and 7.4 were made in R and the code is here. Figure 7.1 was amended in Inkscape, and 7.6 was made in the text editor. Figure 7.3 is a screenshot, and Figures 7.5 and 7.7 were made by other people.

Figure 7.1 is the same bar chart as Figure 4.6 (left), but the axes were cut down, surrounding box removed, labels switched for color, and a little legend added, all manually in Inkscape. (You might be interested to compare Figure 4.6 (left), which was generated from Stata and is very lean, with what Inkscape does to it behind the scenes, adding lots of SVG code bloat, including the local directory name on my laptop! To avoid this, you would have to do the edits in the text editor.)

The three parts of Figure 7.2 are separate SVG files here: top, middle and bottom. The first two are completely made in R. The bottom image is a green line chart on a black background, with a subset of the same data superimposed as another line. Then, in Inkscape I selected the subset line, and applied the Inkscape filter called "Metallized Ridge" (in the Filters - Ridges menu) to that to make it look more physical, like a glowing neon light perhaps. (In the SVG file, note that you can see what filter was applied -- which might be a useful form of bloat in various ways.) That filter might or might not appear in your browser, depending on its support for various advanced SVG features.

Figure 7.3 was a screenshot from my (now defunct) webpage tracking birdseed consumption in my garden and musing on hypothesis generation and the interplay of inference and explanation. That page explained how the splines and the step rates are generated. If it is of interest, email me.

Figure 7.4 returns to the train delay data and is just a series of lines drawn on top of an empty plot region. In the R script, you'll see that I made a function called tukey() but didn't go so far as to really make it stand alone. Still, you could adapt it without much effort, by removing the hard coding, and making the default argument value a function of the range of y values. You'd be advised to get rid of the loop too, lest you offend R anti-loopers. Because it's here under The Unlicense, you can even pretend you made it all yourself.

Chapter 8: Showing uncertainty

All figures were made in R and the code is here.

My use of the term "shortcut formula" for asymptotics (standard errors and such) marks me out as a follower of the ASA GAISE guidelines and the associated efforts to reform basic statistics education. I am not dismissing asymptotics, but I think that people who are not steeped in stats, don't intend to become technical statisticians, but have to do some calculations, benefit from the simulation-focussed approach, and the bootstrap is a big part of that. The Locks' book, whence came the Atlanta commuting data, is an introductory freshman textbook that takes this approach.

Figure 8.1 is a scatterplot with some random data in two variables. This is drawn first with graphics::plot() and the bootstrap means are added as additional points(). I wrote a loop for(i in 1:iter){, which samples the data with replacement and calculates the mean each time. This shows you the basic bootstrap from first principles, so you can see what's going on. I think the bootstrap is a powerful tool that more datavizzers should be using. Normally, though, you would firstly use a bootstrapping package to make your code easier, and secondly use a better bootstrapping technique like bias-correction (these are options that come bundled in packages). The bootstrap points are semitransparent: the color hex code is #cd4c4c15, the last two characters providing a byte value for opacity, from 00 (completely transparent, or in other words, invisible) to ff (completely opaque). 15 is pretty low (it's in hexadecimal numbers, remember, so it's actually 21 out of 255) so the markers are just barely there.
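A first-principles sketch of that loop, with a hypothetical two-column data frame dat (columns x and y), would run along these lines:

set.seed(123)
n <- nrow(dat)
iter <- 1000
boot_means <- matrix(NA, iter, 2)
for (i in 1:iter) {
  # resample rows with replacement and store the column means
  resample <- dat[sample(n, replace = TRUE), ]
  boot_means[i, ] <- colMeans(resample)
}
plot(dat$x, dat$y, pch = 19)
points(boot_means, col = "#cd4c4c15", pch = 19)   # barely-there semitransparent red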

Figure 8.2 is pretty much the same thing, but with splines through the data, so the bootstrapped results are added as lines(). In draft, I had a note on how these random data represented the sort of pattern in two dimensions that splines do not help us understand, and that in fact, visualising them helps us to see that shortcoming. I noted, with glee perhaps not shared by any potential reader, that at the left and right ends of the chart, the splines were obliged to pass through the data, which were unusual outliers, even if that enforced wacky polynomial shapes elsewhere. But that paragraph got chopped. I'll just point it out here to avoid more experienced readers thinking I'm a blundering buffoon.

Figure 8.3 is a specialised plot that is used in clinical performance indicators, which I don't think is available in any packages in exactly the way I wanted it, so it is assembled here from scratch. This is the sort of visualisation that requires a lot of statistical work between data and image, and you should get a statistician involved to make sure it is done right.

Figure 8.4 was assembled from three separate plots, and the individual SVG files are here: left, middle and right. In the left image, I made an empty plot() with invisible markers at the extremes to get the size I wanted, then added many semitransparent (#00000033) lines.

In the middle image, I amended the left image in Inkscape, selecting lines near the extremes (I forget if it was the extreme, or the 3rd one in, or whatever), adding an arc at the top right end, removing all other lines except the middle one (median path), converting the paths to a polygon and filling it with color. This is a convex hull approach but done very crudely (Figure 8.4 was one of the last, if not the very last, to be made, and the clock was ticking!). You'd be better advised to program the selection of the nth line in from the extreme and then create the same look (ggplot2 will probably be easiest for that).

In the right image, the data points that make up the lines are treated as a cloud of points, and the contour plotting functionality in ggplot2 is used to obtain the shape we want. You might notice that it extends beyond the bottom left, like a kernel density. This might be problematic, depending on context. Hurricane paths or unemployment forecasts, for example, do not have uncertainty extending backwards in time.

Chapter 9: Time trends

All figures were made in R and the code is here. Figures 9.2, 9.4, 9.5, 9.7 and 9.8 were made by other people.

Figure 9.1 is a connected scatterplot where the color of each part of the line is determined by the antiquity of the data. There are also series for Scotland and Non-Scotland, so that makes four dimensions or variables. I made it aeons ago, and the presence of an integer called frametime rather suggests that I might have considered making an animation (though I never did). The code is here and the data are here. The data are in the public domain, via Wikipedia possibly. The three components are converted to a horizontal and a vertical dimension. The markers for elections, and the lines joining them, are added incrementally by loops, changing the colour each time. There may be a more loop-averse approach. The smooth curves are made by getting splines of the horizontal dimension (Lab v Con) through time, and the same for the vertical dimension (Two-Party System v Others) through time, then sticking the predicted horizontal and vertical values together. Another example of this use of splines is in this video.
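In outline, the spline trick looks like this (elections, year, labcon and twoparty are made-up stand-ins for the real variable names in the code):

tgrid <- seq(min(elections$year), max(elections$year), length.out = 300)
sx <- smooth.spline(elections$year, elections$labcon)     # horizontal dimension through time
sy <- smooth.spline(elections$year, elections$twoparty)   # vertical dimension through time
plot(elections$labcon, elections$twoparty)
lines(predict(sx, tgrid)$y, predict(sy, tgrid)$y)         # stick the predictions together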

Figure 9.3 was assembled in Inkscape. One mistake I made there was transforming three existing line charts (imported for some reason as raster images) with a huge vertical stretch, which thickened the lines in a vertical direction, so you get the look of a calligraphy pen, if you know what I mean. Not a disaster, but a simple mistake impacting slightly on style. The constituent parts were three line charts (SVG files: HIV, nukes and ozone). Data for HIV and nukes were downloaded (from the World Bank and the Bulletin of Atomic Scientists respectively) and tidied into CSV files. Ozone data were extracted from this image file from the British Antarctic Survey. You can see how in the R script; because we just want a spline through the points, we don't have to identify one point for each + symbol, instead we can just flag each dark pixel. Country outlines were obtained as SVGs from Wikimedia Commons.

Figure 9.6 is a standard statistical autocorrelation plot, obtained from the train delays data (code here: a one-liner plot(acf(trains$london_se,ci=(-1)),main='')).

Chapter 10: Statistical predictive models

Figures 10.1 and 10.2 were made in Stata and the code is here. Figures 10.4, 10.5, 10.6 and 10.8 were made in Stata and the code is here. Figures 10.3 and 10.7 were made in R and the code is here.

Figures 10.1 and 10.2 make some random data, split it into training and test datasets, fit a model to the training data, get predicted y-values from that model for both the training and test datasets and for many interpolated points, and then draw the original data as navy blue markers, the predictions for observed values as hollow orange markers, and the predictions at interpolated points as orange lines. The RMS error is added in a caption.

Figure 10.3 shows some random data on the left and the RMS errors for increasingly complex polynomial models on the right. Polynomial models are generally a bad idea, but here they are helpful for illustrating the general point, because they can be understood without much mathematical nicety, and they are nested inside each other. This is a subject expanded on with great clarity in Hastie and colleagues' book, "The Elements Of Statistical Learning".
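A self-contained sketch of the right-hand idea (simulated data, not the book's):

set.seed(1)
x <- runif(60)
y <- sin(2 * pi * x) + rnorm(60, sd = 0.3)
train <- sample(60, 40)
rmse <- sapply(1:8, function(d) {
  fit <- lm(y ~ poly(x, d), subset = train)
  pred <- predict(fit, newdata = data.frame(x = x[-train]))
  sqrt(mean((y[-train] - pred)^2))   # RMS error on the held-out test data
})
plot(1:8, rmse, type = "b", xlab = "Polynomial degree", ylab = "Test RMS error")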

Figure 10.4 is a dot plot where each regression coefficient is a dot, with the horizontal location representing the value of that coefficient. Confidence intervals are shown as error bars, and a vertical reference line at zero shows whether they are associated with a beneficial or harmful effect. Recall that logistic regression gives coefficients (betas) which can be interpreted as logarithms of odds ratios, so here, to be converted to odds ratios, they need to be exponentiated, and their confidence intervals too, which consequently become asymmetric. This uses the IMPACT trial data.
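The book's version was done in Stata, but in R the exponentiation step is one line; the data frame and variable names here are hypothetical stand-ins for the IMPACT data:

fit <- glm(died ~ treatment + age, data = impact, family = binomial)
exp(cbind(OR = coef(fit), confint(fit)))
# symmetric intervals on the log-odds scale become asymmetric odds ratio intervals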

Figures 10.5 and 10.6 are marginal effects plots, which show effects and confidence intervals like Figure 10.4, but this time the software is integrating out the effects of the other variables in the model as they interact or confound the predictor of principal interest (here, length of treatment). This also allows us to convert to risk ratios.

Figure 10.7 is, left and right, just a scatterplot with binary outcome variable or error encoded to marker colour, and an added line (hardcoded from the known relationship in the data) at 50% chance of a black/white marker.

Figure 10.8 is a standard statistical chart available in many software packages.

Chapter 11: Machine learning techniques

I originally made figures 11.1 and 11.5 (in R) under contract to DataCamp. It basically involved ggmap for the left and ggplot2 for the right, both in R, of course. I encourage you to check out ggmapstyles too. I don't work with DataCamp any more. Figures 11.2, 11.3, 11.4 and 11.6 were made in R and the code is here. Figures 11.7 and 11.8 were made by other people.

Figure 11.2 is a scatterplot with a third (outcome, dependent, target or endogenous variable, depending on your background) variable encoded as colour, then tree partition lines and labels added in Inkscape.

Figure 11.3 is similar but shows a regression tree, so the outcome variable is a continuous value and is encoded to marker size on the left image. In the right image, the residual (error) value is encoded to marker size instead. Here, the partition lines are added inside R.

A couple of sparklines turn up on the same page as Figure 11.3. The book was typeset in LaTeX, and there is a sparkline package which you can add, allowing you to draw these. This is what the actual LaTeX code for that paragraph looks like (the numbers are just made up to create a certain image):
Just as we could have \index{scatter plot}marginal densities on a scatter plot (Chapter 3), we could also add marginal plots of residuals versus predictor values (perhaps with a \index{smoothing}smooth regression line like LOESS through them). A random scatter \begin{sparkline}{13}
\sparkdot 0.50 0.79 blue
\spark 0.0 0.97 0.08 0.26 0.17 0.49 0.25 0.67 0.33 0.42 0.42 0.39 0.50 0.79 0.58 0.91 0.67 0.52 0.75 0.83 0.83 0.17 0.92 0.33 1.0 0.08 /
\end{sparkline} (with the blue dot at the location of the branch) would indicate that the tree found reasonable points at which to branch, while a change at the point of the branch \begin{sparkline}{13}
\sparkdot 0.50 0.56 blue
\spark 0.0 0.23 0.08 0.25 0.17 0.10 0.25 0.14 0.33 0.21 0.42 0.25 0.50 0.66 0.58 0.74 0.67 0.93 0.75 0.83 0.83 0.76 0.92 1.00 1.0 0.89 /
\end{sparkline} would suggest that maybe trees are not a good way to model these data.

Figure 11.4 was originally generated from the tree package in R, but required considerable tweaking in Inkscape afterwards to look at all good, so in my opinion it would be better to forget tree for this purpose and just draw it from scratch yourself.

Figure 11.6 is a scatterplot where markers within a certain radius of the centre are coloured red (that red leaf from the park, in fact).

The Tensorflow Playground appears in Figure 11.7, and is at playground.tensorflow.org.

Chapter 12: Many variables

I made figures 12.3 and 12.10 in R, and the code is here. Figures 12.7 and 12.9 (both using the Saturn-shaped random data) were also visualised in R, with the code here. Figure 12.11 was made in Stata (the code is here), using Adrian Mander's radar command. Figures 12.1, 12.5 and 12.8 were made by other people.

Figure 12.2 is a screenshot of a chart I whipped up quickly in a well-known spreadsheet package made in Redmond, WA. It would be rude to name it; similar presets are available in almost any entry-level number-crunching software you can name, and to be fair, this particular one has made the 3-D bar chart less prominent in the gallery of formats as time has gone by. I don't recall exactly what I clicked -- and I shouldn't tell you even if I did -- but suffice to say, it only took a few clicks.

Figure 12.3 shows the standard go-to dataset for 3-D graphics examples in R, called volcano, which gives altitude values over a grid of latitude and longitude. It's almost embarrassing in its ubiquity, but actually demonstrates the problem of obscuration, while also having an interestingly rough surface, and enough familiarity to make visual understanding quick.

Figure 12.4 was made in R a few years ago while I worked on research into tuberculosis of the eye. I have no SVG or bitmap for it (because of patient confidentiality). I generated a distance matrix between patients, ordered by hospital, and just pushed that into the base R function image(). You could, of course, make a prettier one by adjusting the colour scale.

Figure 12.6 is a photograph. I mention it here just for completeness. I obtained the pitta from Co-op Supermarket in Addiscombe and ate it immediately afterwards with some taramasalata. Hand: model's own. Anglo-centric tips for teachers: pittas are quite good for teaching dimension reduction because each orthogonal axis through the bread produces a slice with quite different range or variance. Baguettes have one very long axis and two much shorter ones that are more similar in bread-range. You might be able to find some kind of *pain rustique* (Brexit notwithstanding) that gives a similar effect to the pitta, while also being visible to a larger roomful of rapt learners, though you should be careful not to give the impression that you are being paid too much with your fancy loaf.

Figures 12.7 and 12.9 are just scatterplots; the projection is the important thing here: it is manually manipulated in 12.7 (see lines 52-54 in the code) and determined by dimension reduction techniques in 12.9.

Figure 12.8 is from Michael Greenacre's book Correspondence Analysis in Practice. Michael pointed out that my caption mangles the definition somewhat. The vertices of category space are not projected at the points seen in this symmetric map (for example, the one marked W(full-time)). They are shrunk towards the origin more than the data points are. You can imagine the category space being quite large, and the countries not radically different to one another, so scattered around the same region. If we projected everything at the same scale (an asymmetric map), we would get all the data crammed together in the centre of the image, which would be unhelpful for readers. So, we cannot say, for example, that AUS is more keen on w(part-time) than NZ. It is closer to the projected vertex point, but the location of the vertex is not accurate -- it is the direction of the vertex point you should consider (so, NZ is more inclined to w(part-time) than AUS because it is further out from the origin in that direction). Also, RP is the Philippines, not Romania as I seem to have got into my head somehow. If you want to read more about this underused technique, I very much recommend Michael's books, which are superbly clear and practical.

Figure 12.10 is a pretty standard dendrogram using a built-in R dataset, USArrests.

Figure 12.11 is particularly easy to do in Stata once you run ssc install radar. The data are here. There are, of course, plenty of ways to achieve radar plots in other software.

Chapter 13: Maps and networks

There's one error here on page 171. I refer to pie charts superimposed on a map in Figure 13.5 and bubbles in Figure 13.4. Somehow, when I updated all the figure numbers, I missed those ones. The pie charts got dropped (you can imagine it pretty well without seeing it), and the Hampshire map moved in before the bubbles. So, it should read "One often sees pie charts and bubble charts (Figure 13.5) superimposed like this...".

I made figure 13.2 in Mapbox, figure 13.7 with some R, some OpenStreetMap, smashed together in the GIMP, and I made figure 13.11 with sweet JavaScript.

I made the two parts of Figure 13.2 in Mapbox. Using the Studio feature, you can select various layers, like the boundary of water, the text labels of towns of a certain size, and so on, and make them visible or invisible, and change various stylistic elements like colours and fonts. I recommend Mapbox as an affordable tool allowing you to get to grips with mapping data, without having to invest the time in learning a fully-fledged GIS software package. In this instance, I turned off everything except place names for cities, motorways and major roads, and water. I exported them as raster images, so there is no SVG.

Figure 13.7 is an image I made a few years ago, when I found that the John Snow cholera data had been put online by Robin Wilson. I made a hexagonal bin plot of latitude vs longitude (with the hexbin package in R), noting the corner values, then lined that up with a screenshot of contemporary Soho, London, in OpenStreetMap. So, it's all a bit bodged because of its antiquity. Nowadays, I'd suggest making it in R with ggmap and adding a geom_hex() layer over the top. Darker hexagons indicate more deaths. At the time, I tinkered with the hexagon size and decided on this one. Although it is small enough to break up the overall blob around the Broad Street pump, it has the advantage of showing individual streets, which makes it more personal and real for me, including some streets that don't exist any more.
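A sketch of that suggestion, minus the basemap, assuming a data frame cholera with lon and lat columns for the deaths (the names are mine, not Robin's):

library(ggplot2)
ggplot(cholera, aes(x = lon, y = lat)) +
  geom_hex(bins = 30) +
  scale_fill_gradient(low = "grey90", high = "black") +   # darker hexagons, more deaths
  coord_quickmap()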

Figure 13.11 is a screenshot of a webpage constructed in D3 (more on that in the Interactivity chapter). It used to serve as my CV or resumé, a rejection of paper templates and a cringingly blatant attempt to appear hip and tech. (Yes, I know that radial networks and edge bundling are sooo 2012, but it was, in fact, 2012.) It was, in large part, developed from Mike Bostock's example, a smarter version of which can be seen here.

Chapter 14: Interactivity

Figures 14.1 and 14.2 were made by other people.

The State Of Obesity report in its latest form can be found at stateofobesity.org/adult-obesity. It's not as cool as it used to be.

The home pages for D3 and Leaflet are at d3js.org and leafletjs.com.

"Bussed Out" is at theguardian.com/us-news/ng-interactive/2017/dec/20/bussed-out-america-moves-homeless-people-country-study.

"How the Recession Reshaped the Economy, in 255 Charts" is at nytimes.com/interactive/2014/06/05/upshot/how-the-recession-reshaped-the-economy-in-255-charts.html.

Amanda Cox's porcupine chart in "Budget Forecasts, Compared With Reality" is at archive.nytimes.com/www.nytimes.com/interactive/2010/02/02/us/politics/20100201-budget-porcupine-graphic.html. You might find you can't view it without Flash, and gee, who has Flash anymore? It would be nice if there were screenshots of it... (pssst... I have some. Come to one of my courses and you will see them in Fair Use. No, I can't send them to you. They are the intellectual property of the NYT.)


Paul Lambert's splines are at le.ac.uk/hs/pl4/spline_eg/spline_eg.html.

Wattenberg and Viégas' t-SNE interactive is at distill.pub/2016/misread-tsne/.

Rasmus Bååth's Bayesian t-test is at sumsar.net/best_online.

StatKey is at lock5stat.com/StatKey.

You can learn more about Shiny at shiny.rstudio.com.

Chapter 15: Big data

Chris Whong's 2013 data are explained here, and there are links to download them. I made Figure 15.1 in Stata as part of a talk at the London Users' Group meeting in 2016. An illustrative (non-C++) do-file is here. This is the same dataset I sampled from for the DataCamp project mentioned in Chapter 11.

Figure 15.2 is by Oliver O'Brien and the original website is here.

Chapter 16: Wrapping up the package in a report, dashboard or presentation

The only images in this chapter that I made are Figures 16.2 (the train delays poster) and 16.3 (quality of life in art therapy for dementia).

Figure 16.2 is made from three images in Chapter 2. The SVG images were placed into a large Inkscape canvas, along with the raster image of the red leaf, and accompanying text was added there. Note that the file is pretty big for an SVG, because it includes the wasteful raster information. I tried to balance the information to keep it as light as possible, while also allowing readers of low statistical literacy to delve into the patterns in the data. More recently, I found one of several rough sketches, which you can look at here. You can see some ideas that were dropped, like a map of railway lines radiating from London, and an idea of the scale of the problem (GU2, see Chapter 1). You can read more about the decisions in Chapter 16.

Figure 16.3 came from the report of a study called RADIQL, which followed art therapists conducting reminiscence sessions for people with dementia in nursing homes. I did the statistical analysis. The model here was a multilevel regression, which you might call panel data if you had some econometric education. There is a linear change over the weeks in both arms of the study, and a quadratic function of time within each session and immediately afterwards, which gave an idea of benefit plateauing and let us look at a rebound/anticlimax effect too. Quadratics are generally a pretty bad idea, but this was relatively easy to explain to the audience and, within the circumscribed timescale, didn't zoom off (like you saw in Figure 10.2). There is no SVG for this figure, but here's a PNG.