Non-parametric Bayesian updating

This page explains my work on this topic and is a repository for code and extended outputs that could not fit into publications. The code is ©BayesCamp Ltd 2021, and licensed for use CC-BY-4.0. The text and images in this web page are also ©BayesCamp Ltd 2021, and are NOT licensed for reuse. You may use the code as you wish, as long as you acknowledge the source, but you may NOT distribute or publish the rest of this web page in ANY form without the explicit written consent of BayesCamp Ltd. Send requests to robert@bayescamp.com

Assumed background knowledge: Theory and practical experience of Bayesian analysis, R, Stan, matrix algebra, classification/regression trees. The section on ensembles additionally assumes understanding of bagging and boosting (to the level of reading EoSL). The section on rotated kudzu ensembles as a URINE model uses a little Lie algebra terminology but anyone with the above prerequisites can get by after reading the BayesCamp white paper (coming soon).

The Creative Commons license means that you can adapt and use this for your own work, share it, and make money out of it, just so long as you name BayesCamp Ltd as the source of the original algorithm. If you find it useful and want to say thanks, you can click on Buy Me A Coffee below and do just that. I am a freelancer and while I develop ideas like this, I am not earning. If you want to hire BayesCamp to help you integrate these ideas into your work, head on over to bayescamp.com.

Table of Contents

  1. Primer
  2. Formulas for the kudzu algorithm
  3. Software for the kudzu algorithm
  4. Bayesian windowing and moving weighted averages for streaming data and data drift
  5. Extended material from the 2021 IJCEE paper
  6. How to use kudzu ensembles with bagging and boosting
  7. Basics of Bayesian inference for rotations and the URINE class of ensemble models; application to kudzu
  8. Improvements and further research ideas

Primer

Bayesian updating is the principle that you can add data sequentially in batches to learn a posterior distribution. If done correctly, the result will be the same as if you had analysed all data together.

In my experience, teachers of intro Bayes courses talk about this all the time. Then it never gets used in real life. Why not?

It is actually much harder than it seems in simple one-parameter examples with conjugate priors. A prior is actually a joint density over all the parameters and other unknowns in your model, even though probabilistic programming languages like BUGS, Stan and PyMC* have drawn us into thinking of it as a series of marginal distributions, one for each parameter. Parameters are generally correlated with each other, so the joint specification cannot be reduced to a set of marginal distributions without loss of the correlation information. Another problem is that multimodality or non-linearity might be masked by the marginal distributions.
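As a tiny illustration of the point about correlation (not from the paper; just a sketch), the R snippet below draws a correlated bivariate sample and then rebuilds it from its marginals alone; the joint structure disappears.

```r
# Illustration only: a joint prior cannot be recovered from its marginals.
set.seed(1)
Sigma <- matrix(c(1, 0.8, 0.8, 1), 2, 2)                    # strong positive correlation
joint <- matrix(rnorm(2000), ncol = 2) %*% chol(Sigma)      # draws from the joint prior
marg_only <- cbind(sample(joint[, 1]), sample(joint[, 2]))  # same marginals, correlation broken
cor(joint)[1, 2]       # roughly 0.8
cor(marg_only)[1, 2]   # roughly 0
```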

Parametric updating is possible if you correctly know the joint distribution. You can specify hyperparameters for the next prior by analysing the previous posterior. In the case of a multivariate normal prior and posterior for p parameters, this would mean a vector of p means and a p-by-p covariance matrix.
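For concreteness, here is a minimal sketch of that parametric route, assuming the previous posterior sample is stored as a draws-by-parameters matrix called post_draws (a placeholder name):

```r
# Parametric updating sketch: summarise the previous posterior sample as a
# multivariate normal, and use the summaries as hyperparameters of the next prior.
prior_mean <- colMeans(post_draws)   # vector of p means
prior_cov  <- cov(post_draws)        # p-by-p covariance matrix
# These would be passed as data to the next model, e.g. a Stan model whose
# prior is theta ~ multi_normal(prior_mean, prior_cov).
```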

But generally, we do not know the distribution. That's why we're doing the analysis.

A non-parametric updating method would use the posterior sample to inform the prior density. In this sense, we have a prior sample which is also the previous posterior sample. Because these samples can have a dual purpose, I call them p[oste]riors.

Kudzu graveyard

A winter scene of kudzu growing over bushes and trees at Floyd Bennett Field, New York. Image by Wikimedia user "Rhododendrites", CC BY-SA 4.0, via Wikimedia Commons

At heart, this is a density estimation problem. Given the prior sample, we need to be able to obtain a prior density at any proposed parameter vector. I think of this as a surface in p-dimensional parameter space. The classic method is kernel density estimation, but it will not work for anything but the smallest p, maybe p<5. This is long established (see Azzalini), and I demonstrated it in this context in the IJCEE paper.

Instead, I devised the kudzu algorithm, which uses density estimation trees (DET), a fast and scalable method that uses the same tree-growing algorithm as classification and regression trees, but to estimate a density. The DET estimate is piecewise constant, which will not help with Bayesian sampler algorithms such as Hamiltonian Monte Carlo (implemented in Stan and PyMC*), the Metropolis-adjusted Langevin Algorithm or Piecewise Deterministic Markov Processes; these samplers need to obtain and use gradients of the (log) density, so the estimated density should be smooth.

To smooth the DET estimate, I replace the edges of leaves (you might say I "convolve" them) with inverse logistic ramp functions. The formulas and code are below. Evaluation shows kudzu performing with comparable fidelity to kernel density in low-p problems, and performing quickly and accurately in p up to 50.

graph

A one-dimensional illustrative set of two leaves (solid lines) and a corresponding kudzu density (dashed curve)

Formulas for the kudzu algorithm

DET returns a collection of leaves, each of which has a top and bottom value in each dimension, and a density. This is like a collection of multivariate uniform densities (which are zero outside the leaf). Imagine the complete DET density estimate as the sum of all these leaves' multivariate uniform densities. We will construct the kudzu density in the same way, as a sum of leaves.

The kudzu leaves have uncorrelated multivariate densities, so we can construct those in turn as the product of univariate densities. Each of those univariate densities is the product of two inverse logistic ramps.

Statistically minded readers might get some extra intuition from the fact that the inverse logistic ramp is the logistic distribution's CDF. Its derivative (the logistic PDF) is the product of two inverse logistic ramps sharing a common midpoint (where their ordinate is 0.5), but facing in opposite directions. If we move those two ramps further apart or closer together, and potentially alter their steepness, we have a univariate component for one kudzu leaf.
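A minimal R sketch of that univariate building block, using plogis() for the inverse logistic ramp. The argument names follow the notation defined just below; the normalising denominator and the DET leaf density are left out here.

```r
# Unnormalised univariate kudzu component for one leaf in one dimension:
# an upward ramp near the (delta-translated) bottom edge times a downward
# ramp near the (delta-translated) top edge. plogis() is the logistic CDF.
kudzu_component <- function(theta, mu_b, mu_t, delta_b, delta_t, sigma) {
  plogis((theta - (mu_b + delta_b)) / sigma) *    # rises through the bottom midpoint
    plogis(-(theta - (mu_t + delta_t)) / sigma)   # falls through the top midpoint
}
# A leaf on [0, 2], shrunk slightly by delta-translation, with bandwidth 0.1:
curve(kudzu_component(x, mu_b = 0, mu_t = 2, delta_b = 0.1, delta_t = -0.1, sigma = 0.1),
      from = -1, to = 3)
```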

This is the notation in the formulas:

  • θ is a vector of proposed parameter values, evaluated by one iteration of a sampling algorithm
  • ℓ is a natural number identifying one leaf of the DET
  • f_DET(ℓ) is the density estimated by DET for that leaf
  • j is a natural number identifying one of the parameters (one dimension of the distribution)
  • μ_b is the location of the bottom of the leaf in the jth dimension; μ_t is the top
  • σ is a scaling factor for the steepness of the ramp, a bandwidth controlling the tradeoff between smoothness and detail (also, if σ gets too big, it will inflate the marginal distributions)
  • δ_b is a distance by which we move the midpoint of the ramp either side of μ_b, to control variance inflation; δ_t is the same for μ_t; I call this delta-translation
  • φ is the number of sigmas on either side of μ_t+δ_t and μ_b+δ_b, up to which we evaluate the integral

In practice, I found that there has to be delta-translation towards the mode of the density, in order to control variance inflation. Heuristically, I did this in the IJCEE paper by finding the centre of the leaf (hypercuboid) of highest predicted density; this is the approximate mode. Then, each edge of each leaf is moved towards it by delta, times the sine of the angle between the edge and a line joining the centre of that edge to the approximate mode. In this setting, δ_b=δ_t.
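The geometry is easier to see in code. The sketch below is one possible reading of this heuristic, not the exact IJCEE implementation, and details such as sign conventions may differ; bottoms, tops and densities are placeholder names for the DET leaf edges and leaf densities.

```r
# One reading of the delta-translation heuristic (sketch only).
# bottoms, tops: leaf-by-dimension matrices of DET leaf edges;
# densities: DET density of each leaf; delta: translation distance.
delta_translate <- function(bottoms, tops, densities, delta) {
  centres  <- (bottoms + tops) / 2
  mode_hat <- centres[which.max(densities), ]      # centre of the densest leaf = approximate mode
  shift <- function(edge_value, leaf_centre, j) {
    fc    <- leaf_centre
    fc[j] <- edge_value                            # centre of this face of the leaf
    d     <- sqrt(sum((mode_hat - fc)^2))
    if (d == 0) return(edge_value)
    sine  <- abs(mode_hat[j] - fc[j]) / d          # sine of the angle between face and line to the mode
    edge_value + sign(mode_hat[j] - fc[j]) * delta * sine
  }
  for (l in seq_len(nrow(bottoms))) {
    for (j in seq_len(ncol(bottoms))) {
      bottoms[l, j] <- shift(bottoms[l, j], centres[l, ], j)
      tops[l, j]    <- shift(tops[l, j],    centres[l, ], j)
    }
  }
  list(bottoms = bottoms, tops = tops)
}
```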

algebra

The inverse logistic function



algebra

The kudzu density is the sum of the leaves' densities. Each of those is the product of the univariate components, with each one normalised by dividing by its integral. The numerator and denominator are defined below.

algebra

The numerator of the kudzu density in the jth dimension of the ℓth leaf. Because this will not generally integrate to one, it is not a probability density until it has been normalised. f_DET(ℓ,j) is the marginal density of the DET density for that leaf in the jth dimension.

algebra

The denominator of the kudzu density in the jth dimension of the ℓth leaf, the definite integral over [μ_b+δ_b-φσ, μ_t+δ_t+φσ], is equivalent to setting σ=1, μ_b+δ_b=0, μ_t+δ_t=z_ℓ, and then evaluating over [-φ, z_ℓ+φ]. Writing it in this form leads to the limits below.

If a particular dimension of a leaf is wide enough, relative to the bandwidth, to have a plateau in the middle, then the integral simplifies to the area under the rectangle between ramp midpoints. Let w_ℓ = μ_t+δ_t-μ_b-δ_b be the width of the leaf, and z_ℓ = (w_ℓ)/σ the width in units of σ in the jth dimension. I suggest z_ℓ>12 as a rough guide.
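In R, the normalising constant for one dimension of one leaf might be sketched as below, under my reading of the text above: numerical integration in general, and the rectangle shortcut (height one, width w_ℓ, with the DET leaf density left to multiply the normalised component afterwards) when z_ℓ exceeds the rough guide of 12. The default φ of 10 here is an arbitrary placeholder.

```r
# Normalising constant for one dimension of one leaf (sketch).
# a = mu_b + delta_b and b = mu_t + delta_t are the ramp midpoints.
kudzu_denominator <- function(a, b, sigma, phi = 10, z_cut = 12) {
  z <- (b - a) / sigma                    # leaf width in units of sigma
  if (z > z_cut) {
    return(b - a)                         # plateau: area of the rectangle between the midpoints
  }
  integrate(function(x) plogis((x - a) / sigma) * plogis(-(x - b) / sigma),
            lower = a - phi * sigma, upper = b + phi * sigma)$value
}
```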

Two further considerations remain. First, kudzu, being based on orthogonal partitioning of the parameter space, will work best on uncorrelated priors. (Ensembles of kudzu densities may help; see below.) So, I standardise the prior sample by subtracting means and Hadamard-dividing by standard deviations. Then, I apply principal components analysis (PCA) rotation (premultiplication with the eigenvector matrix), and finally I rescale again by Hadamard-division by the standard deviations of these rotated variables. Kudzu is applied to this transformed sample. Inside Stan (or another sampling algorithm), we evaluate the prior density on this scale, back-transform the proposed parameter vector into the parameters of the likelihood, and evaluate the likelihood.
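The R sketch below shows the kind of transformation described in this paragraph, with placeholder object names; the taxi R file linked further down contains the real implementation.

```r
# Sketch of the pre-kudzu transformation: standardise, rotate by PCA, rescale.
# prior_sample: draws-by-parameters matrix (placeholder name).
m1  <- colMeans(prior_sample)
s1  <- apply(prior_sample, 2, sd)
std <- sweep(sweep(prior_sample, 2, m1, "-"), 2, s1, "/")  # standardise
eig <- eigen(cov(std))                                     # PCA via eigendecomposition
rot <- std %*% eig$vectors                                 # rotate onto principal components
s2  <- apply(rot, 2, sd)
z   <- sweep(rot, 2, s2, "/")                              # rescale the rotated variables
# DET/kudzu is fitted to z. Inside the sampler, a proposed vector theta_z on
# this scale is back-transformed before evaluating the likelihood:
#   theta_std <- as.vector((theta_z * s2) %*% t(eig$vectors))
#   theta     <- theta_std * s1 + m1
```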

Secondly, we must apply a small gradient outside the prior sample bounding box (the smallest hypercuboid containing all prior draws), in order to guide the sampler back to the region of highest density. I do this by mixing 0.98 times the kudzu density with 0.02 times a multivariate normal density, centred on the approximate mode and with standard deviation twice that of the largest principal component.
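On the log scale, as it would be accumulated in a sampler's target density, that mixture can be computed with a two-term log-sum-exp; a sketch, assuming the kudzu and background log-densities at the proposed θ are already available (placeholder names). In Stan, the built-in log_mix() function does the same job.

```r
# Mix the kudzu prior with a little diffuse multivariate normal density,
# so there is always some gradient outside the prior bounding box.
# log_kudzu, log_background: log-densities at the proposed theta (placeholders).
log_mix2 <- function(log_kudzu, log_background, w = 0.98) {
  m <- max(log(w) + log_kudzu, log(1 - w) + log_background)
  m + log(exp(log(w) + log_kudzu - m) + exp(log(1 - w) + log_background - m))
}
```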

algebra

The denominator of the kudzu density in the jth dimension of the ℓth leaf, if it is wide enough to allow for a plateau and hence no interference between the two ramps.

Software for the kudzu algorithm

I used the MLPACK C++ library's command-line interface (CLI) to DETs for the IJCEE paper, then wrote my own DET code in R and C++ for boosting. The MLPACK CLI provides output that kudzu requires, which the R interface does not (I have not looked into Python or Julia), although that output has to be written to an XML file and then parsed.

I provide an R function called kudzu_det here (CC-BY-4.0), which runs MLPACK DET, and then applies the delta-translation and convolution. It requires the R packages readr and xml2, as well as an installed MLPACK CLI. It receives a matrix containing a prior sample, and returns a list containing, among other things, matrices of delta-translated tops and bottoms, and a vector of leaf densities. These are supplied to Stan.
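In outline, the workflow looks something like the sketch below. The argument and list element names here are hypothetical placeholders, so check the function's own documentation for the real interface.

```r
# Hypothetical sketch of the workflow around kudzu_det(); actual argument and
# element names may differ from these placeholders.
library(readr)
library(xml2)
prior_sample <- as.matrix(read_csv("prior_draws.csv"))  # placeholder file of p[oste]rior draws
kz <- kudzu_det(prior_sample)        # runs MLPACK DET, then delta-translation and convolution
stan_data <- list(
  L       = length(kz$leaf_density), # number of leaves
  bottoms = kz$bottoms,              # delta-translated bottoms, leaf by dimension
  tops    = kz$tops,                 # delta-translated tops, leaf by dimension
  f_det   = kz$leaf_density          # DET density of each leaf
)
# stan_data is then combined with the new batch of data and passed to Stan.
```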

An important first step in using kudzu is to run Stan with only the prior (no likelihood), which allows you to compare the original p[oste]rior sample with the resulting posterior. We want to adjust delta and sigma until they are similar. It may not be possible to get an entirely satisfactory correspondence, because we are applying a tree and then smoothing, so there will inevitably be some distortion. The NYC taxi example in the IJCEE paper (see below) provides a worked example of this.
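One simple check in that prior-only run is to compare marginal means and standard deviations of the original p[oste]rior sample with the prior-only Stan draws; a sketch with placeholder object names:

```r
# Compare the original prior sample with the prior-only posterior draws
# (prior_sample, prior_only_draws: placeholder names) to judge whether the
# current sigma and delta settings distort the distribution too much.
check <- data.frame(
  mean_prior = colMeans(prior_sample),
  mean_kudzu = colMeans(prior_only_draws),
  sd_prior   = apply(prior_sample, 2, sd),
  sd_kudzu   = apply(prior_only_draws, 2, sd)
)
check$sd_inflation <- check$sd_kudzu / check$sd_prior  # values above 1 indicate marginal inflation
round(check, 3)
```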

To see the PCA standardisation, rotation and rescaling in action, read the New York taxi R file (also described under IJCEE below), which includes the Stan model code. For your convenience, you can also download the Stan models separately for prior plus likelihood and for prior only.

The flowchart below might help to clarify how the objects are created and passed around.

flowchart

Bayesian windowing and moving weighted averages for streaming data and data drift

In theory, we can use Bayesian updating to get statistical inference not just on accumulating data but also on windows of time. I suspect that the value of Bayesian updating lies mainly in rapidly arriving data (when Big Data people talk about the n Vs, where n is a number unrelated to reality, they usually mention Velocity; voilà!). In that case, there is little point to inference at all once you have accumulated several billion observations. But it is important if you want a moving window of data.

Akidau provides a simple overview of principles here.

If you have fixed windows (see Akidau's figure 8), then each batch of data is in one window. Each window updates the previous ones by taking their posterior and using it as its own prior. If you want to remove an old window, you get its posterior from the archive and subtract its kudzu log-density from your log-prior at each sampling algorithm iteration. But then you have also removed the original a priori priors, and so you should add them back in. However, this adding and then subtracting accumulates distortion.

So, I think in most cases it will be better to have parallel sequences of updates rather than a single one that has to have p[oste]riors subtracted as well as added. Suppose you are adding new taxi journeys to your model every day. Each window is one day, and you want analyses to be based only on the previous week. It is 7 Jan. I accumulate 1 Jan to 7 Jan in one updated model, and in parallel, 2 Jan to 7 Jan, 3 Jan to 7 Jan, and so on. The final one only has one batch of data in it, 7 Jan. My inferences are based on the full set of weekly updates, which is 1 Jan to 7 Jan. Now, 8 Jan's data arrive. I discard the 1 Jan to 7 Jan analysis, add 8 Jan to all the others, create a new set of updates containing only 8 Jan for now, and use the 2 Jan to 8 Jan set for inference. I need only have more hardware to run seven sets in parallel.
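The bookkeeping for that seven-day scheme can be sketched as below. update_with(), batches and initial_prior are placeholders standing in for a full kudzu update, the daily data and the pump-priming prior; the point is only which parallel streams receive which batches.

```r
# Bookkeeping sketch for a seven-day moving window of parallel updates.
window_days <- 7
streams <- list()                                  # one running update per window start day
for (day in seq_along(batches)) {
  streams[[as.character(day)]] <- initial_prior    # open a new stream starting today
  keep <- as.integer(names(streams)) > day - window_days
  streams <- streams[keep]                         # discard the stream that would now exceed a week
  for (start in names(streams)) {                  # add today's batch to every surviving stream
    streams[[start]] <- update_with(streams[[start]], batches[[day]])
  }
  oldest <- as.character(max(1, day - window_days + 1))
  inference_posterior <- streams[[oldest]]         # based on (at most) the last seven days
}
```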

With sliding windows (see Akidau's Figure 8), the collection of parallel updates will be governed by the overlap between the windows as well as their length.

We could also apply a weighted window, to reduce the influence of old data. This is analogous to the evergreen exponentially weighted moving average. It would simply be implemented by adding a (negative) log-scaling factor to the kudzu priors. As they got older, they would be suppressed by ever more negative log-scaling factors. However, there is to my knowledge no statistical theory out there about such a weighted posterior, and this should be developed before practical application.
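To make the idea concrete, the sketch below uses power-prior-style weights that multiply each archived window's kudzu log-density by a factor decaying with age; this is a related but not identical device to the additive log-scaling factors described above, and, as noted, neither has an established theory yet. kudzu_logdens is a placeholder for a list of archived kudzu log-density functions.

```r
# Exponentially weighted log-prior sketch (power-prior style, illustration only).
# kudzu_logdens: list of functions, one per archived window (newest last), each
# returning that window's kudzu log-density at a proposed theta; 0 < lambda < 1.
weighted_log_prior <- function(theta, kudzu_logdens, lambda = 0.8) {
  ages <- rev(seq_along(kudzu_logdens)) - 1   # 0 for the newest window
  sum(lambda^ages * vapply(kudzu_logdens, function(f) f(theta), numeric(1)))
}
```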

Extended material from the 2021 IJCEE paper

This is where you can find the code and extra graphics from the paper "Non-parametric Bayesian updating and windowing with kernel density and the kudzu algorithm", in the International Journal of Computational Economics and Econometrics, 2021.

The journal does not permit sharing of pre-prints. The post-print is embargoed until one year after publication, when I will add it here. In the meantime, you can obtain a paper copy by going to my Buy Me A Coffee page, sending me a small sum to cover printing and postage, and telling me in the comments that you want a copy, along with a postal address. I will send it by mail. I am not allowed to send digital copies.

Abstract

The concept of "updating" parameter estimates and predictions as more data arrive is an important attraction for people adopting Bayesian methods, and essential in big data settings. Implementation via the hyperparameters of a joint prior distribution is challenging. This paper considers non-parametric updating, using a previous posterior sample as a new prior sample. Streaming data can be analysed in a moving window of time by subtracting old posterior sample(s) with appropriate weights. We evaluate three forms of kernel density, a sampling importance resampling implementation, and a novel algorithm called kudzu, which smooths density estimation trees. Methods are tested for distortion of illustrative prior distributions, long-run performance in a low-dimensional simulation study, and feasibility with a realistically large and fast dataset of taxi journeys. Kernel estimation appears to be useful in low-dimensional problems, and kudzu in high-dimensional problems, but careful tuning and monitoring is required. Areas for further research are outlined.

Prior-only analysis

I applied the various methods to samples from these three illustrative two-dimensional prior distributions. These scatterplots of kudzu posteriors under different bandwidths (sigma tuning parameters) for the lasso prior do not appear in the paper.

The kernel and SIR methods are applied in this R file (along with functions for the three prior log-densities and random number generators), and the kudzu algorithm in this R file.

Some of the most interesting output is in these scatterplots of successive posteriors, as we keep updating repeatedly, for every combination of method and distribution (only one of these made it into the paper).

I also summarised distortion using the 2-sliced Wasserstein distance in this R script, then I got the sampling distribution of these Wassersteins under a null hypothesis of pure random sampling thus, and plotted them thus, leading to this image, which was alluded to in the paper, but not printed. You might notice from this that I had parallel folders for the various kudzu bandwidths, specified as proportions of the Sheather-Jones bandwidth.

LASSO prior, repeatedly updated without likelihood, by all the methods

LASSO prior, repeatedly updated without likelihood, by all the methods - click to enlarge. Compare with the banana distribution and the bivariate normal.

Simulation study

I simulated data for a logistic regression model with three parameters, and ran them through ten updates of each of the methods (except kudzu). This ran in a cloud facility virtual machine with 16 cores, so that four instances of R ran in parallel, each running its own simulations in folders instance1, instance2, instance3, instance4. This R script executes one of the instances for the kernel methods, while this R script does the same for SIR.

This R script collects simulation study outputs together. The first 30 updates are shown, for each method and each parameter, in this zip file of the Craft charts. They were generated by this R script. The bold curve in each panel is the all-data posterior, which acts as ground truth. The successive posterior means across all simulations are shown in these line charts.

zoomed in section of craft chart

A Craft chart is named after Harold Craft, an astronomer who used a plotter to visualise periodic signals from the first known pulsar. In line with Stigler's Law of Eponymy, these are generally known as Joy Division charts, joyplots, ridgeline plots, &c &c. This zoomed-in section appeared in the paper. The first update's posterior is at the top of each panel, and subsequent updates lead to posteriors moving downward. The bold curve is the all-data (ground truth) posterior.

Feasibility case study: New York taxis

To test out SIR-A and kudzu on a realistically sized problem, I used the first seven days from the 2013 New York yellow taxi dataset, which was originally obtained and shared by Chris Whong. You don't have to download those files to reproduce my analyses; I offer a tidier format below.

My first step is to process each day's "trip file" and "fare file" into a single "taxi data" file. There are the usual GPS errors, for example, and some storage-hogging variables that are discarded after merging. This is the R script for making the taxi data files. But if you want the seven taxi data files, you can grab this zip file (32MB).

Then, I used this data2mats function in R, which will transform a taxi data file in CSV format into a neat collection of R matrices and vectors. This is the data for a given day.

The initial prior is a sample of 210,000 journeys (30,000 from each day). The R code that follows grabs this from the SIR output list, so you can use this to run a reproducible analysis with the same pump-priming prior sample that I used.

The SIR-A analysis (including data prep) is in this R script (there is no Stan involved for SIR). The output is all in one .rds file. The issue of degeneracy is illustrated by this scatterplot of proposed parameter values from SIR-A, looking at two parameters. The red dots are prior draws and we can see that SIR-A is just sampling around them according to the kernel. The kernels overlap marginally but not jointly. Another example is in this set of scatterplots looking at the parameters for Monday and Friday over successive updates. The red polygons are convex hulls of the next p[oste]rior. In this case, degeneracy is not obvious (as multimodality) until the likelihood curvature has become comparable to the prior curvature, after a few days.

This R script runs kudzu updates on the taxi data. The Stan models are given separately for prior plus likelihood and for prior only.

The p[oste]rior means after the n=210,000 pump-priming and after the 1 Jan kudzu update are shown together in this dotplot. The means after pump-priming are hollow circles and those after 1 Jan kudzu are solid black circles. The parameters are in this order:

  • nine rectangular regions west of 5th avenue, from downtown to uptown, ending at 110th Street; the most southwesterly is omitted as baseline
  • ten rectangular regions east of 5th avenue, from downtown to uptown, ending at 110th Street
  • seven days from Monday to Sunday; because none is omitted as baseline, these function as intercept terms
  • 23 hours from [01:00, 02:00) to [23:00, 00:00); the first hour [00:00, 01:00) is omitted as a baseline
  • the residual standard deviation

The following differences are notable and can be explained to give some reassurance that the update makes sense. Yellow cabs usually have higher fares from midnight to 5:59 by NYC regulations, but this does not apply on 1 January, a public holiday. So, when 1 Jan (n=316,446) is added (so to speak) to the pump-priming (n=210,000 equally drawn from all seven days), we see the drop in fares at 06:00 greatly attenuated, likewise the bump in fares for an evening rush-hour. The Monday intercept term changes somewhat while the other days do not (modulo Monte Carlo error).

The posterior means after 1 Jan are compared to the prior means here (and they look almost identical), while the SDs are compared here (and they look inflated; the dotted line is double inflation). Successive updates don't get much worse, and the inflation is not monotonic over updates, so it may be the early updates that are problematic. Marginal variance inflation is one of the outstanding challenges in applying kudzu effectively. There is much scope for experimentation with adaptive approaches to bandwidth specification and delta-translation.

The effect of bandwidth on posterior variance and covariance is investigated in this R script. Some examples of graphical consideration of the kudzu bandwidth (sigma tuning parameter) are shown in these outputs. We can see when chains diverge because there is not enough gradient within a leaf to guarantee that the chain can escape (sigma=0.008 is too small).

The all-data analyses, which I ran for alternate days, are given in this R script. Posterior stats are compared in this R script.

How to use kudzu ensembles with bagging and boosting

Coming soon

Basics of Bayesian inference for rotations and the URINE class of ensemble models; application to kudzu

Coming soon

What is to be done? Improvements and further research ideas

I have examined kudzu in a mostly if not wholly convex posterior problem, up to p=50. We need to extend that into higher dimensions and trickier posteriors.

My heuristic approach to delta-translation is probably OK with unimodal, convex p[oste]riors. But we need something more flexible. The challenge is that a leaf might adjoin, on the same face, leaves that have a mixture of higher and lower densities than itself. Move towards the higher one and you might inflate variance into the lower one. Move away from the lower one and you might leave a canyon (gully? donga?) between the current leaf and the higher one, which could cause non-convergence, reduced stepsize, all your favourite problems.

SIR-A was tested at p=50 in the IJCEE paper and it flopped badly with degeneracy. However, when it works, it is reliable and super-fast. We should examine combinations of kudzu and SIR, perhaps kudzu at first and then SIR when the likelihood curvature is greatly exceeded by the prior curvature, and also look at SIR-B.

Not every streaming data problem can have a skilled Bayesian computational statistician watching it 24/7. Perhaps there are some useful approaches for automatically detecting and flagging up SIR degeneracy, such as loss of correlation in the posteriors, multimodality, or posterior standard deviations close to the kernel bandwidths. Similarly, we need automatic monitoring tools for kudzu bandwidths.
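The sort of automatic flags meant here might look like the sketch below; the thresholds are arbitrary placeholders and would need tuning.

```r
# Crude automatic flags for SIR degeneracy, following the symptoms listed above.
# draws: current posterior draws; prior_draws: the p[oste]rior sample it came
# from; bandwidths: kernel bandwidth per parameter (placeholder names).
flag_degeneracy <- function(draws, prior_draws, bandwidths) {
  c(
    lost_correlation  = max(abs(cor(prior_draws) - cor(draws))) > 0.3,
    sd_near_bandwidth = any(apply(draws, 2, sd) < 2 * bandwidths),
    multimodal        = any(apply(draws, 2, function(x) {
      d <- density(x)
      sum(diff(sign(diff(d$y))) == -2) > 1   # more than one local maximum
    }))
  )
}
```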

I have only tried a crude form of local oversampling for SIR-B, a hypercuboid defined by maximum likelihood estimation. As p rises, the proportion of such a hypercuboid occupied by a hyperellipse (as an example of a region where a convex likelihood function exceeds some value) drops rapidly. If we could find the manifold defining such a threshold in the likelihood and locally oversample within that, SIR-B could be truly scalable to high p. The computational cost of the manifold hunt matters, of course. Of the options that come to mind, I am most interested in:

  • SIR-cholesky-B: under an assumption of multivariate normality, or at least convexity, rotate and scale the likelihood so that some off-the-shelf hyperellipsoid could be used
  • SIR-hulls-B: find hulls (not necessarily convex though), and locally oversample within them; Ludkin's Hug-and-Hop algorithm comes to mind here

As much of the code is in R, it could usefully be turned into a package. Some hard-coded values, such as the proportion of background diffuse multivariate normal density, should become function arguments, and a new implementation of DET in Rcpp would make it more portable. I looked into the C++ code of MLPACK but the DET content cannot so easily be calved off from the rest (this is not a criticism; they have made a robust, well-integrated package). I might do this, but it is not a priority and might never get done. If you, dear Reader, would like to do it, go ahead. Don't try to make the whole of URINE into a package though, as I am working on that, so you will be able to call urine::some_function to generate proposed rotation matrices or evaluate them under a Cayley prior.

graph of proportion of hypercuboid volume occupied by a hyperellipse

The proportion of hypercuboid volume occupied by a hyperellipse that only just touches the faces of the hypercuboid at two points on either end of each of its axes, as a function of the number of dimensions. This is indicative of the proportion of SIR-B draws that would hit a convex likelihood if drawn from a uniform distribution over the hypercuboid.
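For reference, that proportion is the volume of the inscribed hyperellipsoid divided by the volume of the hypercuboid, which does not depend on the side lengths: π^(p/2) / (2^p Γ(p/2 + 1)). A short snippet to reproduce the curve:

```r
# Proportion of a hypercuboid's volume occupied by the inscribed hyperellipsoid
# (semi-axes equal to half the side lengths).
inscribed_fraction <- function(p) exp((p / 2) * log(pi) - p * log(2) - lgamma(p / 2 + 1))
plot(1:50, inscribed_fraction(1:50), log = "y", type = "b",
     xlab = "number of dimensions", ylab = "proportion of volume")
```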