James Murphy

Hadoop Introduction: Using Hadoop for Sequential Monte Carlo (Particle Filter) Parameter Estimation

Recently, I have become interested in parameter estimation for sequential Monte Carlo (SMC) methods (primarily, but not exclusively, particle filters), having written about them for my forthcoming book (obligatory plug – sorry). I’ve also become interested in learning about Hadoop (an interest not uncorrelated with my current job search in the SF Bay Area). So I thought to myself: why not combine these interests into a single fun and exciting project?

This seemed like a good practice problem because:

  • It involves a real algorithm that I am interested in, understand well and have recently written about
  • Particle filters (and online parameter estimation) lend themselves (to some extent) to formulation in the MapReduce paradigm
  • It’s computationally demanding for big data sets, so someone might actually be interested in this (e.g. in the data assimilation community for things like weather and climate models)
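
To make the second point a little more concrete, here is a toy sketch in plain Python (no Hadoop involved) of how one particle filter step might split into a map stage and a reduce stage. Everything in it is an assumption made for illustration: the random-walk state model, the unit-variance Gaussian observation noise, and the names map_particle, combine and filter_step are all hypothetical, not taken from any library or from the later parts of this series.

```python
import math
import random
from functools import reduce


def map_particle(state, observation, rng):
    """Map stage: propagate one particle and weight it against the observation."""
    # Assumed model: Gaussian random-walk state, unit-variance Gaussian observation noise.
    new_state = state + rng.gauss(0.0, 1.0)
    weight = math.exp(-0.5 * (observation - new_state) ** 2)
    return new_state, weight


def combine(acc, pair):
    """Reduce stage: accumulate (sum of weights, sum of weight * state)."""
    state, weight = pair
    return acc[0] + weight, acc[1] + weight * state


def filter_step(particles, observation, rng):
    """One filter step: map over particles, reduce to an estimate, then resample."""
    mapped = [map_particle(x, observation, rng) for x in particles]
    total_weight, weighted_sum = reduce(combine, mapped, (0.0, 0.0))
    estimate = weighted_sum / total_weight  # weighted mean of the particle states
    # Multinomial resampling in proportion to the weights of all particles.
    states = [s for s, _ in mapped]
    weights = [w for _, w in mapped]
    resampled = rng.choices(states, weights=weights, k=len(states))
    return estimate, resampled


rng = random.Random(0)
particles = [rng.gauss(0.0, 1.0) for _ in range(200)]
estimate, particles = filter_step(particles, 0.5, rng)
```

The map stage touches each particle independently, which is the part that parallelises naturally. The resampling at the end needs the weights of every particle at once, which is one reason the fit with MapReduce is only "to some extent".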

Now let’s see how this turns out…  See parts I, II, III and IV.

Why Hadoop?

Well, why not?  Everyone is doing it!  But seriously, in some ways Hadoop doesn’t seem like an obvious platform for implementing particle filters (PFs), because the ‘weight’ of the Hadoop infrastructure is large compared to the amount of processing that most PF applications require for a particle or group of particles.  By weight, I mean all the overheads associated with dividing jobs up, farming them out to nodes, running mappers on them, gathering their output together and then ‘reducing’ it to something useful.  And this is true: the scale of most PFs is clearly much smaller than the sort of problem that Hadoop is typically used for.

But this need not always be true.  Some very big PF systems are in use, for example for data assimilation in fluid modelling (i.e. taking data from sensors and incorporating it into existing models) in areas such as climate and weather modelling.  Because PFs can struggle with high-dimensional systems, a lot of effort goes into making sure that the particles are ‘good’, which requires a lot of computation to simulate very complicated stochastic systems.  As a result, these filters require serious computational resources and are often run on supercomputers.  So maybe there is a real application out there for a Hadoop-based filter of this sort.

More importantly (in terms of my motivation for this project), I wanted to gain at least a passing familiarity with this technology, which has become increasingly important in the data processing and large-scale machine learning worlds in recent years.