Among the great things at the Peyresq summer school is the wonderful atmosphere. Of course, staying in a stunningly beautiful place helps. Here is the french community packed in front the village’s only TV set, agonisingly waiting for France to score against Ecuador.
I am getting a little behind schedule, so let us start by summarizing the last talk given yesterday by Jean-François Cardoso.
Big Bang, Big Data
Jean-François gave a wonderful introduction to the Planck mission. For those who don’t know Planck, it is in my opinion one of the most amazing feat of human intelligence in this century. Planck is an ESA satellite measuring with unprecendented precision the Cosmic Microwave Background (CMB). The CMB is in some sense the very first picture of the universe: right after the big bang, the universe was a very hot plasma of charged particles in which light could not travel freely – any photon was constantly scattered by interactions with these charged particles. But roughly after 380000 years, because of expansion, the universe cooled down to a critical temperature at which electrons and nuclei can bind into neutral atoms. At that point, photons were allowed to travel uninjured and we are still receiving them today in the form of a 3K microwave radition that was first detected by Penzas and Wilson in the 60s (although it had been predicted by Gamow I believe). This CMB is an amazingly important object because it carries a lot of information about the universe in its early stage, but also about what happened afterwards. Indeed, Jean-François explained that tiny temperature fluctuations (of only 10^-5 relative amplitude) in the CMB are related to little density inhomogeneities in the plasma that are responsible, by gravitational effects, for the formation of galaxies and the structure of the universe as we know it. Simply put, this is our cosmic DNA. There are many challenges in processing the CMB as explained by Cardoso. Jean-François focused on his work that aimed at using the various frequency bands of the observations to separate the CMB from various pesky foreground contamination. Jean-François used a source separation technique that works so well it is now the de facto technique used to create CMB maps. At the heart of the technique is the use of Fourier techniques to exploit the knowledge of the CMB power spectral density and separate it from contamination.
Second day: follow-up on Mallat’s talk
Let us start by summarizing, very schematically, the end of Stéphane’s talk. The idea is now that our classes do not feature simple geometric invariances (think of images of beavers). This is where things get tricky but potentially even more interesting. First, taking convolutions along geometric deformations will be simplified by switching to a Haar wavelet transform. In the Haar transform, you only need to compute averages and differences between pairs of pixels – they are taken aligned on the signal grid in the classical transform. For Haar scattering, this is generalized by allowing to apply these operations to any pair of points. The question is now how do we choose those pairs ? This seems combinatorially hopeless … Stéphane’s argument is that this pairing can be learned by imposing that we reduce space volume while maintaining inter class margins. Interestingly, but this is out of the scope of this summary, this can be turned into a convex optimization problem. Comparing the scattering transform to some of the most recent deep nets shows that with only two layers performances are comparable but with much less training complexity. The most sophisticated deep nets then add extra layers that are not exactly similar to the first ones and it is not clear what these try to achieve.
Large random matrices
Jamal Najim gave a wonderful first course on large random matrices. This was an introduction to celebrated results, like Wigner’s semi-circle law that controls the density of eigenvalues of large random symmetric matrices. Jamal then took us through the proof of the Marcenko-Pastur which provides the asymptotic density of eigenvalues for large covariance matrices. This was really a tour de force since the proof is rather complex, but Najim highlighted its architecture (a subtle mix of probability and algebra, glued with the Stieltjes transform) with great pedagogy.
First day : High dimensional signal analysis
Today was Stéphane Mallat’s day at Peyresq. Stéphane gave 4 hours on the topic of high-dimensional signal analysis, but his main focus was really to try to give mathematical and intuitive insights into the flabbergasting successes of deep neural networks. First Mallat gave a large but clear panorama of learning theory, covering supervised and unsupervised techniques, SVM and kernel methods. He clearly distinghuished cases where data live on small co-dimension domain of the data space, a property that can be exploited by manifold learning for instance, and problems where we are in high dimensions naturally. In the latter case, one can sometimes fall back on simpler methods when the signal to be learned is separable – the problem reduces to a series of low-dimensional problems. But often, there is no such simplifying assumption. He gave the example of the many body problems in gravitation where masses interact with each other in complex ways.
The problem in high dimensions is that we suffer the so called curse of dimensionality. Although there are many ways to understand it, the most intuitive is as follows: suppose you are to cover the domain on which your data lives with example points, so that it becomes easy to locally interpolate new, unseen, examples. The number of points you need to maintain a small distance between examples grows exponentially with the dimension of the data space. And quickly, when we are dealing with high-dimensional data, things become untractable. One solution of course is if your data lives on small dimensional subspace: the density of points can be sufficient to locally interpolate. But with no other assumption, it is not feasible to reduce the dimension. What do we do in this case ? If we were to try nearest neighbor interpolation, all neighbors of a given point would simply be too far away; this is the curse.
Mallat then explained the basic single layer neural net, showing that it is similar to approximating your data with a dictionary composed of ridge functions. A ridge function is a non-linear function applied to a simple weighted average of the data points. This shows that trying to use a single layer net basically reduces to approximation theory and one can therefore leverage this analogy to obtain a fundamental bound on learning efficiency. Essentially wherre controls the regularity of the function you’re trying to regress and is the dimension. You immediately see that if d is big, this result becomes meaningless…
Enter deep neural networks. In a deep net, you basically cascade the two basic ingredients of a single layer net: a linear combination of input data with weights to be learnt, composed with a non-linearity. No need to remind you here of the mind-blowing success of deep nets. The question is: why does it work so well ? The question of course is important if we also want to understand if it can fail, or if we can hope to make it even better.
Stéphane’s first ingredient is to model layers of a neural net as the composition of two contraction operators (a simple averaging operator and a non-linearity) whose ultimate goal is to reduce the volume of data. This is important: the goal is not to reduce dimensionality, but to reduce data volume, compressing points closer so that their density becomes useful for local interpolation while maintaining a sufficient margin between classes: our problem would then be to interpolate a regular function on a high-density set of points – easy !!! The idea of contraction therefore makes a lot of sense.
Stéphane’s second ingredient is to quotient out variability in the data by using representations that are invariant. It is easy to craft invariance to translation and to fixed transformation groups by means of (generalized) convolutions with filters (this is an averaging operation, it enforces invariance) and taking moduli (think of removing the phase of the Fourier transform). Let us explain this more intuitively : if we were to try to be invariant to translations only, taking the modulus of the Fourier transform would do. If however you want to be almost invariant to small deformation (diffeomorphisms) Fourier will not work : even a small diffeomorphism would generate significant perturbations at higher frequencies. Are we doomed ? Not really. Wavelets are precisely shaped to counter the increase of those perturbations at high frequencies since their frequency support gets higher with frequency. It does make a lot of sense then to use the modulus of wavelet coefficients to be invariant to both translations and small diffeomorphisms. A scattering transform is the object you get when you cascade these contracting operators.
The scattering transform is a very interesting way to explore the properties of deep networks, because you can actually understand (even if intuitively) why things work. Stéphane then used the scattering transform to tackly high dimensional classification problems. Notice that, in his construction and in stark contrast to deep nets, you don’t train anything: you use your knowledge of invariance to construct the scattering transform and then apply a simple classifier to the scattering coefficients.
This is all great when you know what transformations you want to be invariant to. But what happens if you look at categories of images ? There is no geometric regularity among images of beavers for instance. How do you maintain invariance in this case. This will be the topic of our next post 🙂
I am preparing material for the upcoming Peyresq summer school on signal processing in high dimensions. My lectures will cover the emerging field of signal processing on graphs, following a rather classical approach based on harmonic analysis. By harmonic analysis on graphs or networks, I really mean building on the spectral theory of the graph Laplacian to construct a coherent data processing framework. It allows us to understand localisation, smoothness, to filter and construct multiscale transforms (i.e wavelets), all this on generic graphs and networks therefore generalising many familiar concepts from signal processing. These tools can then be used for applications like de-noising, missing data imputation, machine learning, building recommender systems etc …
Here is in more details what I plan to cover, in five lectures:
- I introduce the graph Laplacian , it’s spectral theory and the associated Borel functional calculus. I’ll quickly discuss other discrete differential operators (gradient, divergence). These can be viewed as linear operators acting on signals defined at the vertices of the graph. I use them to characterise the smoothness of signals and explain how spectral graph theory is used for clustering, finding embeddings (Laplacian eigenmaps). I conclude by designing a de-noising framework based on Tikhonov regularisation that hints at the idea of graph filters defined on the spectral domain.
- In the second lecture I define more formally spectral kernels and how they are used to construct a kernel localisation operator that acts like a convolution. I detail the fundamental limits on the localisation one can obtain based on the smoothness of the generating kernel.
- In the third lecture I describe the construction of wavelets on graphs and provide an efficient algorithm to implement the wavelet transform. I highlight some applications to machine learning (transductive learning), recommender systems but also to processing point clouds. I also introduce a Gabor-like transform and discuss its use in constructing an uncertainty principle.
- The fourth lecture is devoted to a multiscale framework unifying wavelets, interpolation on graphs with operations such as coarsening and sparsification .
- The fifth lecture is devoted to results linking spectral graph theory to the spectral theory of the Laplace-Beltrami operator and why this is interesting in manifold learning for instance.