First day: High-dimensional signal analysis
Today was Stéphane Mallat’s day at Peyresq. Stéphane gave four hours of lectures on high-dimensional signal analysis, but his main focus was really to give mathematical and intuitive insights into the flabbergasting successes of deep neural networks. Mallat first gave a broad but clear panorama of learning theory, covering supervised and unsupervised techniques, SVMs and kernel methods. He clearly distinguished cases where the data live on a low-dimensional domain of the data space, a property that manifold learning can exploit for instance, from problems that are naturally high-dimensional. In the latter case, one can sometimes fall back on simpler methods when the signal to be learned is separable: the problem then reduces to a series of low-dimensional problems. But often there is no such simplifying assumption. He gave the example of the many-body problem in gravitation, where masses interact with each other in complex ways.
The problem in high dimensions is that we suffer from the so-called curse of dimensionality. Although there are many ways to understand it, the most intuitive is as follows: suppose you want to cover the domain on which your data live with example points, so that it becomes easy to locally interpolate new, unseen examples. The number of points you need to maintain a small distance between examples grows exponentially with the dimension of the data space, and with high-dimensional data things quickly become intractable. One way out, of course, is when your data live on a low-dimensional subspace: the density of points can then be sufficient to interpolate locally. But without such an assumption, it is not feasible to reduce the dimension. What do we do in this case? If we were to try nearest-neighbor interpolation, all neighbors of a given point would simply be too far away; this is the curse.
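To make the covering argument concrete, here is a minimal sketch in plain Python. The function names, the unit cube as data domain and the uniform random sampling are all assumptions chosen for illustration; they are not from the lecture.

```python
import math
import random

def grid_points_needed(points_per_axis: int, dim: int) -> int:
    """Number of points in a regular grid over the unit cube [0, 1]^dim
    with `points_per_axis` points along each axis."""
    return points_per_axis ** dim

def mean_nearest_neighbor_distance(n_points: int, dim: int, seed: int = 0) -> float:
    """Average distance from each point to its nearest neighbour, for
    n_points drawn uniformly at random in [0, 1]^dim (illustrative assumption)."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    total = 0.0
    for i, p in enumerate(pts):
        total += min(math.dist(p, q) for j, q in enumerate(pts) if j != i)
    return total / n_points

if __name__ == "__main__":
    for d in (1, 2, 10, 100):
        # Grid size explodes with d, while a fixed budget of 200 samples
        # leaves every point farther and farther from its nearest neighbour.
        print(d, grid_points_needed(10, d),
              round(mean_nearest_neighbor_distance(200, d), 3))
```

With ten points per axis the grid already has 10^d points, and with a fixed budget of samples the nearest neighbour of any point drifts away as d grows, which is exactly why local interpolation stops working.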
Mallat then explained the basic single-layer neural net, showing that it amounts to approximating your data with a dictionary of ridge functions. A ridge function is a non-linearity applied to a simple weighted sum of the input coordinates. This shows that using a single-layer net basically reduces to approximation theory, and one can therefore leverage this analogy to obtain a fundamental bound on learning efficiency. Essentially, the bound is driven by the ratio s/d, where s controls the regularity of the function you’re trying to regress and d is the dimension. You immediately see that if d is big, this result becomes meaningless…
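As a rough illustration of the ridge-function picture, here is a small sketch. The ReLU non-linearity, the function names and the rate used in `terms_needed` are assumptions: the rate is the textbook approximation-theory scaling (error of order M^(-s/d) with M terms, for a function with s degrees of regularity in dimension d), not necessarily the exact bound shown in the lecture.

```python
import math

def relu(t: float) -> float:
    """Illustrative choice of non-linearity (an assumption, any scalar rho works)."""
    return max(0.0, t)

def ridge(x, w, b, rho=relu):
    """Ridge function rho(<w, x> + b): it only varies along the direction w
    and is constant along every direction orthogonal to w."""
    return rho(sum(wi * xi for wi, xi in zip(w, x)) + b)

def single_layer_net(x, weights, biases, coeffs, rho=relu):
    """f(x) = sum_k coeffs[k] * rho(<weights[k], x> + biases[k]),
    i.e. a linear combination of ridge functions from a dictionary."""
    return sum(a * ridge(x, w, b, rho)
               for a, w, b in zip(coeffs, weights, biases))

def terms_needed(eps: float, s: float, d: int) -> float:
    """ASSUMPTION: textbook rate error ~ M^(-s/d), so reaching error eps
    takes about M ~ eps^(-d/s) dictionary terms."""
    return eps ** (-d / s)

if __name__ == "__main__":
    # |t| written as a two-term ridge expansion: relu(t) + relu(-t).
    print(single_layer_net([-3.0], [[1.0], [-1.0]], [0.0, 0.0], [1.0, 1.0]))
    for d in (2, 20, 200):
        print(d, terms_needed(0.1, 2.0, d))
```

Under that assumed rate, the number of terms needed for a fixed accuracy grows exponentially with d unless the regularity s grows along with it, which is the sense in which the bound stops being useful.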
Enter deep neural networks. In a deep net, you basically cascade the two basic ingredients of a single-layer net: a linear combination of the input data with weights to be learnt, composed with a non-linearity. No need to remind you here of the mind-blowing success of deep nets. The question is: why does it work so well? The answer matters if we also want to understand when it can fail, or whether we can hope to make it even better.
Stéphane’s first ingredient is to model the layers of a neural net as the composition of two contraction operators (a simple averaging operator and a non-linearity) whose ultimate goal is to reduce the volume of the data. This is important: the goal is not to reduce dimensionality but to reduce data volume, pushing points closer together so that their density becomes useful for local interpolation, while maintaining a sufficient margin between classes. Our problem then becomes interpolating a regular function on a high-density set of points: easy! The idea of contraction therefore makes a lot of sense.
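Here is a toy sketch of what "contraction" means in practice. The circular moving average and the ReLU are stand-ins I picked for the averaging operator and the non-linearity (assumptions, not Mallat's actual operators); both are non-expansive in the Euclidean norm, so their composition can only bring two inputs closer.

```python
import math
import random

def moving_average(x, width=3):
    """Circular moving average over `width` neighbours (width must be odd).
    Its Fourier multiplier has modulus <= 1, so it never increases distances."""
    n, half = len(x), width // 2
    return [sum(x[(i + k) % n] for k in range(-half, half + 1)) / width
            for i in range(n)]

def relu(x):
    """Pointwise non-linearity; |relu(a) - relu(b)| <= |a - b|."""
    return [max(0.0, v) for v in x]

def layer(x):
    """One toy layer: averaging followed by a non-linearity."""
    return relu(moving_average(x))

def cascade(x, depth):
    """Apply `depth` toy layers in a row."""
    for _ in range(depth):
        x = layer(x)
    return x

if __name__ == "__main__":
    rng = random.Random(0)
    x = [rng.uniform(-1, 1) for _ in range(16)]
    y = [rng.uniform(-1, 1) for _ in range(16)]
    for depth in range(5):
        # The distance between the two representations never grows with depth.
        print(depth, math.dist(cascade(x, depth), cascade(y, depth)))
```

This toy layer has no learned weights and does nothing to keep a margin between classes, so it only illustrates the volume-reduction half of the argument.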
Stéphane’s second ingredient is to quotient out variability in the data by using invariant representations. It is easy to craft invariance to translations and to fixed transformation groups by means of (generalized) convolutions with filters (an averaging operation, which enforces invariance) and by taking moduli (think of removing the phase of the Fourier transform). Let us explain this more intuitively: if we only wanted invariance to translations, taking the modulus of the Fourier transform would do. If, however, you want to be almost invariant to small deformations (diffeomorphisms), Fourier will not work: even a small diffeomorphism generates significant perturbations at high frequencies. Are we doomed? Not really. Wavelets are precisely shaped to counter the growth of those perturbations at high frequencies, since their frequency support gets wider with frequency. It therefore makes a lot of sense to use the modulus of wavelet coefficients to be invariant to both translations and small diffeomorphisms. A scattering transform is the object you get when you cascade these contracting operators.
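Both halves of the Fourier argument are easy to check numerically. The sketch below uses a naive DFT in plain Python; the test signals, the 2% dilation and the helper names are assumptions chosen for illustration. A small dilation is used as the simplest example of a deformation: it moves a tone at frequency f to roughly f(1 + eps), a displacement that grows with f.

```python
import cmath
import math

def dft(x):
    """Naive O(n^2) discrete Fourier transform (fine for short signals)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def fourier_modulus(x):
    """Modulus of the DFT: the phase, which carries the shift, is discarded."""
    return [abs(c) for c in dft(x)]

def circular_shift(x, tau):
    """Translate x by tau samples, wrapping around at the boundary."""
    n = len(x)
    return [x[(t - tau) % n] for t in range(n)]

def tone(freq, n=256):
    """Sampled sinusoid with `freq` cycles over the window."""
    return [math.cos(2 * math.pi * freq * t / n) for t in range(n)]

def relative_distance(a, b):
    """||a - b|| / ||a|| in the Euclidean norm."""
    return math.dist(a, b) / math.sqrt(sum(v * v for v in a))

if __name__ == "__main__":
    x = [math.sin(0.3 * t) + 0.5 * math.cos(1.1 * t) for t in range(64)]
    shifted = circular_shift(x, 7)
    # The signals differ, but their Fourier moduli agree up to rounding.
    print(relative_distance(x, shifted))
    print(relative_distance(fourier_modulus(x), fourier_modulus(shifted)))

    eps = 0.02  # small dilation (assumed value)
    for freq in (5, 100):
        before = fourier_modulus(tone(freq))
        after = fourier_modulus(tone(freq * (1 + eps)))
        # Same dilation, much larger change for the high-frequency tone.
        print(freq, relative_distance(before, after))
```

The low-frequency tone barely moves in the spectrum, while the high-frequency one jumps to different Fourier bins, so the Fourier modulus sees the two deformed signals very differently.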
The scattering transform is a very interesting way to explore the properties of deep networks, because you can actually understand (even if only intuitively) why things work. Stéphane then used the scattering transform to tackle high-dimensional classification problems. Notice that in his construction, in stark contrast to deep nets, you don’t train anything: you use your knowledge of the invariances to construct the scattering transform and then apply a simple classifier to the scattering coefficients.
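To give a feel for the cascade, here is a deliberately crude 1D sketch with nothing to train. Everything in it is an assumption made to keep it short: Haar-like difference filters stand in for proper wavelets, the final averaging is a global mean (which gives exact invariance to circular shifts), and only two orders are computed. It is not the construction used in the lecture.

```python
import random

def circular_conv(x, h):
    """Circular convolution: out[i] = sum_k h[k] * x[(i - k) mod n]."""
    n = len(x)
    return [sum(hk * x[(i - k) % n] for k, hk in enumerate(h))
            for i in range(n)]

def haar_filter(j):
    """Haar-like filter at scale 2^j: zero mean, absolute values summing to 1,
    so convolving with it is non-expansive. Stand-in for a real wavelet."""
    half = 2 ** j
    return [1.0 / (2 * half)] * half + [-1.0 / (2 * half)] * half

def wavelet_modulus(x, j):
    """|x * psi_j|: filter at scale j, then discard the sign."""
    return [abs(v) for v in circular_conv(x, haar_filter(j))]

def mean(x):
    return sum(x) / len(x)

def scattering(x, n_scales=3):
    """Order 0, 1 and 2 coefficients with global averaging.
    The signal should be at least 2^n_scales samples long."""
    coeffs = [mean(x)]                                    # order 0
    for j1 in range(n_scales):
        u1 = wavelet_modulus(x, j1)
        coeffs.append(mean(u1))                           # order 1
        for j2 in range(j1 + 1, n_scales):
            coeffs.append(mean(wavelet_modulus(u1, j2)))  # order 2
    return coeffs

if __name__ == "__main__":
    rng = random.Random(0)
    x = [rng.uniform(-1, 1) for _ in range(32)]
    shifted = x[-5:] + x[:-5]  # circular shift by 5 samples
    print(scattering(x))
    print(scattering(shifted))  # same coefficients up to rounding
```

The resulting coefficient vector is what you would hand to a simple classifier; the filters are fixed in advance from the invariance you want rather than learned from data.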
This is all great when you know what transformations you want to be invariant to. But what happens if you look at categories of images? There is no geometric regularity among images of beavers, for instance. How do you maintain invariance in this case? This will be the topic of our next post 🙂