# Peyresq 2014: Last day

Tags

The 2014 edition of the Peyresq summer school just finished. Before the coda, let us summarize the very last talk of Jamal Najim on his series on large random matrices. Jamal’s talk was focused on applications of results he described us in the previous day. I will focus on the application to direction of arrival estimation (DOA), although Jamal also discussed MIMO communications.
In DOA the goal is to estimate the directions to r sources from n measurements that mix the sources linearly with phase factors and additive noise:
$\vec{y} = \sum_{\ell=1}^{r} \vec{a}(\varphi_\ell)s_\ell + \sigma \vec{w}$
The classical MUSIC algorithm uses the fact that one can obtain a characterization of the signal subspace from the measurements~:
$\vec{a}^* (\varphi) \bigl( I_N - \Pi_N \bigr) \vec{a} (\varphi) = 0 \Leftrightarrow \varphi \in \{\varphi_1, ..., \varphi_r\}$
where $\Pi_N$ is the ortho projection on the span of the $\vec{a}(\varphi_\ell$. When the sources are drawn at random, this argument is replaced by an estimation using the ortho projection on the eigenvectors of the measurements covariant matrix since~:
$\frac{1}{n} \mathbb{E}\{Y_N Y_N^*\} = A_N(\vec{\varphi}) \mathbb{E}\{ \frac{S_N S_N^*}{n}\} A^*_N(\vec{\varphi}) + \sigma^2 I_N$.
This naturally leads to replacing $\Pi_N$ with the projection on the first r eigenvectors of the sample covariance matrix. Except that this is not a super good idea: modeling the sources by complex gaussian variables, one obtains a multiplicative spiked model with a rank r perturbation for the observations and we have learned with Jamal that the sample covariance matrix is not a good estimator of the covariance in this case. Fortunately the theory of large random matrices tells us how we can construct a consistent estimator which turns out to involve if a simple de-biasing correction. Jamal then showed with simulations how and when this method improves on the original MUSIC.

This was the conclusion of an absolutely marvelous week at Peyresq. I could not seriously express with words how much I enjoyed the week: I learned a lot of new things, discussed with many, got constructive feedback on my own work and this in turn generated many ideas I now want to work on. Every workshop should be like this instead of gigantic waste of time, energy and money (a whole week at Peyresq will set you back much less than the registration at those big conferences).

There is always a bit of saudade when such an excellent week ends. But here is an enormous thank you to :
. Patrick and Cédric who organized the event and made sure it was nothing short of perfect
. The Peyresq local team who makes it such a wonderful place to stay at
. All participants, speakers and students: you guys stay cool and keep up the good work!

# Fifth day at Peyresq, continued

Tags

This post is courtesy of Pierre Borgnat, a huge thanks !

And what about Pierre Vandergheynst’s lectures ? Spread in three sessions, Pierre discussed about the recent advances in Harmonic Analysis on Graphs an Networks. The plan of the lectures was already given on the blog and I will cover a few salient points. In short, this was a marvellous series of lectures, especially given the necessary flexibility that Pierre displayed, when continuing his lecture on the black board due an electricity outage — then improvise a follow-up with both blackboard and a slides due to popular insistence!

First session on the second day, Pierre covered all basics of Spectral Graph Theory and all participants had a refresh about Laplacian matrices, spectrum of a graph, spectral cuts, spectral clustering and graph embedding using Laplacian eigenmaps. A nice example was shown for matching points clouds of a warped surface, for instance the shape of a map in point clouds — unfortunately, this would not work well if someone with his hand on the top of his head
Then, attacking the main core of the lectures, Pierre defined operators to study signals indexed on graphs: gradients, regularity, total variation. A central point is the intriguing and important notion coherence of a graph (the largest absolute value of the components of the eigenvectors of the Laplacian) — more on that later on.
Using functional calculus on graphs, it’s possible to define operators through the spectral domain so as, for instance, solve diffusion equation on a graph. The good thing is that it is now possible to give meaning to what is a filtering of graph signals (with examples of denoising or recommendation system). End of the first session.

Third day, second lecture: no electricity… And so what ? Pierre improvised on the blackboard. In the lectures, he showed that it is possible to define first operators to localise a kernel on a graphs and then how it can be interpreted as a translation so as to propose a generalisation of the convolution operator. Still, the “agonizing limit of intuition” limits some of the properties: the translation operator is not unitary, when its coherence departs from 1/\sqrt{N}. That might problematic in specific situations, or will that be an asset to study the local topology of the graphs? (we would be salved when studying graphs on manifolds or accepting a group of translation — these are developments not covered here). With all that, it was direct to define the spectral graph wavelets that provide a continuous and invertible transform using function localized both in frequency and vertices. The good points is that it’s possible to obtain a frame, even a tight frame; that they have nice localisation properties; and finally that implementation can be efficient using Chebyshev polynomials (“love Chebyshev !”). Better, a python toolbox is soon enough available… (the SGWT toolbox in matlab already exists).

Last day at Peyresq: two more hours par Pierre in which we learned how to put coarsen both signals and graphs by putting wavelets in a pyramidal algorithm. Fist ingredient: downsample the graphs using the nodal domains of the (or one of the) eigenvectors with largest eigenvalues of the Laplacian. Second ingredient: filter the signals (easy, remember that we learnt about graph filters and wavelets). Third ingredient: coarsen the graph y keeping only half of the nodal domains, and transform the graph using the Kron reduction (that should activate your memories of basic electricity). A major property is that the Kron reduction algebraically written as the Schur complement of a matrix) preserves properties of the Laplacian and it is useful to compute spline-like interpolation on a signal. This allows us to see the difference between the filtered signal retained on the coarsened graph and the original signal on the original graph (having all the vertices). With all that, you have a sound Laplacian pyramid to filter and coarsen graph signals. To make all that more efficient, the sparsification of the coarsened graphs is useful, using the work of Spielman and Srivastava (second side message of the lecture: once you now all about Chebyshev polynomials, read this paper in SIAM J. Comp. of 2011). The conclusion is that many applications about filtering on graphs, learning on graphs, visualisation of graphs, and so on, just wait in our reach. At the end of the course, it was a very motivating series of lectures.

What was skipped in the lectures? Materials were given in the slides about manifold learning with graph spectral theory or vertex-frequency analysis (in the spirit of the time-frequency analysis or Gagor transform — yet on graphs). I advise all of you to read the nice paper (on ArXiv) by David Shuman, Benjamin Ricaud and P. V. on the latter subject, and to discuss with Pierre about the former one.

# Fifth day at Peyresq

Tags

This post is courtesy of Johan and Nathanaël – a big thanks !

large random matrix theory applied to Multi Input Multi Output (MIMO)
antenna systems and large random networks.

He started with a short introduction to information theory, especially
to capacity of a communication channel, on the blackboard, with style.
Olivier then showed a very powerful result in the field of MIMO
communications, which states that when the number $\latex n$ of antennas
grows we get a capacity or order $\latex n$, or

$C_n = n \int_0^4 \log ( 1 + Px) q(x) dx$.

This result is powerful since it means that theoretically the capacity
grows linarly with the number of antennas (which is a significant
improvement of the previously believed limiting order

$C = log(1 + Pn^2)$

for multi antennas systems).

Olivier also enphasized that we can achieve this capacity without
increasing the power P, but simply by distributing it in a clever way,
which can be done using water-filling techniques. He concluded the first
part of the lecture by reminding that, although this result is
demonstrated for n goes to infinity, it actually works already for small
values of n, which explains the joke common amongst larse random matrix
people which states : $8 = \infty$.

The second main part of Oliver’s talk dealed with large self-organized
wireless communication networks. The main question proposed to answer
was: assuming that n users want to communicate in a given area with an
O(1) density, what is the rate $R_n$ they can hope to achieve ?
And in relation to this statement, how does the rate scale with n ?

In a setup where we wish to establish n communication channels, we use a
model where the signals that are received depend on the gain between two
nodes divided by the distance, with an additive white gaussian noise.
Note that the gain is i.i.d. and follows a complex gaussian
distribution. Finally the power of emission per user is limited and has
a given value P. For this model we would like to achieve $R_n = O(1)$

Following this model, a very naive scheme is to dispatch the n different
pairwise communication instances into n different time slots (or
equivalently n different frequency bands). In this case we achieve
$R_n = O(\frac{1}{n})$ which is really bad for n large.

The second approach uses multi-hop communication. And in this case,
Olivier explained that due to an overload of the users in the center of
the network the limiting rate is $R_n = O(\frac{1}{\sqrt{n}}$,
which is better but still not good.

Finally, Oliver presented a scheme coined Distributed MIMO
communications which proceeds in an iterative way. At each stage, a user
broadcasts its message to his neighborhood, which will then send the
message to the neighborhood of the destination user, which receives all
the messages from its own neighbors. We can then recursively use this
scheme for the communication within the neighborhood. In this case,
Oliver showed that the limiting rate for large n with k recursions is
$R_n = O(n^{\frac{-1}{k+1}})$ which tends to $O(1)$ when
k goes to infinity.

model, which states that considering gains to be i.i.d. is in fact
unrealistic. Some authors indeed claims that for this setup we can only
hope to have $O(2n)$ i.i.d. random variables (instead of the
$O(n^2)$ that are needed to prove the limits presented above. And
this would bring down the rate back to $O(\frac{1}{\sqrt{n}})$.
The final word provided a way to avoid this problem simply by stating
that practical networks tend to have a lower density, and that we can
achieve interesting rates given that the area is large enough.

# Fourth day at Peyresq

Tags

This morning we first had a talk by Pierre Borgnat on the use of data analytics for bike renting systems, which gives you access to various rhythms of a city.

We then all went for a super nice trekking around Peyresq, in pristine weather conditions: talk about perfection …

In the afternoon Pierre Borgnat gave a more technical two hours lecture on multiscale community detection and dynamic graphs. Pierre first convinced us that spectral clustering is not necessarily very good for community detection because there is no measure of partition quality and no reference to a null hypothesis (no community). This is why modularity was introduced. Modularity measures how much your graph departs from a null hypothesis where your graph is random with the same degrees. Optimizing modularity is difficult (NP complete?) but there exists a greedy algorithm (the Louvain method, what else?) that solves it, approximately, in polynomial time. Pierre then set on a challenging agenda. He first discussed cases where one really needs a multiscale definition of communities. In fact this is quite often the case since they tend to agglomerate at various scales and allow for rich interpretation. Borgnat used a wavelet transform transform on graphs to give an original and intuitive definition of what multiscale communities are. One of the niceties is that the method comes with a fast algorithm that scales up to fairly large graphs. Another important point stressed by Pierre is the definition of a stability criterion that allows to judge partitions. Pierre showed various applications and a second talk by Rasha Boulos even used the technique in genomics in a very interesting way. In the second part of his talk, Pierre discussed dynamical graph models. This is a very emerging domain where a lot remains to be done. Pierre wet our appetite with interesting ideas that allow to detect and track patterns in dynamical applications.

# Interlude

Tags

Among the great things at the Peyresq summer school is the wonderful atmosphere. Of course, staying in a stunningly beautiful place helps. Here is the french community packed in front the village’s only TV set, agonisingly waiting for France to score against Ecuador.

Yesterday morning was also fun thanks to a power outage that left us without beamer. I improvised my talk on the blackboard, with a little help from Patrick and Olivier.

# Third day at Peyresq

Tags

I am getting a little behind schedule, so let us start by summarizing the last talk given yesterday by Jean-François Cardoso.

Big Bang, Big Data

Jean-François gave a wonderful introduction to the Planck mission. For those who don’t know Planck, it is in my opinion one of the most amazing feat of human intelligence in this century. Planck is an ESA satellite measuring with unprecendented precision the Cosmic Microwave Background (CMB). The CMB is in some sense the very first picture of the universe: right after the big bang, the universe was a very hot plasma of charged particles in which light could not travel freely – any photon was constantly scattered by interactions with these charged particles. But roughly after 380000 years, because of expansion, the universe cooled down to a critical temperature at which electrons and nuclei can bind into neutral atoms. At that point, photons were allowed to travel uninjured and we are still receiving them today in the form of a 3K microwave radition that was first detected by Penzas and Wilson in the 60s (although it had been predicted by Gamow I believe). This CMB is an amazingly important object because it carries a lot of information about the universe in its early stage, but also about what happened afterwards. Indeed, Jean-François explained that tiny temperature fluctuations (of only 10^-5 relative amplitude) in the CMB are related to little density inhomogeneities in the plasma that are responsible, by gravitational effects, for the formation of galaxies and the structure of the universe as we know it. Simply put, this is our cosmic DNA. There are many challenges in processing the CMB as explained by Cardoso. Jean-François focused on his work that aimed at using the various frequency bands of the observations to separate the CMB from various pesky foreground contamination. Jean-François used a source separation technique that works so well it is now the de facto technique used to create CMB maps. At the heart of the technique is the use of Fourier techniques to exploit the knowledge of the CMB power spectral density and separate it from contamination.

# Second day at Peyresq

Second day: follow-up on Mallat’s talk

Let us start by summarizing, very schematically, the end of Stéphane’s talk. The idea is now that our classes do not feature simple geometric invariances (think of images of beavers). This is where things get tricky but potentially even more interesting. First, taking convolutions along geometric deformations will be simplified by switching to a Haar wavelet transform. In the Haar transform, you only need to compute averages and differences between pairs of pixels – they are taken aligned on the signal grid in the classical transform. For Haar scattering, this is generalized by allowing to apply these operations to any pair of points. The question is now how do we choose those pairs ? This seems combinatorially hopeless … Stéphane’s argument is that this pairing can be learned by imposing that we reduce space volume while maintaining inter class margins. Interestingly, but this is out of the scope of this summary, this can be turned into a convex optimization problem. Comparing the scattering transform to some of the most recent deep nets shows that with only two layers performances are comparable but with much less training complexity. The most sophisticated deep nets then add extra layers that are not exactly similar to the first ones and it is not clear what these try to achieve.

Large random matrices

Jamal Najim gave a wonderful first course on large random matrices. This was an introduction to celebrated results, like Wigner’s semi-circle law that controls the density of eigenvalues of large random symmetric matrices. Jamal then took us through the proof of the Marcenko-Pastur which provides the asymptotic density of eigenvalues for large covariance matrices. This was really a tour de force since the proof is rather complex, but Najim highlighted its architecture (a subtle mix of probability and algebra, glued with the Stieltjes transform) with great pedagogy.

# First day at Peyresq

First day : High dimensional signal analysis

Today was Stéphane Mallat’s day at Peyresq. Stéphane gave 4 hours on the topic of high-dimensional signal analysis, but his main focus was really to try to give mathematical and intuitive insights into the flabbergasting successes of deep neural networks. First Mallat gave a large but clear panorama of learning theory, covering supervised and unsupervised techniques, SVM and kernel methods. He clearly distinghuished cases where data live on small co-dimension domain of the data space, a property that can be exploited by manifold learning for instance, and problems where we are in high dimensions naturally. In the latter case, one can sometimes fall back on simpler methods when the signal to be learned is separable – the problem reduces to a series of low-dimensional problems. But often, there is no such simplifying assumption. He gave the example of the many body problems in gravitation where masses interact with each other in complex ways.

The problem in high dimensions is that we suffer the so called curse of dimensionality. Although there are many ways to understand it, the most intuitive is as follows: suppose you are to cover the domain on which your data lives with example points, so that it becomes easy to locally interpolate new, unseen, examples. The number of points you need to maintain a small distance between examples grows exponentially with the dimension of the data space. And quickly, when we are dealing with high-dimensional data, things become untractable. One solution of course is if your data lives on small dimensional subspace: the density of points can be sufficient to locally interpolate. But with no other assumption, it is not feasible to reduce the dimension. What do we do in this case ? If we were to try nearest neighbor interpolation, all neighbors of a given point would simply be too far away; this is the curse.

Mallat then explained the basic single layer neural net, showing that it is similar to approximating your data with a dictionary composed of ridge functions. A ridge function is a non-linear function applied to a simple weighted average of the data points. This shows that trying to use a single layer net basically reduces to approximation theory and one can therefore leverage this analogy to obtain a fundamental bound on learning efficiency. Essentially $\| f - f_M\| \leq c M^{-\alpha/d}$ wherre $\alpha$ controls the regularity of the function you’re trying to regress and $d$ is the dimension. You immediately see that if d is big, this result becomes meaningless…

Enter deep neural networks. In a deep net, you basically cascade the two basic ingredients of a single layer net: a linear combination of input data with weights to be learnt, composed with a non-linearity. No need to remind you here of the mind-blowing success of deep nets. The question is: why does it work so well ? The question of course is important if we also want to understand if it can fail, or if we can hope to make it even better.

Stéphane’s first ingredient is to model layers of a neural net as the composition of two contraction operators (a simple averaging operator and a non-linearity) whose ultimate goal is to reduce the volume of data. This is important: the goal is not to reduce dimensionality, but to reduce data volume, compressing points closer so that their density becomes useful for local interpolation while maintaining a sufficient margin between classes: our problem would then be to interpolate a regular function on a high-density set of points – easy !!! The idea of contraction therefore makes a lot of sense.

Stéphane’s second ingredient is to quotient out variability in the data by using representations that are invariant. It is easy to craft invariance to translation and to fixed transformation groups by means of (generalized) convolutions with filters (this is an averaging operation, it enforces invariance) and taking moduli (think of removing the phase of the Fourier transform). Let us explain this more intuitively : if we were to try to be invariant to translations only, taking the modulus of the Fourier transform would do. If however you want to be almost invariant to small deformation (diffeomorphisms) Fourier will not work : even a small diffeomorphism would generate significant perturbations at higher frequencies. Are we doomed ? Not really. Wavelets are precisely shaped to counter the increase of those perturbations at high frequencies since their frequency support gets higher with frequency. It does make a lot of sense then to use the modulus of wavelet coefficients to be invariant to both translations and small diffeomorphisms. A scattering transform is the object you get when you cascade these contracting operators.

The scattering transform is a very interesting way to explore the properties of deep networks, because you can actually understand (even if intuitively) why things work. Stéphane then used the scattering transform to tackly high dimensional classification problems. Notice that, in his construction and in stark contrast to deep nets, you don’t train anything: you use your knowledge of invariance to construct the scattering transform and then apply a simple classifier to the scattering coefficients.

This is all great when you know what transformations you want to be invariant to. But what happens if you look at categories of images ? There is no geometric regularity among images of beavers for instance. How do you maintain invariance in this case. This will be the topic of our next post🙂

# Harmonic analysis on graphs and networks

Tags

I am preparing material for the upcoming Peyresq summer school on signal processing in high dimensions. My lectures will cover the emerging field of signal processing on graphs, following a rather classical approach based on harmonic analysis. By harmonic analysis on graphs or networks, I really mean building on the spectral theory of the graph Laplacian to construct a coherent data processing framework. It allows us to understand localisation, smoothness, to filter and construct multiscale transforms (i.e wavelets), all this on generic graphs and networks therefore generalising many familiar concepts from signal processing. These tools can then be used for applications like de-noising, missing data imputation, machine learning, building recommender systems etc …

Here is in more details what I plan to cover, in five lectures:

1. I introduce the graph Laplacian $\mathcal{L}$, it’s spectral theory and the associated Borel functional calculus. I’ll quickly discuss other discrete differential operators (gradient, divergence). These can be viewed as linear operators acting on signals defined at the vertices of the graph. I use them to characterise the smoothness of signals and explain how spectral graph theory is used for clustering, finding embeddings (Laplacian eigenmaps). I conclude by designing a de-noising framework based on Tikhonov regularisation that hints at the idea of graph filters defined on the spectral domain.
2. In the second lecture I define more formally spectral kernels and how they are used to construct a kernel localisation operator  that acts like a convolution. I detail the fundamental limits on the localisation one can obtain based on the smoothness of the generating kernel.
3. In the third lecture I describe the construction of wavelets on graphs and provide an efficient algorithm to implement the wavelet transform. I highlight some applications to machine learning (transductive learning), recommender systems but also to processing point clouds. I also introduce a Gabor-like transform and discuss its use in constructing an uncertainty principle.
4. The fourth lecture is devoted to a multiscale framework unifying wavelets, interpolation on graphs with operations such as coarsening and sparsification .
5. The fifth lecture is devoted to results linking spectral graph theory to the spectral theory of the Laplace-Beltrami operator and why this is interesting in manifold learning for instance.