# The Immobile Migrants

[File this post under awareness – I am not asking anything else]

If you’re a parent, you’ll agree with me that it is hard to imagine worse than a lack of present and future for your kids. No little friends coming at home. No birthday parties invitations. No school play. No soccer matches on Sundays. No graduation ceremony. No first girlfriend. No first job and the joy (sweet agony!) of seeing your loved one grow into an independent adult, ready to play their part.

Yesterday, we had a meeting at our son’s school during which his future was taken away from him and from us. My 6 years old son is on the autism spectrum. My beautiful, funny, mischievous son is about to enter limbos unless his mother and I fight violently enough to give him a chance at a decent life DESPITE public authorities. Here is what happened. Our son spent two years in a private school, inside a regular class. We funded a 100% in class assistance to help him focus and interact with the teacher and other kids. This was part of a whole network of therapists around him. He made a lot of progress and followed the cursus, learning to count, do basic operations, read and write (though his writing is not good because he does not have a good grip – working on it). After these two years, kids enter a new cycle and we had to decide wether or not to stay in the private system or move to the public one. Costs aside, the public system seemed a good choice because they have a mission of being inclusive with neuro-atypical kids and they have a privileged access to the public health system. On paper it looked excellent and our son would join his older sister at her school. We met with the school’s dean, state school inspector and part of our self-established network. We, and the people taking care of him, described our kid and his challenges. The school told us they had some experience and even had a teacher who volunteered. It was almost too good to be true.

Here is a story that keeps repeating itself over and over: the public authorities set up a policy, politicians will communicate how wonderful it is. But in practice no means will be invested. In practice, the school has no experience, no dedicated staff, no program. They were thinking that they would just welcome our son and he would simply have to learn to behave like a “normal” kid. If you have a child on the spectrum you see where this is going, don’t you ? We did not see it coming. We had prepared him to school over the summer, visited the place. We set up a schedule. He happily started the year and we thought everything was alright. Unfortunately it wasn’t. It started with his older sister (9 years old) bursting into tears at home, explaining us that he is always left on his own during breaks. Some teachers forbade her to play with her brother because he “had to learn to play with others”. Then he had a full meltdown at school. If you have a kid on the spectrum, you know what I am talking about and I won’t describe it here. We were called to the school and discussed with the teacher and the school dean. What unfolded was pure disaster.

Except a few hours per week (!!!) where an assistant (with no qualification whatsoever) helped him, nothing was done to help our son integrate smoothly. No schedule, no program, no hint at a method. He was just supposed to adapt all by himself. Of course it did not work, but the school did not tell us for a full 3 weeks. During breaks he was on his own with the noise from 200 kids around. The teacher thought it is not a good idea if he uses his headphones. The school felt it was fine because “he seemed happy just walking around a bench”. In fact he was just trying to manage his anxiety until it was too much and he melted down.

In class, with no specific support he regressed. They quickly stopped trying to work with him and let him spend most of his time drawing, instead of the exercices he was perfectly capable of 6 months ago. They described him as a disturbance for the other kids because he was noisy.  Of course, he was left on his own with no instruction and support: typically a kid with ASD freaks out and does typical routines to re-assure himself in this situation. He had to completely lose it for the school to finally tell us something was wrong. At home he was getting back to his known universe and was ok, until the meltdown – it took a few days to reassure him.

When we met with the teacher and dean and discovered what had happened for 3 weeks, we were in shock. But the worse was to hear our boy being described as a disturbance. They were not asking for help or input. They were telling us to take him back and place him in a specialised institution. Gone were the good will and smiles of the politicians, the grand talks of the school being open.

We have since called and met various specialists. Did we do something wrong? Did we misunderstand that the best place for our son was a regular school ? Did we misunderstand that the school ought to put a program in place for him ? Why didn’t they do an extra effort at the beginning and then gradually decrease support ? Were the evaluations of his cognitive abilities completely wrong ?

I can only compare the life of autistic kids and their family to those migrants trying to reach a better future in Europe these days*. Every day and in every aspect of our son’s life, we face an ocean to cross. We hurl ourselves into these raging waters, with little or no support. We fail. We sink. We cling to whatever we can. And start again. And yet, if we manage to cross we are only met with contempt and exclusion. We are the immobile migrants. Ours is a fight through time and against time as we flee ignorance and segregation.

*Don’t get me wrong: I am not in any way comparing our situation to that of migrants in Europe. These people live through a considerably more dramatic situation. I can not even begin to imagine the courage it takes to take your family through war, deserts and the raging seas in hope of a better future. They are the true heroes of our time, and one day shall be recognised as such.

# Blog re-structured

Back on the interwebs. I have decided to reorganise my online life. As of today the technical side of me lives on my professional blog. This blog will therefore be dedicated to more personal stuff and opinions – yes indeed it is better to avoid mixing both worlds and my employer is not responsible for my musings here.

# Peyresq 2014: Last day

Tags

The 2014 edition of the Peyresq summer school just finished. Before the coda, let us summarize the very last talk of Jamal Najim on his series on large random matrices. Jamal’s talk was focused on applications of results he described us in the previous day. I will focus on the application to direction of arrival estimation (DOA), although Jamal also discussed MIMO communications.
In DOA the goal is to estimate the directions to r sources from n measurements that mix the sources linearly with phase factors and additive noise:
$\vec{y} = \sum_{\ell=1}^{r} \vec{a}(\varphi_\ell)s_\ell + \sigma \vec{w}$
The classical MUSIC algorithm uses the fact that one can obtain a characterization of the signal subspace from the measurements~:
$\vec{a}^* (\varphi) \bigl( I_N - \Pi_N \bigr) \vec{a} (\varphi) = 0 \Leftrightarrow \varphi \in \{\varphi_1, ..., \varphi_r\}$
where $\Pi_N$ is the ortho projection on the span of the $\vec{a}(\varphi_\ell$. When the sources are drawn at random, this argument is replaced by an estimation using the ortho projection on the eigenvectors of the measurements covariant matrix since~:
$\frac{1}{n} \mathbb{E}\{Y_N Y_N^*\} = A_N(\vec{\varphi}) \mathbb{E}\{ \frac{S_N S_N^*}{n}\} A^*_N(\vec{\varphi}) + \sigma^2 I_N$.
This naturally leads to replacing $\Pi_N$ with the projection on the first r eigenvectors of the sample covariance matrix. Except that this is not a super good idea: modeling the sources by complex gaussian variables, one obtains a multiplicative spiked model with a rank r perturbation for the observations and we have learned with Jamal that the sample covariance matrix is not a good estimator of the covariance in this case. Fortunately the theory of large random matrices tells us how we can construct a consistent estimator which turns out to involve if a simple de-biasing correction. Jamal then showed with simulations how and when this method improves on the original MUSIC.

This was the conclusion of an absolutely marvelous week at Peyresq. I could not seriously express with words how much I enjoyed the week: I learned a lot of new things, discussed with many, got constructive feedback on my own work and this in turn generated many ideas I now want to work on. Every workshop should be like this instead of gigantic waste of time, energy and money (a whole week at Peyresq will set you back much less than the registration at those big conferences).

There is always a bit of saudade when such an excellent week ends. But here is an enormous thank you to :
. Patrick and Cédric who organized the event and made sure it was nothing short of perfect
. The Peyresq local team who makes it such a wonderful place to stay at
. All participants, speakers and students: you guys stay cool and keep up the good work!

# Fifth day at Peyresq, continued

Tags

This post is courtesy of Pierre Borgnat, a huge thanks !

And what about Pierre Vandergheynst’s lectures ? Spread in three sessions, Pierre discussed about the recent advances in Harmonic Analysis on Graphs an Networks. The plan of the lectures was already given on the blog and I will cover a few salient points. In short, this was a marvellous series of lectures, especially given the necessary flexibility that Pierre displayed, when continuing his lecture on the black board due an electricity outage — then improvise a follow-up with both blackboard and a slides due to popular insistence!

First session on the second day, Pierre covered all basics of Spectral Graph Theory and all participants had a refresh about Laplacian matrices, spectrum of a graph, spectral cuts, spectral clustering and graph embedding using Laplacian eigenmaps. A nice example was shown for matching points clouds of a warped surface, for instance the shape of a map in point clouds — unfortunately, this would not work well if someone with his hand on the top of his head
Then, attacking the main core of the lectures, Pierre defined operators to study signals indexed on graphs: gradients, regularity, total variation. A central point is the intriguing and important notion coherence of a graph (the largest absolute value of the components of the eigenvectors of the Laplacian) — more on that later on.
Using functional calculus on graphs, it’s possible to define operators through the spectral domain so as, for instance, solve diffusion equation on a graph. The good thing is that it is now possible to give meaning to what is a filtering of graph signals (with examples of denoising or recommendation system). End of the first session.

Third day, second lecture: no electricity… And so what ? Pierre improvised on the blackboard. In the lectures, he showed that it is possible to define first operators to localise a kernel on a graphs and then how it can be interpreted as a translation so as to propose a generalisation of the convolution operator. Still, the “agonizing limit of intuition” limits some of the properties: the translation operator is not unitary, when its coherence departs from 1/\sqrt{N}. That might problematic in specific situations, or will that be an asset to study the local topology of the graphs? (we would be salved when studying graphs on manifolds or accepting a group of translation — these are developments not covered here). With all that, it was direct to define the spectral graph wavelets that provide a continuous and invertible transform using function localized both in frequency and vertices. The good points is that it’s possible to obtain a frame, even a tight frame; that they have nice localisation properties; and finally that implementation can be efficient using Chebyshev polynomials (“love Chebyshev !”). Better, a python toolbox is soon enough available… (the SGWT toolbox in matlab already exists).

Last day at Peyresq: two more hours par Pierre in which we learned how to put coarsen both signals and graphs by putting wavelets in a pyramidal algorithm. Fist ingredient: downsample the graphs using the nodal domains of the (or one of the) eigenvectors with largest eigenvalues of the Laplacian. Second ingredient: filter the signals (easy, remember that we learnt about graph filters and wavelets). Third ingredient: coarsen the graph y keeping only half of the nodal domains, and transform the graph using the Kron reduction (that should activate your memories of basic electricity). A major property is that the Kron reduction algebraically written as the Schur complement of a matrix) preserves properties of the Laplacian and it is useful to compute spline-like interpolation on a signal. This allows us to see the difference between the filtered signal retained on the coarsened graph and the original signal on the original graph (having all the vertices). With all that, you have a sound Laplacian pyramid to filter and coarsen graph signals. To make all that more efficient, the sparsification of the coarsened graphs is useful, using the work of Spielman and Srivastava (second side message of the lecture: once you now all about Chebyshev polynomials, read this paper in SIAM J. Comp. of 2011). The conclusion is that many applications about filtering on graphs, learning on graphs, visualisation of graphs, and so on, just wait in our reach. At the end of the course, it was a very motivating series of lectures.

What was skipped in the lectures? Materials were given in the slides about manifold learning with graph spectral theory or vertex-frequency analysis (in the spirit of the time-frequency analysis or Gagor transform — yet on graphs). I advise all of you to read the nice paper (on ArXiv) by David Shuman, Benjamin Ricaud and P. V. on the latter subject, and to discuss with Pierre about the former one.

# Fifth day at Peyresq

Tags

This post is courtesy of Johan and Nathanaël – a big thanks !

Today Olivier Levêque made a very nice presentation about results of
large random matrix theory applied to Multi Input Multi Output (MIMO)
antenna systems and large random networks.

He started with a short introduction to information theory, especially
to capacity of a communication channel, on the blackboard, with style.
Olivier then showed a very powerful result in the field of MIMO
communications, which states that when the number $\latex n$ of antennas
grows we get a capacity or order $\latex n$, or

$C_n = n \int_0^4 \log ( 1 + Px) q(x) dx$.

This result is powerful since it means that theoretically the capacity
grows linarly with the number of antennas (which is a significant
improvement of the previously believed limiting order

$C = log(1 + Pn^2)$

for multi antennas systems).

Olivier also enphasized that we can achieve this capacity without
increasing the power P, but simply by distributing it in a clever way,
which can be done using water-filling techniques. He concluded the first
part of the lecture by reminding that, although this result is
demonstrated for n goes to infinity, it actually works already for small
values of n, which explains the joke common amongst larse random matrix
people which states : $8 = \infty$.

The second main part of Oliver’s talk dealed with large self-organized
wireless communication networks. The main question proposed to answer
was: assuming that n users want to communicate in a given area with an
O(1) density, what is the rate $R_n$ they can hope to achieve ?
And in relation to this statement, how does the rate scale with n ?

In a setup where we wish to establish n communication channels, we use a
model where the signals that are received depend on the gain between two
nodes divided by the distance, with an additive white gaussian noise.
Note that the gain is i.i.d. and follows a complex gaussian
distribution. Finally the power of emission per user is limited and has
a given value P. For this model we would like to achieve $R_n = O(1)$

Following this model, a very naive scheme is to dispatch the n different
pairwise communication instances into n different time slots (or
equivalently n different frequency bands). In this case we achieve
$R_n = O(\frac{1}{n})$ which is really bad for n large.

The second approach uses multi-hop communication. And in this case,
Olivier explained that due to an overload of the users in the center of
the network the limiting rate is $R_n = O(\frac{1}{\sqrt{n}}$,
which is better but still not good.

Finally, Oliver presented a scheme coined Distributed MIMO
communications which proceeds in an iterative way. At each stage, a user
broadcasts its message to his neighborhood, which will then send the
message to the neighborhood of the destination user, which receives all
the messages from its own neighbors. We can then recursively use this
scheme for the communication within the neighborhood. In this case,
Oliver showed that the limiting rate for large n with k recursions is
$R_n = O(n^{\frac{-1}{k+1}})$ which tends to $O(1)$ when
k goes to infinity.

Oliver concluded his lecture by explaining a controversy about this
model, which states that considering gains to be i.i.d. is in fact
unrealistic. Some authors indeed claims that for this setup we can only
hope to have $O(2n)$ i.i.d. random variables (instead of the
$O(n^2)$ that are needed to prove the limits presented above. And
this would bring down the rate back to $O(\frac{1}{\sqrt{n}})$.
The final word provided a way to avoid this problem simply by stating
that practical networks tend to have a lower density, and that we can
achieve interesting rates given that the area is large enough.

# Fourth day at Peyresq

Tags

This morning we first had a talk by Pierre Borgnat on the use of data analytics for bike renting systems, which gives you access to various rhythms of a city.

We then all went for a super nice trekking around Peyresq, in pristine weather conditions: talk about perfection …

In the afternoon Pierre Borgnat gave a more technical two hours lecture on multiscale community detection and dynamic graphs. Pierre first convinced us that spectral clustering is not necessarily very good for community detection because there is no measure of partition quality and no reference to a null hypothesis (no community). This is why modularity was introduced. Modularity measures how much your graph departs from a null hypothesis where your graph is random with the same degrees. Optimizing modularity is difficult (NP complete?) but there exists a greedy algorithm (the Louvain method, what else?) that solves it, approximately, in polynomial time. Pierre then set on a challenging agenda. He first discussed cases where one really needs a multiscale definition of communities. In fact this is quite often the case since they tend to agglomerate at various scales and allow for rich interpretation. Borgnat used a wavelet transform transform on graphs to give an original and intuitive definition of what multiscale communities are. One of the niceties is that the method comes with a fast algorithm that scales up to fairly large graphs. Another important point stressed by Pierre is the definition of a stability criterion that allows to judge partitions. Pierre showed various applications and a second talk by Rasha Boulos even used the technique in genomics in a very interesting way. In the second part of his talk, Pierre discussed dynamical graph models. This is a very emerging domain where a lot remains to be done. Pierre wet our appetite with interesting ideas that allow to detect and track patterns in dynamical applications.

# Interlude

Tags

Among the great things at the Peyresq summer school is the wonderful atmosphere. Of course, staying in a stunningly beautiful place helps. Here is the french community packed in front the village’s only TV set, agonisingly waiting for France to score against Ecuador.

Yesterday morning was also fun thanks to a power outage that left us without beamer. I improvised my talk on the blackboard, with a little help from Patrick and Olivier.

# Third day at Peyresq

Tags

I am getting a little behind schedule, so let us start by summarizing the last talk given yesterday by Jean-François Cardoso.

Big Bang, Big Data

Jean-François gave a wonderful introduction to the Planck mission. For those who don’t know Planck, it is in my opinion one of the most amazing feat of human intelligence in this century. Planck is an ESA satellite measuring with unprecendented precision the Cosmic Microwave Background (CMB). The CMB is in some sense the very first picture of the universe: right after the big bang, the universe was a very hot plasma of charged particles in which light could not travel freely – any photon was constantly scattered by interactions with these charged particles. But roughly after 380000 years, because of expansion, the universe cooled down to a critical temperature at which electrons and nuclei can bind into neutral atoms. At that point, photons were allowed to travel uninjured and we are still receiving them today in the form of a 3K microwave radition that was first detected by Penzas and Wilson in the 60s (although it had been predicted by Gamow I believe). This CMB is an amazingly important object because it carries a lot of information about the universe in its early stage, but also about what happened afterwards. Indeed, Jean-François explained that tiny temperature fluctuations (of only 10^-5 relative amplitude) in the CMB are related to little density inhomogeneities in the plasma that are responsible, by gravitational effects, for the formation of galaxies and the structure of the universe as we know it. Simply put, this is our cosmic DNA. There are many challenges in processing the CMB as explained by Cardoso. Jean-François focused on his work that aimed at using the various frequency bands of the observations to separate the CMB from various pesky foreground contamination. Jean-François used a source separation technique that works so well it is now the de facto technique used to create CMB maps. At the heart of the technique is the use of Fourier techniques to exploit the knowledge of the CMB power spectral density and separate it from contamination.

# Second day at Peyresq

Second day: follow-up on Mallat’s talk

Let us start by summarizing, very schematically, the end of Stéphane’s talk. The idea is now that our classes do not feature simple geometric invariances (think of images of beavers). This is where things get tricky but potentially even more interesting. First, taking convolutions along geometric deformations will be simplified by switching to a Haar wavelet transform. In the Haar transform, you only need to compute averages and differences between pairs of pixels – they are taken aligned on the signal grid in the classical transform. For Haar scattering, this is generalized by allowing to apply these operations to any pair of points. The question is now how do we choose those pairs ? This seems combinatorially hopeless … Stéphane’s argument is that this pairing can be learned by imposing that we reduce space volume while maintaining inter class margins. Interestingly, but this is out of the scope of this summary, this can be turned into a convex optimization problem. Comparing the scattering transform to some of the most recent deep nets shows that with only two layers performances are comparable but with much less training complexity. The most sophisticated deep nets then add extra layers that are not exactly similar to the first ones and it is not clear what these try to achieve.

Large random matrices

Jamal Najim gave a wonderful first course on large random matrices. This was an introduction to celebrated results, like Wigner’s semi-circle law that controls the density of eigenvalues of large random symmetric matrices. Jamal then took us through the proof of the Marcenko-Pastur which provides the asymptotic density of eigenvalues for large covariance matrices. This was really a tour de force since the proof is rather complex, but Najim highlighted its architecture (a subtle mix of probability and algebra, glued with the Stieltjes transform) with great pedagogy.

# First day at Peyresq

First day : High dimensional signal analysis

Today was Stéphane Mallat’s day at Peyresq. Stéphane gave 4 hours on the topic of high-dimensional signal analysis, but his main focus was really to try to give mathematical and intuitive insights into the flabbergasting successes of deep neural networks. First Mallat gave a large but clear panorama of learning theory, covering supervised and unsupervised techniques, SVM and kernel methods. He clearly distinghuished cases where data live on small co-dimension domain of the data space, a property that can be exploited by manifold learning for instance, and problems where we are in high dimensions naturally. In the latter case, one can sometimes fall back on simpler methods when the signal to be learned is separable – the problem reduces to a series of low-dimensional problems. But often, there is no such simplifying assumption. He gave the example of the many body problems in gravitation where masses interact with each other in complex ways.

The problem in high dimensions is that we suffer the so called curse of dimensionality. Although there are many ways to understand it, the most intuitive is as follows: suppose you are to cover the domain on which your data lives with example points, so that it becomes easy to locally interpolate new, unseen, examples. The number of points you need to maintain a small distance between examples grows exponentially with the dimension of the data space. And quickly, when we are dealing with high-dimensional data, things become untractable. One solution of course is if your data lives on small dimensional subspace: the density of points can be sufficient to locally interpolate. But with no other assumption, it is not feasible to reduce the dimension. What do we do in this case ? If we were to try nearest neighbor interpolation, all neighbors of a given point would simply be too far away; this is the curse.

Mallat then explained the basic single layer neural net, showing that it is similar to approximating your data with a dictionary composed of ridge functions. A ridge function is a non-linear function applied to a simple weighted average of the data points. This shows that trying to use a single layer net basically reduces to approximation theory and one can therefore leverage this analogy to obtain a fundamental bound on learning efficiency. Essentially $\| f - f_M\| \leq c M^{-\alpha/d}$ wherre $\alpha$ controls the regularity of the function you’re trying to regress and $d$ is the dimension. You immediately see that if d is big, this result becomes meaningless…

Enter deep neural networks. In a deep net, you basically cascade the two basic ingredients of a single layer net: a linear combination of input data with weights to be learnt, composed with a non-linearity. No need to remind you here of the mind-blowing success of deep nets. The question is: why does it work so well ? The question of course is important if we also want to understand if it can fail, or if we can hope to make it even better.

Stéphane’s first ingredient is to model layers of a neural net as the composition of two contraction operators (a simple averaging operator and a non-linearity) whose ultimate goal is to reduce the volume of data. This is important: the goal is not to reduce dimensionality, but to reduce data volume, compressing points closer so that their density becomes useful for local interpolation while maintaining a sufficient margin between classes: our problem would then be to interpolate a regular function on a high-density set of points – easy !!! The idea of contraction therefore makes a lot of sense.

Stéphane’s second ingredient is to quotient out variability in the data by using representations that are invariant. It is easy to craft invariance to translation and to fixed transformation groups by means of (generalized) convolutions with filters (this is an averaging operation, it enforces invariance) and taking moduli (think of removing the phase of the Fourier transform). Let us explain this more intuitively : if we were to try to be invariant to translations only, taking the modulus of the Fourier transform would do. If however you want to be almost invariant to small deformation (diffeomorphisms) Fourier will not work : even a small diffeomorphism would generate significant perturbations at higher frequencies. Are we doomed ? Not really. Wavelets are precisely shaped to counter the increase of those perturbations at high frequencies since their frequency support gets higher with frequency. It does make a lot of sense then to use the modulus of wavelet coefficients to be invariant to both translations and small diffeomorphisms. A scattering transform is the object you get when you cascade these contracting operators.

The scattering transform is a very interesting way to explore the properties of deep networks, because you can actually understand (even if intuitively) why things work. Stéphane then used the scattering transform to tackly high dimensional classification problems. Notice that, in his construction and in stark contrast to deep nets, you don’t train anything: you use your knowledge of invariance to construct the scattering transform and then apply a simple classifier to the scattering coefficients.

This is all great when you know what transformations you want to be invariant to. But what happens if you look at categories of images ? There is no geometric regularity among images of beavers for instance. How do you maintain invariance in this case. This will be the topic of our next post🙂