Wednesday, April 17, 2013

ICA

The motivation behind developing ICA can be illustrated by a classic problem -- the "Cocktail Party" problem. The basic idea is that you are at a big party and you hear the voices of many different people talking. Your ears are picking up the sum of all of these people's voices and your brain has the task of separating out each of the voices. Each ear receives a slightly different signal -- say your left ear is closer to speaker 1, and the other ear is closer to speaker 2. The left ear would receive a stronger signal from speaker 1 and a weaker signal from speaker 2. Mathematically each ear is just receiving a weighted sum of all of the signals.
x = M*s
So x is what your ears pick up, s contains the signals from the speakers at the party, and M describes how the signals from each speaker get mixed together to make the observed signals. M is the "Mixing Matrix".
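
Here is a minimal sketch of that mixing model in Python (NumPy), assuming two sources and two "ears"; the specific signals and mixing weights are illustrative choices, not taken from any particular dataset.

```python
import numpy as np

t = np.linspace(0, 1, 1000)

# Two independent source signals ("speakers")
s = np.vstack([np.sin(2 * np.pi * 5 * t),            # speaker 1: sine
               np.sign(np.sin(2 * np.pi * 3 * t))])   # speaker 2: square wave

# Mixing matrix: each row is one ear's weighted sum of the speakers
M = np.array([[0.8, 0.3],    # left ear: closer to speaker 1
              [0.2, 0.7]])   # right ear: closer to speaker 2

x = M @ s   # observed signals, shape (2, 1000): what the ears pick up
```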

The assumption behind the problem we are trying to solve is that the signals we observe are actually mixtures of signals that are independent -- the speakers talking at the party are not influencing each other. So how do we pull the independent signals out of the observed data x? And what exactly do we mean by independence?

There is a strict probabilistic definition of independence -- namely p(a,b) = p(a)*p(b). These p's describe the full probability distributions of the signals, and independence means that the chance of observing a and b together is equal to the product of the chances of observing each of them separately. If this holds across the entire probability distribution, then the signals are independent. However, this definition is almost useless for any realistic application, because it would require accurately estimating the complete distributions and the complete joint distribution of the inputs.
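
To see why this is impractical, here is a rough sketch that checks p(a,b) = p(a)*p(b) using crude histogram estimates; the bin count and sample size are arbitrary assumptions, and even in this easy two-signal case the estimated joint only factors approximately.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.uniform(-1, 1, 100_000)
b = rng.uniform(-1, 1, 100_000)   # independent of a by construction

# Estimate the joint and marginal densities with histograms
joint, xedges, yedges = np.histogram2d(a, b, bins=20, density=True)
p_a, _ = np.histogram(a, bins=xedges, density=True)
p_b, _ = np.histogram(b, bins=yedges, density=True)

# For independent signals the joint should equal the product of the marginals;
# the deviation shrinks only slowly as you add more data.
print(np.max(np.abs(joint - np.outer(p_a, p_b))))
```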

Correlation is something that you can easily calculate in practice, but no correlation does not necessarily imply independence. Independence, however, does imply no correlation. A sine wave and a cosine wave plotted against each other trace out a circle -- data in the shape of a circle are clearly uncorrelated. However, these data are obviously not independent.
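
A quick numerical check of that example, assuming NumPy: the correlation between the sine and the cosine is essentially zero, yet one signal completely determines the other (up to sign), so they cannot be independent.

```python
import numpy as np

t = np.linspace(0, 2 * np.pi, 10_000)
a, b = np.sin(t), np.cos(t)

print(np.corrcoef(a, b)[0, 1])        # ~0: uncorrelated
print(np.allclose(a**2 + b**2, 1))    # True: a deterministic relationship
```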

A big intuition about how to determine independence comes from the Central Limit Theorem, one of the most fundamental ideas in probability and statistics. The central limit theorem states that the sum of many independent random events will converge towards a Gaussian. It doesn't matter what the probability distribution of the individual events is -- it can be uniform, binary, etc. -- adding many independent elements together will produce a Normal distribution. A nice example is rolling dice. If you roll one die, there is a uniform probability that it will land on one of the numbers 1-6. If you roll two dice, the sum no longer has a uniform probability -- the sum is most likely to be 7, and least likely to be 2 or 12. As you add more and more dice the probability distribution gets closer and closer to looking like a bell curve -- a Normal/Gaussian distribution. The Central Limit Theorem is why the Normal distribution is called "normal" -- we see these distributions all over the place in the real world because independent events, when summed together, take this shape.
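
A small simulation of the dice example (the sample size and the numbers of dice are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# One die: every face is (roughly) equally likely
one = rng.integers(1, 7, size=200_000)
print(np.bincount(one)[1:] / one.size)          # ~[0.167] * 6, flat

# Two dice: the sum peaks at 7 and falls off toward 2 and 12
two = rng.integers(1, 7, size=(200_000, 2)).sum(axis=1)
print(np.bincount(two)[2:] / two.size)          # triangular, peak at 7

# Ten dice: the sum is already close to a bell curve centred near 35
ten = rng.integers(1, 7, size=(200_000, 10)).sum(axis=1)
print(np.bincount(ten).argmax())                # ~35
```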

For ICA we assume that several independent components are being mixed together to create our observed signals. The Central Limit Theorem tells us that mixing independent components produces something that is more Gaussian than the original components. The goal of ICA is therefore to find a projection of the mixed signals that is the LEAST Gaussian -- the most non-gaussian. ICA becomes an optimization problem in which we search for the projection of the data that gives us the most non-gaussian components.
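
In practice the optimization is usually handed to an existing implementation. As a minimal usage sketch (one option among several), scikit-learn's FastICA can be run on the mixed signals x from the mixing sketch above, assuming that code has been run; it searches for maximally non-gaussian components in the spirit described here.

```python
from sklearn.decomposition import FastICA

ica = FastICA(n_components=2, random_state=0)
s_est = ica.fit_transform(x.T)   # estimated sources, shape (n_samples, 2)
M_est = ica.mixing_              # estimated mixing matrix
# Note: ICA only recovers the sources up to permutation and scaling.
```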

In order to solve the optimization problem, we must define what we mean by non-gaussian -- how do we measure non-gaussianity? There are several possible ways to estimate the gaussianity of data, each with different trade-offs: estimators can be optimal but require a lot of data, or they can be efficient but biased. The field is not fully settled on the "best" way of estimating non-gaussianity, but there are a few approaches which stem from probability theory and information theory. In practice these different metrics are quite similar, and probability theory and information theory are mathematically linked -- several studies have shown that estimators derived from probability theory are identical to some estimators derived through information theory. There is in fact an underlying mathematical structure that links both fields.

One of the most prominent and conceptually simple ways of measuring non-gaussianity is to use the "kurtosis". The kurtosis is related to the 4th-order moment of a probability distribution (the first moment is the mean, the second is the variance, and the third is known as skewness). Roughly speaking, kurtosis measures the peakiness of the distribution. There are infinitely many higher moments, all defined by raising the data to different powers and taking the expectation (the nth moment is E(x^n), where E is the expectation operator). Kurtosis is technically the "excess" of the 4th moment relative to a Gaussian: a Gaussian distribution has a 4th moment of 3*sigma^4 (where sigma is the standard deviation). The excess kurtosis is defined for data normalized to unit variance, so you just subtract out the value that a Gaussian with the same variance would have -- K = E(x^4) - 3.
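
A minimal hand-rolled check of that formula, assuming NumPy; the sample size is arbitrary and the printed values are approximate because of sampling noise, but a uniform distribution comes out sub-gaussian (negative excess kurtosis), a Laplace distribution super-gaussian (positive), and a Gaussian close to zero.

```python
import numpy as np

rng = np.random.default_rng(0)

def excess_kurtosis(x):
    x = (x - x.mean()) / x.std()   # normalize to zero mean, unit variance
    return np.mean(x**4) - 3       # subtract the Gaussian value of 3

print(excess_kurtosis(rng.normal(size=100_000)))      # ~0    (Gaussian)
print(excess_kurtosis(rng.uniform(-1, 1, 100_000)))   # ~-1.2 (sub-gaussian)
print(excess_kurtosis(rng.laplace(size=100_000)))     # ~3    (super-gaussian)
```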


