The main difference between the predictive-coding model (PC) and the adaptive-resonance model (ART) is that PC is based on negative feedback of prediction errors, while ART is based on positive-feedback resonance. These are very different ideas about how feedback modulates responses, but they may be reconcilable. Each has support: PC is Bayesian, and it seems that feedback is negative overall; ART predicts bursting, and all of the pyramidal-cell synapses are excitatory.
So how might they be reconciled? The positive-feedback system in ART could turn out to be negative overall, since the responses are normalized. I'm still trying to reconcile the effects of positive feedback signals and how they might be used. Grossberg compares the top-down signals to attention, and the attention literature suggests that attentional modulation affects the gain of responses. So does this mean that activation of the apical tuft via top-down signals results in a scaling of the ultimate IO function? Do different neurons then receive different amounts of scaling depending on the feedback signals?
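A minimal sketch of that gain interpretation, assuming the top-down signal simply multiplies each neuron's input-output (IO) function (the sigmoid and the particular gain values are illustrative assumptions, not taken from either model):

```python
import numpy as np

def io_function(x):
    # a generic sigmoidal input-output function (illustrative choice)
    return 1.0 / (1.0 + np.exp(-x))

bottom_up = np.array([0.5, 1.0, 2.0, -0.5])     # hypothetical drive to 4 neurons
top_down_gain = np.array([1.0, 1.5, 0.8, 1.2])  # hypothetical attention/feedback gains

# top-down activation of the apical tuft scales the ultimate IO function,
# so different neurons receive different amounts of scaling
responses = top_down_gain * io_function(bottom_up)
print(responses)
```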
One way of looking at it is that the top-down layer (2nd layer) develops a normalized population code just like the first layer. The 2nd layer then sends back a normalized response vector to the 1st layer. If the top-down signal were a perfect match to the bottom-up signal, and these signals combined multiplicatively, then it would be as if you were squaring the first layer: the population would go from x to x^2 (after re-normalization). This means 2/3 of the neurons will be inhibited and 1/3 will be excited. That could lead to some weird effects at steady state, since the population code will change, which would then change the 2nd layer and further alter the first layer.
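A rough numerical sketch of the multiplicative case, assuming the population code is normalized to sum to one and the bottom-up code is drawn uniformly at random (an assumption; the exact split depends on the distribution). It squares the code, re-normalizes, and counts how many units end up above or below their original level:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(1000)
x = x / x.sum()                  # normalized first-layer population code

matched = x * x                  # perfect top-down match, applied multiplicatively
x_new = matched / matched.sum()  # re-normalize

excited = np.mean(x_new > x)     # fraction of neurons that increased
inhibited = np.mean(x_new < x)   # fraction that decreased
print(excited, inhibited)        # close to 1/3 excited, 2/3 inhibited for this distribution
```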
What if it were additive instead? The second layer sends back to the first layer the same normalized vector that the first layer was producing. This would be like multiplying the first layer by 2, which after re-normalization would leave the first layer in the same state. This seems better: the population code is maintained, and the well-predicted first layer doesn't change. It could also look like multiplication in the grand scheme.
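And the additive case, under the same assumptions: adding back the same normalized vector just doubles the code, so re-normalization recovers the original state.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(1000)
x = x / x.sum()                    # normalized first-layer population code

fed_back = x + x                   # perfect top-down match, applied additively
x_new = fed_back / fed_back.sum()  # re-normalize

print(np.allclose(x_new, x))       # True: the population code is unchanged
```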
Imagine that the goal of learning is for the top-down layer to send back to the first layer the same population vector. Differences between the population vectors would then drive some form of plasticity. If a neuron received more top-down input than bottom-up input, the top-down synapses should get weaker; if it received more bottom-up input than top-down, they should get stronger.
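A toy version of that learning rule might look like the following, for a single first-layer neuron with a simple delta-style update (the learning rate, the function name, and the sign convention are my assumptions, not taken from PC or ART):

```python
import numpy as np

def update_top_down_weights(w_td, r_top, x_bottom_up, lr=0.01):
    """Adjust top-down weights so the top-down input tracks the bottom-up input.

    w_td        : top-down synaptic weights onto one first-layer neuron
    r_top       : activity of the second-layer (top-down) population
    x_bottom_up : this neuron's bottom-up input
    """
    top_down_input = w_td @ r_top
    # more top-down than bottom-up -> mismatch is negative -> weights weaken;
    # more bottom-up than top-down -> mismatch is positive -> weights strengthen
    mismatch = x_bottom_up - top_down_input
    return w_td + lr * mismatch * r_top
```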
Layer 4 is then the data layer and layer 2/3 is a classification layer, and they interact as described (L2/3 is trying to predict L4). L2/3 is chunky: imagine L4 as data points in a high-dimensional space, with L2/3 providing the boundaries that classify those points. Sometimes different classification boundaries overlap; if there is no extra evidence, L2/3 goes through hypothesis testing, cycling through the different possible classifications. Higher or lateral inputs can favor one hypothesis over another.
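One crude way to picture that hypothesis testing, assuming the overlapping clusters get probability-like weights and L2/3 samples one classification at a time over successive cycles (the softmax and the bias term are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical evidence that an L4 data point lies near two cluster boundaries
evidence = np.array([1.1, 1.0, -2.0])  # clusters A, B, C
bias = np.array([0.0, 0.0, 0.0])       # higher-level or lateral input

def cycle(evidence, bias, n_steps=10):
    p = np.exp(evidence + bias)
    p = p / p.sum()                     # probability of each classification
    # with no extra evidence, L2/3 cycles between the plausible hypotheses
    return [rng.choice(len(p), p=p) for _ in range(n_steps)]

print(cycle(evidence, bias))                               # mostly samples clusters 0 and 1
print(cycle(evidence, bias + np.array([2.0, 0.0, 0.0])))   # extra input favors cluster 0
```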
Perhaps another layer is somehow a parameterization layer (maybe L6 or L5?). This layer describes the transformation of the classification back to the data, something like the principal-component scores of the clusters. Let's use language as the example, since it is a good hierarchical system. Say this part of cortex is classifying the word "grape". The data layer gets the input sounds, and one part of it is representing the "a" sound. That sound has a lot of variability: it can be said quickly or stretched out. L2/3 classifies it as "a", and L6 describes how quick or stretched the "a" sound is. This helps remap the classification back to the data and describes a parameterization of the classification.
L4 receives data and sets up a population vector that is equivalent to a point in high-dimensional space. L2/3 creates a cluster (more lateral connections, perhaps even binary activity). If the L4 point falls within two clusters, the clusters will turn on and off over time according to the probability of each. L6 then describes the data relative to the clustering, something like the principal components of each cluster. (I'm not sure this is necessarily L6, but a PCA-like parameterization of the clusters seems like it would be useful somewhere.)
There may not even need to be a separate layer. The cluster itself is like the 0th principal component (describing the mean of the data). So you could imagine some neurons in L2/3 calculating the 0th component, some calculating the first, and so on. L2/3 could just be the low-dimensional parameterization.
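A sketch of that parameterization, assuming each cluster stores its mean (the 0th component) and a principal direction, so a classified input can be described by a score along that direction, e.g. how stretched the "a" sound is (the synthetic data and the single component are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# hypothetical L4 data points for one cluster (e.g. many utterances of "a"),
# varying mainly along one direction such as quickness vs. stretchiness
base = rng.normal(size=(200, 10))
stretch = rng.normal(size=(200, 1))
direction = rng.normal(size=(1, 10))
data = 0.1 * base + stretch @ direction

mean = data.mean(axis=0)                      # 0th component: the cluster itself
u, s, vt = np.linalg.svd(data - mean, full_matrices=False)
pc1 = vt[0]                                   # first principal component of the cluster

new_point = data[0]                           # a new L4 point classified into this cluster
score = (new_point - mean) @ pc1              # its "stretchiness" along the cluster's PC
reconstruction = mean + score * pc1           # remap the classification back toward the data
print(score)
```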