Tuesday, November 5, 2013

network, milliseconds

The current state-of-the-art AI is based on "deep belief networks". These are neural models based on brain-inspired ideas, and they are now beginning to be used effectively for AI. Part of the reason seems to be just some learning tricks (like not minimizing the error function completely), and part is that computing power has become sufficient to make large and deep networks.
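
One reading of "not minimizing the error function completely" is early stopping: halt training when performance on held-out data stops improving. Here is a minimal sketch in Python; train_step and validation_loss are hypothetical stand-ins for whatever model and data you have, not any particular library's API.

```
def train_with_early_stopping(train_step, validation_loss, max_epochs=100, patience=5):
    """Stop training before the training error is driven all the way down."""
    best_val = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_step()                      # one pass of gradient descent on the training set
        val = validation_loss()           # error on held-out data
        if val < best_val:
            best_val = val
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break                         # quit before the training error bottoms out
    return best_val
```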

One of the big components is "max pooling", which is modeled on complex cells in visual cortex. The idea is that the simple cells respond to a convolutional filter -- like the Gabor set of V1, or the independent components (ICs). The complex cells then listen to a range of simple cells and respond as the maximum of their inputs.
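
As a toy NumPy sketch of that simple/complex pairing (the Gabor parameters and the positions are just illustrative, not a claim about any particular model):

```
import numpy as np

def gabor(size=7, theta=0.0, freq=0.25, sigma=2.0):
    """A Gabor patch: the classic model of a V1 simple cell's receptive field."""
    ax = np.arange(size) - size // 2
    x, y = np.meshgrid(ax, ax)
    xr = x * np.cos(theta) + y * np.sin(theta)
    return np.exp(-(x**2 + y**2) / (2 * sigma**2)) * np.cos(2 * np.pi * freq * xr)

def simple_cell(image_patch, filt):
    """Simple cell: a half-rectified dot product with its filter."""
    return max(float(np.sum(image_patch * filt)), 0.0)

def complex_cell(image, filt, positions):
    """Complex cell: the max over simple cells tuned alike but at shifted positions."""
    size = filt.shape[0]
    responses = [simple_cell(image[r:r+size, c:c+size], filt) for r, c in positions]
    return max(responses)

# toy usage: one oriented filter, simple cells at a 2x2 grid of nearby positions
image = np.random.default_rng(0).standard_normal((16, 16))
filt = gabor(theta=np.pi / 4)
print(complex_cell(image, filt, positions=[(0, 0), (0, 2), (2, 0), (2, 2)]))
```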

Then they build hierarchies of these networks: convolutional filter, max pooling, convolutional filter, max pooling. This builds a hierarchical component decomposition of the stimulus, which allows for local invariance of the features at different feature levels. With these types of networks they can now achieve human-level performance on certain types of visual recognition tasks.
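
Roughly, the alternation looks like this -- again a toy NumPy sketch, with random filters standing in for learned ones:

```
import numpy as np

def conv(image, filt):
    """Valid 2D convolution (really cross-correlation) of one filter over one map."""
    fh, fw = filt.shape
    out = np.zeros((image.shape[0] - fh + 1, image.shape[1] - fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+fh, j:j+fw] * filt)
    return out

def max_pool(x, p=2):
    """Non-overlapping p x p max pooling."""
    h, w = x.shape[0] // p, x.shape[1] // p
    return x[:h*p, :w*p].reshape(h, p, w, p).max(axis=(1, 3))

def hierarchy(image, filter_banks, pool=2):
    """Alternate filtering and pooling: each level re-decomposes the level below."""
    maps = [image]
    for bank in filter_banks:
        filtered = [np.maximum(conv(m, f), 0) for m in maps for f in bank]  # rectify
        maps = [max_pool(m, pool) for m in filtered]                        # local invariance
    return maps

# toy usage: two levels of 8 random 3x3 filters over a 32x32 "image"
rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32))
banks = [[rng.standard_normal((3, 3)) for _ in range(8)] for _ in range(2)]
features = hierarchy(image, banks)
print(len(features), features[0].shape)
```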

These advances are brilliant, and I think they are very close to part of the way the brain works. These types of deep belief networks don't have any feedback loops, but clearly the brain is full of them. What these networks seem to be is the "feed-forward" pathway of the cortical hierarchy. Well, not really the cortical hierarchy, but a hierarchy that they learned via some derived learning rule. Essentially, their machine is like a human that is really well trained at recognizing a set of objects. Except the task isn't like a human staring at a picture; rather, it is like the image flashes on the screen for 100 ms and you have to respond as fast as possible. I imagine that with a lot of training a human could do this task as effectively as the deep belief nets.

I think this hits on an essential aspect of what the cortical hierarchy is doing. It is forming a component decomposition of the incoming sensory space. Now, the complex cells may be doing something kind of like max pooling, but there could be much more sophisticated ways of pooling. If we imagine the visual hierarchy, then thalamic inputs are sending sensory signals into L4 of V1. Here, the basic "pixels" of the sensory world are being represented in an overcomplete feature space. This is analogous to the first simple-cell layer of these deep belief networks.
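
A cartoon of what "overcomplete" means here: a small patch gets described by more filter coefficients than it has pixels. The Gabor bank below is purely illustrative, not a claim about the actual V1 code.

```
import numpy as np

def gabor(size, theta, phase, freq=0.25, sigma=2.0):
    """An oriented Gabor filter -- a stand-in for a V1 simple-cell receptive field."""
    ax = np.arange(size) - size // 2
    x, y = np.meshgrid(ax, ax)
    xr = x * np.cos(theta) + y * np.sin(theta)
    return np.exp(-(x**2 + y**2) / (2 * sigma**2)) * np.cos(2 * np.pi * freq * xr + phase)

def overcomplete_code(patch, n_orientations=16):
    """Describe one patch with more coefficients than it has pixels."""
    size = patch.shape[0]
    thetas = np.linspace(0, np.pi, n_orientations, endpoint=False)
    bank = [gabor(size, t, p) for t in thetas for p in (0.0, np.pi / 2)]
    return np.array([np.sum(patch * f) for f in bank])

# a 5x5 patch (25 pixels) becomes 32 filter coefficients: an overcomplete code
patch = np.random.default_rng(1).standard_normal((5, 5))
print(patch.size, "pixels ->", overcomplete_code(patch).size, "coefficients")
```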

In cortex, L4 seems to mainly project to L2/3. L2/3 is characterized by many recurrent connections, is close to the top-down input, and is thought to be where the complex cells are. I'm not a fan of max pooling as what L2/3 is doing per se. I think that max pooling is just something that seems to happen, both because of the way we study it and because L2/3 is doing something a little bit related. I feel like it is the "model" of what you see: L2/3 is like what you see when you imagine seeing it. So when you actually see something, both L4 and L2/3 are going to be active -- as you are simultaneously seeing and classifying/modeling what you see. L2/3 seems to be just locally max pooling, but it is probably just less sensitive to the local features and pooling data from more distal sources.
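
To make that contrast concrete, here is a toy comparison -- purely illustrative, not a circuit model: a strictly local max pool next to a unit that takes a soft, distance-weighted pool over a much wider neighborhood, so it cares less about any one local feature and listens more to distal sources.

```
import numpy as np

def local_max_pool(responses, center, radius=1):
    """Strict local pooling: max over a small neighborhood of simple-cell responses."""
    r, c = center
    return responses[r-radius:r+radius+1, c-radius:c+radius+1].max()

def distal_pool(responses, center, sigma=6.0):
    """Toy alternative: a soft, distance-weighted pool over the whole map."""
    h, w = responses.shape
    rows, cols = np.mgrid[0:h, 0:w]
    dist2 = (rows - center[0])**2 + (cols - center[1])**2
    weights = np.exp(-dist2 / (2 * sigma**2))
    # softmax-like weighting of responses by both strength and distance
    scores = weights * np.exp(responses - responses.max())
    return float(np.sum(scores * responses) / np.sum(scores))

responses = np.random.default_rng(2).standard_normal((20, 20))
print(local_max_pool(responses, (10, 10)), distal_pool(responses, (10, 10)))
```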

So the deep belief networks model the feed-forward pathway: low-level features in V1, texture features in V2, shape features in V4, motion in MT. The higher-level areas were likely shaped by evolution to be particularly good at making their transformations. The nervous system is so amazingly flexible that the mechanisms for the computations up the feed-forward pathway could be very diverse. But the main idea would be looking for particular correlations in the features.

The brain does not seem to be a simple vertical hierarchy the way many of the deep belief nets are; it kind of fans out and then comes back together at the top. Further, the brain utilizes every level of the feature space, whereas these deep belief nets seem to rely just on the output of the top (mainly because they are trying to make a classification). There is the what pathway and the where pathway -- an object-centric, spatially invariant classification system, and relational, predictive, operational centers bound together at various levels in the hierarchy.
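
One way to picture "using every level" rather than only the top, in the spirit of skip-connection or hypercolumn-style readouts: a toy readout that concatenates a summary of each level's features. The shapes and weights below are made up for illustration.

```
import numpy as np

def classify_from_all_levels(level_features, readout_weights):
    """Concatenate (pooled) features from every level and let one linear
    readout use all of them, instead of reading out only the top level."""
    pooled = [np.asarray(f).mean(axis=(0, 1)) if np.ndim(f) > 1 else np.asarray(f)
              for f in level_features]          # crude summary of each level
    combined = np.concatenate([p.ravel() for p in pooled])
    return combined @ readout_weights           # linear readout over all levels

# toy usage: three "levels" of feature maps of decreasing size, 4 output classes
rng = np.random.default_rng(3)
levels = [rng.standard_normal((16, 16, 8)),
          rng.standard_normal((8, 8, 16)),
          rng.standard_normal((32,))]
weights = rng.standard_normal((8 + 16 + 32, 4))
print(classify_from_all_levels(levels, weights).shape)   # (4,)
```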

The what pathway is like the object consciousness. When you look at anything, say my dog Suki, your brain is representing information across the cortex simultaneously. V1 is representing the low-level sensory information -- the color of Suki's hair, her outline, the slits of her closed eyes. V2 is unifying the hair as a contiguous texture, V4 is outlining her dimensions, creating a 3D model of her shape that you can just feel. MT picks up on the subtle motions of her breathing and notices her ears flicker.

Further along the what pathway, more "higher-level" features seem to be represented. Face patches respond to different types of invariances -- regions for profiles, regions for frontal faces. My what pathway connects me to the knowledge that I'm looking at a dog, and that dogs have this typical shape. Higher up, I know that this is a particular dog. Higher up still, I associate my daily memories with Suki. Every time I look at Suki, this information cascades up in a few hundred milliseconds and becomes accessible to the rest of my brain.
