Monday, November 3, 2014

Neural Turing Machines

Graves, A., Wayne, G., Danihelka, I. (2014) Neural Turing Machines. ArXiv

Paper from DeepMind.

"Computer programs make use of three fundamental mechanisms: elementary operations
(e.g., arithmetic operations), logical flow control (branching), and external memory, which
can be written to and read from in the course of computation (Von Neumann, 1945)"

"Recurrent neural networks (RNNs) stand out from other machine learning methods
for their ability to learn and carry out complicated transformations of data over extended
periods of time. Moreover, it is known that RNNs are Turing-Complete (Siegelmann and
Sontag, 1995), and therefore have the capacity to simulate arbitrary procedures, if properly
wired."

"While the mechanisms of working memory remain somewhat
obscure at the level of neurophysiology, the verbal definition is understood to mean
a capacity for short-term storage of information and its rule-based manipulation (Baddeley
et al., 2009 [this is a textbook]). In computational terms, these rules are simple programs, and the stored
information constitutes the arguments of these programs. Therefore, an NTM resembles
a working memory system, as it is designed to solve tasks that require the application of
approximate rules to “rapidly-created variables.” Rapidly-created variables (Hadley, 2009)
are data that are quickly bound to memory slots, in the same way that the number 3 and the
number 4 are put inside registers in a conventional computer and added to make 7 (Minsky,
1967). An NTM bears another close resemblance to models of working memory since the
NTM architecture uses an attentional process to read from and write to memory selectively."


"Fodor and Pylyshyn also argued that neural networks with fixedlength
input domains could not reproduce human capabilities in tasks that involve processing
variable-length structures. In response to this criticism, neural network researchers
including Hinton (Hinton, 1986), Smolensky (Smolensky, 1990), Touretzky (Touretzky,
1990), Pollack (Pollack, 1990), Plate (Plate, 2003), and Kanerva (Kanerva, 2009) investigated specific mechanisms that could support both variable-binding and variable-length
structure within a connectionist framework."

"A crucial innovation to recurrent networks was the Long Short-Term Memory (LSTM)
(Hochreiter and Schmidhuber, 1997). This very general architecture was developed for a
specific purpose, to address the “vanishing and exploding gradient” problem (Hochreiter
et al., 2001), which we might relabel the problem of “vanishing and exploding sensitivity.”"



So the idea is that there is a memory matrix, called M_t, which a neural network is effectively using as RAM. M_t is NxM, where N is the number of memory locations, and M is the vector size of each location.

It seems like they are using the semantic pointer ideas of Eliasmith, and they cite his book.

w_t is a vector of weightings over the N locations. The weights are non-negative and normalized to sum to 1, so reading is a weighted average over locations.
r_t is a length-M read vector: r_t = \sum_i w_t(i) M_t(i)
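Here's a minimal NumPy sketch of the read step as I understand it (toy sizes, random values standing in for what the controller would actually produce):

import numpy as np

# Toy sizes: N memory locations, each a length-M vector
N, M = 128, 20
M_t = np.random.randn(N, M)               # the memory matrix M_t

# w_t: weighting over the N locations, non-negative and summing to 1
logits = np.random.randn(N)
w_t = np.exp(logits - logits.max())
w_t /= w_t.sum()

# Read vector: convex combination of memory rows, r_t = sum_i w_t(i) M_t(i)
r_t = w_t @ M_t                           # length-M read vector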

To write to memory there are two steps, an erase step followed by an add step.
e_t: an M-element erase vector (entries between 0 and 1)
a_t: an M-element add vector
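A similar sketch for the write step, using the erase-then-add equations as I read them from the paper (again with made-up sizes and values):

import numpy as np

N, M = 128, 20
M_prev = np.random.randn(N, M)                    # M_{t-1}
w_t = np.random.rand(N); w_t /= w_t.sum()         # write weighting, sums to 1

e_t = np.random.rand(M)                           # erase vector, entries in (0, 1)
a_t = np.random.randn(M)                          # add vector

# Erase step: M~_t(i) = M_{t-1}(i) * (1 - w_t(i) e_t)   (elementwise)
M_tilde = M_prev * (1.0 - np.outer(w_t, e_t))
# Add step:   M_t(i) = M~_t(i) + w_t(i) a_t
M_t = M_tilde + np.outer(w_t, a_t)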


Content-based addressing vs. location-based: they employ both. They say content-based addressing is strictly more general (location information could itself be stored as content), but they still provide location-based addressing because for some data the content is arbitrary (e.g. the value of a variable) and what matters is where it is stored, and it makes iteration across locations simple.

To focus by content, each read/write head produces a length-M key vector k_t that is compared to each memory row M_t(i) by a similarity measure (cosine similarity); \beta_t is a key strength that sharpens or flattens the resulting weighting, which is then normalized with a softmax.
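A quick sketch of that content weighting (cosine similarity, sharpened by \beta_t, then a softmax; numbers made up):

import numpy as np

N, M = 128, 20
M_t = np.random.randn(N, M)
k_t = np.random.randn(M)              # key emitted by a head
beta_t = 5.0                          # key strength (>= 0); larger = sharper focus

def cosine(u, v, eps=1e-8):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps)

scores = np.array([beta_t * cosine(k_t, M_t[i]) for i in range(N)])
w_c = np.exp(scores - scores.max())
w_c /= w_c.sum()                      # content weighting over the N locations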


For the location-based addressing, the idea is that the weighting goes through a rotational shift. So if a weighting was focused solely on one memory location (i.e. w = [0 1 0 0 ... 0]) then a rotation of 1 would move it to the next location (i.e. w= [0 0 1 0 ... 0]).
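Here's a little sketch of that shift as a circular convolution of the weighting with a shift distribution s_t; the -1/0/+1 shift range is just my choice for illustration:

import numpy as np

# A one-hot weighting focused on location 1, and a shift distribution
# with all of its mass on "+1", so the focus should rotate to location 2.
N = 8
w = np.zeros(N); w[1] = 1.0           # weighting focused on location 1
shifts = np.array([-1, 0, 1])         # allowed shift offsets
s_t = np.array([0.0, 0.0, 1.0])       # all mass on a shift of +1

# Circular convolution: w_shifted(i) = sum_s w((i - s) mod N) * s_t(s)
w_shifted = np.zeros(N)
for i in range(N):
    for s, p in zip(shifts, s_t):
        w_shifted[i] += w[(i - s) % N] * p

print(w_shifted)                      # focus has rotated to location 2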


Ok, so then a neural network controller produces the parameters that drive this addressing. They compare an NTM with a feed-forward controller, an NTM with an LSTM controller, and a standard LSTM on its own.

The first task is a copy task. The network is trained end-to-end by gradient descent, and the authors argue that it has learned to implement the following algorithm:

"We believe that the sequence of operations performed by the network can be summarised by the following pseudocode:

initialise: move head to start location
while input delimiter not seen do
  receive input vector
  write input to head location
  increment head location by 1
end while
return head to start location
while true do
  read output vector from head location
  emit output
  increment head location by 1
end while

This is essentially how a human programmer would perform the same task in a low-level programming language."
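For my own reference, here's the same pseudocode transcribed literally into Python over a flat memory, with an integer head index standing in for the NTM's soft weighting:

import numpy as np

def copy_task(inputs, delimiter):
    memory = []
    head = 0                                   # move head to start location
    for x in inputs:
        if np.array_equal(x, delimiter):       # stop when the input delimiter arrives
            break
        memory.append(x)                       # write input to head location
        head += 1                              # increment head location by 1
    # return head to start location and read everything back out
    return [memory[i] for i in range(head)]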

They next perform a repeat copy task, where the NTM is shown a sequence and must reproduce it a given number of times. The NTM does much better than the LSTM.


[I'm wondering about these weightings: in this case they seem to be essentially one-hot, accessing a single location (row) of M_t at a time. But the weighting could be any distribution over locations, in which case M_t could hold a distributed memory.]


The next task, associative recall, requires the network to jump around its memory rather than simply step through it in sequence.



"This implies the following memory-access algorithm: when each item delimiter is presented, the controller writes a compressed representation of the previous three time slices of the item. After the
query arrives, the controller recomputes the same compressed representation of the query
item, uses a content-based lookup to find the location where it wrote the first representation,
and then shifts by one to produce the subsequent item in the sequence (thereby combining
content-based lookup with location-based offsetting)."
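A rough sketch of that combined addressing: a content lookup (with a noisy key standing in for the controller's compressed query representation), followed by a rotation of +1 to read out the next item. Sizes and values are made up:

import numpy as np

N, M = 16, 8
M_t = np.random.randn(N, M)                          # memory after the items were written
query_key = M_t[5] + 0.01 * np.random.randn(M)       # noisy key for the item at slot 5
beta = 20.0                                          # key strength

def cosine(u, v, eps=1e-8):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps)

# Content-based lookup: weighting concentrates on slot 5
scores = beta * np.array([cosine(query_key, row) for row in M_t])
w_c = np.exp(scores - scores.max()); w_c /= w_c.sum()

# Location-based offset: rotate the focus by one to the subsequent item
w_next = np.roll(w_c, 1)
answer = w_next @ M_t                                # read out the next stored item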

Then they show a dynamic N-grams task and, finally, a priority sort task.