Graves, A., Wayne, G., & Danihelka, I. (2014). Neural Turing Machines. arXiv preprint arXiv:1410.5401.
Paper from DeepMind.

"Computer programs make use of three fundamental mechanisms: elementary operations (e.g., arithmetic operations), logical flow control (branching), and external memory, which can be written to and read from in the course of computation (Von Neumann, 1945)"

"Recurrent neural networks (RNNs) stand out from other machine learning methods for their ability to learn and carry out complicated transformations of data over extended periods of time. Moreover, it is known that RNNs are Turing-Complete (Siegelmann and Sontag, 1995), and therefore have the capacity to simulate arbitrary procedures, if properly wired."

"While the mechanisms of working memory remain somewhat obscure at the level of neurophysiology, the verbal definition is understood to mean a capacity for short-term storage of information and its rule-based manipulation (Baddeley et al., 2009 [this is a textbook]). In computational terms, these rules are simple programs, and the stored information constitutes the arguments of these programs. Therefore, an NTM resembles a working memory system, as it is designed to solve tasks that require the application of approximate rules to “rapidly-created variables.” Rapidly-created variables (Hadley, 2009) are data that are quickly bound to memory slots, in the same way that the number 3 and the number 4 are put inside registers in a conventional computer and added to make 7 (Minsky, 1967). An NTM bears another close resemblance to models of working memory since the NTM architecture uses an attentional process to read from and write to memory selectively."

"Fodor and Pylyshyn also argued that neural networks with fixed-length input domains could not reproduce human capabilities in tasks that involve processing variable-length structures. In response to this criticism, neural network researchers including Hinton (Hinton, 1986), Smolensky (Smolensky, 1990), Touretzky (Touretzky, 1990), Pollack (Pollack, 1990), Plate (Plate, 2003), and Kanerva (Kanerva, 2009) investigated specific mechanisms that could support both variable-binding and variable-length structure within a connectionist framework."

"A crucial innovation to recurrent networks was the Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997). This very general architecture was developed for a specific purpose, to address the “vanishing and exploding gradient” problem (Hochreiter et al., 2001), which we might relabel the problem of “vanishing and exploding sensitivity.”"

So the idea is that there is a memory matrix, called M_t, which a neural network is effectively using as RAM. M_t is NxM, where N is the number of memory locations, and M is the vector size of each location.

It seems like they are using the semantic pointer ideas of Eliasmith, and they cite his book.

w_t is a vector of weightings over the N locations. Its entries are non-negative and sum to 1 (a normalized attention distribution, not unit Euclidean length).

r_t is a length-M read vector, formed as a weighted sum over the memory rows: r_t = sum_i w_t(i) M_t(i).

To write to memory, there are two steps: an erase step followed by an add step.

e_t: an M-element erase vector (entries in (0, 1))

a_t: an M-element add vector
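The read and the erase-then-add write can be sketched in numpy. This is a minimal toy, not the trained network; the memory width is renamed W here because the paper overloads "M" for both the matrix and the vector size.

```python
import numpy as np

N, W = 8, 4                          # N locations, W elements per location
M = np.zeros((N, W))                 # the memory matrix M_t

def read(M, w):
    """r_t = sum_i w_t(i) M_t(i): a weighted sum over memory rows."""
    return w @ M                     # shape (W,)

def write(M, w, e, a):
    """Two-step write, per location i:
       erase: M~(i) = M(i) * (1 - w(i) e)
       add:   M(i)  = M~(i) + w(i) a"""
    M = M * (1 - np.outer(w, e))     # erase step
    M = M + np.outer(w, a)           # add step
    return M

# Focus entirely on location 2 and store a vector there.
w = np.zeros(N); w[2] = 1.0
M = write(M, w, e=np.ones(W), a=np.array([1., 2., 3., 4.]))
print(read(M, w))                    # recovers the vector just written
```

With a one-hot weighting this behaves exactly like RAM; a soft weighting smears the erase and add across several locations at once.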

They employ both content-based and location-based addressing. They note that content-based addressing is strictly more general (the content of a location can include location information), but retain location-based addressing for simplicity and generalization.

To focus by content, the read/write heads produce a length-M key vector k_t that is compared to each row M_t(i) by a similarity measure (cosine similarity). \beta_t is a key strength that sharpens or flattens the resulting focus.
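A minimal sketch of the content addressing: a softmax over the cosine similarities between the key and each memory row, scaled by beta.

```python
import numpy as np

def content_address(M, k, beta):
    """w(i) proportional to exp(beta * cosine(k, M(i)))."""
    sims = (M @ k) / (np.linalg.norm(M, axis=1) * np.linalg.norm(k) + 1e-8)
    z = np.exp(beta * sims)
    return z / z.sum()               # normalized weighting over N locations

M = np.array([[1., 0.],
              [0., 1.],
              [1., 1.]])
w = content_address(M, k=np.array([1., 0.]), beta=10.0)
print(w.argmax())                    # row 0 matches the key best
```

A large beta makes the weighting nearly one-hot on the best match; beta near 0 spreads attention almost uniformly.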

For the location-based addressing, the idea is that the weighting goes through a rotational shift. So if a weighting was focused solely on one memory location (i.e. w = [0 1 0 0 ... 0]) then a rotation of 1 would move it to the next location (i.e. w= [0 0 1 0 ... 0]).
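The rotational shift is a circular convolution of the weighting with a shift distribution. A toy version, with a one-hot shift vector standing in for the network's learned shift weighting:

```python
import numpy as np

def shift(w, s):
    """w~(i) = sum_j w(j) * s((i - j) mod N) -- circular convolution."""
    N = len(w)
    return np.array([sum(w[j] * s[(i - j) % N] for j in range(N))
                     for i in range(N)])

w = np.array([0., 1., 0., 0.])       # focused solely on location 1
s = np.array([0., 1., 0., 0.])       # s(1) = 1 encodes "rotate by +1"
print(shift(w, s))                   # focus moves to location 2
```

If s spreads mass over several offsets, the shifted weighting blurs; the paper adds a sharpening step afterwards to counteract this.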

Ok, so then they use a neural network controller to drive the addressing. And they compare three setups: a feed-forward NN with the memory M, an LSTM NN with the memory M, and an LSTM by itself.

The first task is a copy task, where (they say) the network has learned, via gradient descent, to implement the following algorithm:

"We believe that the sequence of operations performed by the network can be summarised by the following pseudocode:

    initialise: move head to start location
    while input delimiter not seen do
        receive input vector
        write input to head location
        increment head location by 1
    end while
    return head to start location
    while true do
        read output vector from head location
        emit output
        increment head location by 1
    end while

This is essentially how a human programmer would perform the same task in a low-level programming language."
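The pseudocode transcribes almost line for line into Python, with a dict standing in for the memory tape (a toy rendering of the learned behaviour, not the network itself):

```python
def copy_task(inputs):
    memory, head = {}, 0
    for vec in inputs:               # until the input delimiter
        memory[head] = vec           # write input to head location
        head += 1                    # increment head location by 1
    out, head = [], 0                # return head to start location
    for _ in range(len(inputs)):     # (the NTM loops "while true")
        out.append(memory[head])     # read output vector from head location
        head += 1                    # increment head location by 1
    return out

print(copy_task([[1, 0], [0, 1]]))  # [[1, 0], [0, 1]]
```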

They next perform a repeat copy task, where the NTM is shown a pattern and told to repeat it a given number of times. The NTM does much better than the LSTM.

[I'm wondering about this weighting matrix: in these tasks they seem to be using an essentially one-hot weighting, accessing a single row of M with each read or write. But it seems like the weighting could be any distribution, and M could then hold a distributed memory.]

The next task, associative recall, requires the network to jump around its memory rather than step through it sequentially.

"This implies the following memory-access algorithm: when each item delimiter is presented, the controller writes a compressed representation of the previous three time slices of the item. After the query arrives, the controller recomputes the same compressed representation of the query item, uses a content-based lookup to find the location where it wrote the first representation, and then shifts by one to produce the subsequent item in the sequence (thereby combining content-based lookup with location-based offsetting)."
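That combined access pattern can be sketched as: address by content, then rotate the resulting weighting by one to land on the *next* stored item. A hard `np.roll` stands in here for the NTM's learned circular-convolution shift.

```python
import numpy as np

def content_address(M, k, beta):
    """Softmax over cosine similarities between key k and memory rows."""
    sims = (M @ k) / (np.linalg.norm(M, axis=1) * np.linalg.norm(k) + 1e-8)
    z = np.exp(beta * sims)
    return z / z.sum()

items = np.array([[1., 0., 0.],      # stored sequence of item representations
                  [0., 1., 0.],
                  [0., 0., 1.]])
w = content_address(items, k=np.array([1., 0., 0.]), beta=20.0)  # find item 0
w_next = np.roll(w, 1)               # location-based offset of +1
print(items[w_next.argmax()])        # item 1, the successor in the sequence
```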

Then they show results on a couple more tasks, including a sorting task (priority sort).