This looks big. Basically it's a neural network that learned how to communicate in English from scratch. Mind you, it's no Shakespeare, and frankly it can only answer relatively simple queries. But the fact that it can now respond logically to a question like "Is the mother older than the son?" when provided with information like "The mother is older than the father; the father is older than the son" puts it closer to Turing-test capable than any other AI.

Link to original paper
Hmmm. Neural gating mechanisms, check. An executive neural architecture making decisions and running subnetworks and symbolic machinery, check. At first I thought this was somebody I know, because I've been batting these ideas (and others) around with this guy for months. But it's a team of people, and they're publishing from Texas A&M, not Stanford.

It looks like they did a lot of very complicated (and very successful!) fixed design to make a learning system that is very specifically built to learn language. I'm working on something drastically more general in terms of self-evolving neural structural adaptation, but I didn't expect my creation to be able to handle language as well as this one does.

But, hmmm. It looks like every bit of that carefully designed system can be modeled as a state of the structure in the structurally-adaptable neural net software I've already written. So if I turn my 'self-mutation' bits off, I could probably reproduce this result.

At the moment I'm bogged down in writing consistency checks so it doesn't self-destruct (again) the next time I turn the structural-adaptation functionality on. But what they've done is mostly built out of the same parts I already have software for, working together the same way I put them together.

This is SO COOL! This means I'm not completely crazy, and these parts actually WORK TOGETHER! This thing working this well means that what I'm doing can work too.

It also means I better light a fire under my butt, or somebody's going to beat me to publication.

By the way... Here are some "freebies" for everyone who wants to work with deep neural network architectures. I'm not getting into my mad-science stuff here, I'm just letting people know there's been a complete revolution in what we can do with Neural Networks in the last ten years, and you can get most of the benefits of it with three simple changes to your code.

First, DON'T USE THE LOGISTIC SIGMOID 1/(1+exp(-x)) as an activation function! It can't be used successfully to train anything more than 2 hidden layers deep. Don't use ANY activation function with exponentially decaying tails, which rules out all the "popular" sigmoids from a dozen years ago, including arctan and tanh as well: while those can be used with Glorot-Bengio (G-B) initialization to train a network with 3 hidden layers, or sometimes even 4, they aren't reliable at 4 and aren't usable for anything deeper.

Instead, use an activation function with subexponential tail convergence, such as 1/(1 + abs(x)); with G-B initialization (and, well, a lot of computer time) you can successfully train a network 6 or 8 hidden layers deep, just using standard backprop.
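To make the tail behavior concrete, here's a small numerical sketch (mine, not from any paper): the logistic sigmoid's gradient vanishes exponentially in |x|, while the gradient of 1/(1 + abs(x)) decays only polynomially, so a saturated unit still passes back a usable signal.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def logistic_grad(x):
    s = logistic(x)
    return s * (1.0 - s)          # exponentially small for large |x|

def subexp(x):
    # The subexponential-tail activation mentioned above.
    return 1.0 / (1.0 + np.abs(x))

def subexp_grad(x):
    return -np.sign(x) / (1.0 + np.abs(x)) ** 2   # only polynomial decay

x = 10.0
print(logistic_grad(x))           # ~4.5e-5: almost no gradient left
print(abs(subexp_grad(x)))        # ~8.3e-3: two orders of magnitude more
```

At x = 10 the subexponential tail still delivers roughly 180 times the gradient of the logistic curve, which is why saturated lower layers can keep learning.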

Initialization, and a sigmoid with subexponential tail convergence, are more important than we realized until just about six years ago: if you don't do initialization exactly right, or use the wrong sigmoid, the nodes on the lower layers saturate before the network ever converges (and therefore it will usually converge on one of the crappiest local minima that exists in your problem space).

Careful initialization on most networks means using the Glorot-Bengio scheme: connections between a layer with n nodes and a layer with m nodes should be initialized to random values drawn uniformly between plus and minus sqrt(6)/sqrt(n+m).
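A minimal sketch of that scheme in numpy (the function name is mine; this is just the uniform ±sqrt(6)/sqrt(n+m) draw described above):

```python
import numpy as np

def glorot_uniform(n, m, rng=None):
    """Weights for a fully connected layer between n and m nodes,
    drawn uniformly from +/- sqrt(6)/sqrt(n+m)."""
    rng = rng or np.random.default_rng()
    limit = np.sqrt(6.0) / np.sqrt(n + m)
    return rng.uniform(-limit, limit, size=(n, m))

W = glorot_uniform(256, 128)
print(W.shape)                                        # (256, 128)
print(np.abs(W).max() <= np.sqrt(6.0 / (256 + 128)))  # True
```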

But if the networks are wide as well as deep, then even Glorot-Bengio initialization is not your friend, thanks to the law of large numbers: inputs get so mixed and averaged out across the layers that there is essentially no gradient left to work with at the higher layers. There's a modification: calculate the initialization as if one of the layers had fewer nodes, then hook up every node in the other layer to just that many randomly selected nodes of that layer. This can leave most of your connections with zero weights, and that's okay.
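Here's a sketch of that sparse variant as I read it; the fan-in cap `k` and the per-output-node selection are my reading of the trick, not a published recipe:

```python
import numpy as np

def sparse_glorot(n, m, k, rng=None):
    """Connect each of the m output nodes to only k randomly chosen
    input nodes, computing the uniform limit as if the input layer
    had k nodes. All other weights stay at zero."""
    rng = rng or np.random.default_rng(0)
    W = np.zeros((n, m))
    limit = np.sqrt(6.0) / np.sqrt(k + m)   # init "as if" fan-in were k
    for j in range(m):
        picks = rng.choice(n, size=k, replace=False)
        W[picks, j] = rng.uniform(-limit, limit, size=k)
    return W

W = sparse_glorot(4096, 4096, k=64)
print((W != 0).mean())   # most connections are exactly zero, and that's okay
```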

Hmm, what else? Dropout training allows you to build complex models with good generalization while avoiding overfitting the training data. Simply put: every time you present a training example, randomly select half the nodes in the network and pretend they don't exist. Double the outputs of the ones you are using, and run the example. Then do backpropagation on just the nodes you actually used. The results are dramatic.
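That per-example procedure can be sketched as follows (a toy numpy version of one hidden layer's forward pass; dropping half the units and doubling the survivors, exactly as described above):

```python
import numpy as np

def dropout_forward(h, p=0.5, rng=None):
    """Drop each hidden activation with probability p for this training
    example, and double (scale by 1/(1-p)) the survivors."""
    rng = rng or np.random.default_rng(0)
    mask = rng.random(h.shape) >= p         # True = node kept this time
    return h * mask / (1.0 - p), mask

h = np.ones(1000)                # pretend hidden-layer activations
out, mask = dropout_forward(h)
print(mask.mean())               # ~0.5: about half the nodes survive
print(out[mask][:3])             # survivors are doubled to 2.0
# Backprop then updates only the weights feeding the nodes in `mask`.
```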

There are a lot of tougher and more complicated tricks, but just this much - G-B initialization, a sigmoid having subexponential tails, and dropout training - allows ordinary backprop training to reach a hell of a lot deeper, and generalization to work a hell of a lot better, than we ever thought it could a dozen years ago.

Oh, what the hell. Here's another freebie - and while this one *IS* from my mad science, the results are very repeatable by sane scientists. I call it the Magic Sigmoid.

The activation function for the magic sigmoid is x/(1 + abs(x + cuberoot(x)))

Hardly anybody uses this one, because training is very slow. But I adore it, because its tail convergence is not just subexponential, it's subhyperbolic. The reason it's slow is that it trains the upper layers of weights at only about the same rate as the deepest layers of weights. And whereas G-B initialization is balanced on a razor's edge of stability with 1/(1+abs(x)), it sits firmly within a broad range of stability for the magic sigmoid, meaning that if you have the time you can train networks to ANY depth. But, um, yeah, slow steady progress or not, that is a lot of time.
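For the curious, here's the magic sigmoid exactly as given above, in numpy (np.cbrt is the cube root); you can watch how slowly the tail creeps toward ±1:

```python
import numpy as np

def magic_sigmoid(x):
    # x / (1 + |x + cbrt(x)|), as defined above.
    return x / (1.0 + np.abs(x + np.cbrt(x)))

for x in (0.0, 1.0, 10.0, 1000.0):
    print(x, magic_sigmoid(x))
# Even at x = 1000 the output is still only about 0.989 --
# the subhyperbolic tail never quite lets go of its gradient.
```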