This week I'm reading about a type of artificial neural network called Long Short-Term Memory networks (LSTMs for short), which I've heard about a number of times but never actually learned about beyond the name. The tutorial I'm reading comes highly recommended by my friend Google (I searched "LSTM tutorial").
The details are useful, and the idea of explicitly building memory management into an RNN makes a lot of sense. The basic idea is that if you want your artificial neural network to have memory of stuff that happened in the past, then you should specifically arrange it so that it has that property. Interestingly, just taking some outputs and plugging them back in as inputs (a standard RNN) struggles to actually learn long-term dependencies in practice. From my perspective, the idea of structuring your network to have a desired property is the same rough idea as how Convolutional Neural Networks apply the same operation across all of visual space, since we expect vision in the middle of the image to work pretty much like vision at the edges.
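To make the "explicit memory" point concrete for myself, here's a rough sketch of a single LSTM step in NumPy. This is my own toy version, not code from the tutorial, and the weight names are made up for illustration; the point is just to show where the separate cell state and the gates that manage it live. A vanilla RNN step, by contrast, would just be h = tanh(W @ [h_prev, x] + b), with nothing playing the role of c.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step(x, h_prev, c_prev, W, b):
        # W maps the concatenation [h_prev, x] to all four gate pre-activations.
        z = W @ np.concatenate([h_prev, x]) + b
        f, i, o, g = np.split(z, 4)
        f = sigmoid(f)          # forget gate: what to erase from the cell state
        i = sigmoid(i)          # input gate: how much new information to write
        o = sigmoid(o)          # output gate: what part of the cell to expose
        g = np.tanh(g)          # candidate values to write into the cell
        c = f * c_prev + i * g  # the explicit "memory" the tutorial talks about
        h = o * np.tanh(c)      # hidden state passed on to the next time step
        return h, c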
Anyway, the tutorial does a great job of explaining how they work, so I won't repeat it here. I'm curious what advances have happened since 2015, when it was written. I know people have already tried applying 'attention' to these networks, as mentioned in the conclusion, though I don't know how well that works or how closely it mimics the brain's version of attention. Certainly no one has found an LSTM module in the brain yet, so I doubt anyone bolting extra pieces onto these networks is aiming too hard for biological plausibility.
One thing I struggle with when dealing with these kinds of learning algorithms is that it's very difficult to tell (without playing around with them for a while) which kinds of changes to the architecture will make substantial changes in how the network behaves. The tutorial shows a few variants that conceptually seem to let the network do the same thing. Do those differences matter substantially? They likely do, but it's certainly not clear how. I'd love to know if anyone has a decent way to figure out an answer to that kind of question. Is the answer just a shotgun approach? Throw your algorithm/structure at everything and see what works? Are there any good ways to visualize the loss function that might provide insight? For example, are there classes of problems where the loss function is a spiky mess full of local minima?
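One crude trick I've seen for peeking at a loss surface (again, not from the tutorial, and loss_fn, params_a, and params_b here are hypothetical stand-ins) is to train a model twice, then plot the loss along the straight line between the two sets of weights. A smooth dip suggests the two solutions share a basin; a bumpy curve hints at the spiky mess I'm worried about.

    import numpy as np

    def interpolated_losses(loss_fn, params_a, params_b, steps=50):
        # Evaluate the loss along the straight line between two flattened
        # parameter vectors from two separate training runs.
        alphas = np.linspace(0.0, 1.0, steps)
        return [loss_fn((1 - a) * params_a + a * params_b) for a in alphas]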
I'll continue to ponder all of this, but I'd love to get some input if you have any to give.