Friday, October 26, 2012

NN Course Lecture 2



14 billion = 14,000,000,000


Lecture 2
Part 1
An overview of the various kinds of neural networks.
Feed-forward NNs are the most widely used. This part also explains what a deep NN is: an FFNN with more than one hidden layer.
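As a concrete (and purely illustrative) sketch of such a network, here is a forward pass through an FFNN with two hidden layers; the layer sizes, logistic activation, and random weights are my own assumptions, not taken from the lecture.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=3)            # input vector
W1 = rng.normal(size=(4, 3))      # input -> hidden layer 1
W2 = rng.normal(size=(4, 4))      # hidden 1 -> hidden 2 (the second hidden layer is what makes it "deep")
W3 = rng.normal(size=(2, 4))      # hidden layer 2 -> output

h1 = logistic(W1 @ x)
h2 = logistic(W2 @ h1)
y = W3 @ h2                       # information only flows forward
print(y)
```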
Then RNNs are covered in more detail; they are presumably still a hot research topic (not yet at the stage of large-scale application).
The harder part to grasp is the claim that "RNNs are a very natural way to model sequential data." At first glance it is not obvious what is natural about them.

First, the defining feature of an RNN: the graph of connections between units can contain cycles, i.e. following the connections from some node may eventually lead back to that same node. This is a clear difference from an FFNN, which only "looks forward".
Next, because RNNs have very complicated dynamics, training them is hard. What exactly these dynamics are is not explained; the lecture only says that finding efficient ways of training RNNs is currently a hot topic. Finally, from a biological point of view this kind of NN is more realistic, which I take to mean closer to how our brains actually work.
There is also this sentence: "Recurrent neural networks with multiple hidden layers are really just a special case of a general recurrent neural network that has some of its hidden to hidden connections missing." My understanding: an RNN with multiple hidden layers can be viewed as a special case of a general RNN with one big hidden layer, obtained by leaving out some of the hidden-to-hidden connections.

Next comes the claim that RNNs model sequential data very naturally. Two reasons are given. One is that an RNN is equivalent to a very deep NN with one hidden layer per time slice; the hidden layer in every time slice uses the same weights, and every time slice receives a (new?) input.
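Here is a minimal sketch of this "unrolled" picture, assuming a plain tanh RNN with made-up sizes and random data: the same two weight matrices are reused at every time slice, and each slice gets its own input.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, T = 3, 5, 4
W_in = rng.normal(size=(n_hidden, n_in))       # input  -> hidden (shared across time)
W_hh = rng.normal(size=(n_hidden, n_hidden))   # hidden -> hidden (shared across time)

xs = rng.normal(size=(T, n_in))                # one new input per time slice
h = np.zeros(n_hidden)                         # initial hidden state
for t in range(T):
    # Each iteration is one "layer" of the equivalent deep feed-forward net;
    # because W_in and W_hh are reused, all these layers share the same weights.
    h = np.tanh(W_in @ xs[t] + W_hh @ h)
print(h)  # hidden state after T time slices
```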

Then it is said that RNNs have a remarkable ability: they can remember information in their hidden state for a long time. The problem is that it is very hard to train them to use this potential.

Then an example of how striking RNNs can be: Ilya Sutskever trained an RNN to predict the next character in a sequence.

Finally, symmetrically connected networks (SCNNs). They cannot model cycles, but they are much easier to analyze than RNNs. They obey an energy function, which is what makes SCNNs more restricted than RNNs in what they can do. An SCNN without hidden units is called a Hopfield net; an SCNN with hidden units is called a Boltzmann machine, which is more powerful than a Hopfield net but less powerful than an RNN. However, Boltzmann machines have a beautifully simple learning algorithm.
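As a concrete illustration of the "energy function" idea, here is a small sketch of the standard energy of a Hopfield net (no hidden units); the particular weights, biases, and states below are invented for illustration, and the notes above do not spell out this formula.

```python
import numpy as np

def hopfield_energy(s, W, b):
    # E = -sum_{i<j} w_ij * s_i * s_j - sum_i b_i * s_i
    # W must be symmetric with a zero diagonal; the 0.5 counts each pair once.
    return -0.5 * s @ W @ s - b @ s

W = np.array([[ 0.0, 1.0, -2.0],
              [ 1.0, 0.0,  0.5],
              [-2.0, 0.5,  0.0]])   # symmetric connections
b = np.array([0.1, -0.3, 0.2])      # biases
s = np.array([1.0, -1.0, 1.0])      # +/-1 states of the (visible) units

print(hopfield_energy(s, W, b))
```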

Part 2
On perceptrons.

First, a comparison between the standard paradigm of statistical pattern recognition and the standard perceptron model.
Statistical pattern recognition:
1. Convert the raw input vector into a vector of feature activations.
2. Learn how to weight each feature to obtain a single scalar quantity.
3. If this scalar is above (or below) some threshold, decide that the input is an example of the target class.

The perceptron model is very similar to this process: it also learns weights and then uses an activation function to produce a scalar value.
The perceptron learning procedure is still widely used today for tasks with enormous feature vectors that contain many millions of features.
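As a sketch of how that procedure looks in practice, here is a minimal perceptron training loop; the toy AND dataset, the bias handled as an always-on first feature, and the epoch count are my own choices rather than anything from the lecture.

```python
import numpy as np

def predict(w, x):
    return 1 if w @ x >= 0 else 0      # threshold the weighted sum

# Toy, linearly separable data (logical AND).
# The first component of every x is 1.0 and plays the role of the bias.
X = np.array([[1., 0., 0.], [1., 0., 1.], [1., 1., 0.], [1., 1., 1.]])
t = np.array([0, 0, 0, 1])

w = np.zeros(3)
for epoch in range(10):
    for x, target in zip(X, t):
        y = predict(w, x)
        if y == 0 and target == 1:
            w += x                     # output too low: add the input vector
        elif y == 1 and target == 0:
            w -= x                     # output too high: subtract the input vector
        # if the output is correct, leave the weights alone

print(w, [predict(w, x) for x in X])
```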

Part 3
The concept of a hyperplane.
The equation Σᵢ wᵢxᵢ = 0, i.e. w·x = 0, defines a hyperplane.
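A tiny numeric illustration of this, with made-up vectors: points with w·x = 0 lie on the hyperplane, and the sign of w·x tells you which side any other point is on.

```python
import numpy as np

w = np.array([1.0, -2.0, 0.5])
on_plane = np.array([2.0, 1.0, 0.0])      # w . x = 0  -> lies on the hyperplane
one_side = np.array([1.0, 0.0, 0.0])      # w . x > 0  -> one side
other_side = np.array([0.0, 1.0, 0.0])    # w . x < 0  -> the other side

for x in (on_plane, one_side, other_side):
    print(x, np.sign(w @ x))
```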

Part 4
Why does the learning work?

The difference between the feasible region and the "generously feasible" region.
Someone posted the following question about this on the course forum:

It's not clear to me why the introduction of a generously feasible solution changes the informal proof in Lecture 2d from incomplete to complete. I understand that without a generously feasible solution and a minimum decrement the weight vector may move away from the feasible weight vector(s) during some iterations, but if a solution exists won't it converge eventually? I can see how a minimum decrement reduces oscillation before convergence, but I don't see why it changes the proof from impossible to possible. I'm assuming decrements are not infinitesimally small.

One of the answers to this question is very good:


There is no change in the algorithm, so it is not a change in the behavior of the weight vector, just in whether this proof technique works.

The change is from "D starts at 50. If D gets to 10 we win. D either decreases by 1 or increases by 1 at each step." Then you haven't proved D can't oscillate 40-41-40-41 forever, for example.

Using a generously feasible vector changes it to "D starts at 100. If it gets to 10 we win. D decreases by 1 at each step."
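
To see the "D only decreases" version in action, here is a small simulation sketch (my own construction, not Hinton's proof): pick a generously feasible weight vector w_gen that classifies every training case with a margin of at least that case's squared length, then watch the squared distance from the learned weights to w_gen drop on every mistake. The toy dataset and w_gen are invented for illustration.

```python
import numpy as np

X = np.array([[1., 0., 0.], [1., 0., 1.], [1., 1., 0.], [1., 1., 1.]])
t = np.array([0, 0, 0, 1])
w_gen = np.array([-8., 6., 6.])   # generously feasible for this toy dataset
w = np.zeros(3)

for epoch in range(10):
    for x, target in zip(X, t):
        y = 1 if w @ x >= 0 else 0
        if y != target:
            before = np.sum((w_gen - w) ** 2)
            w = w + x if target == 1 else w - x   # standard perceptron update
            after = np.sum((w_gen - w) ** 2)
            # The squared distance to w_gen drops by at least |x|^2 > 0 on every
            # mistake, so it cannot oscillate forever: the "D only decreases" story.
            print(f"mistake on {x}: D^2 {before:.0f} -> {after:.0f}")
```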



Part 5
The limitations come from the features the perceptron relies on.

There is still a lot you can do even if you don't learn the features,
e.g. hand-defining a huge number of features.

group invariance theorem

So the tricky part of pattern recognition must be solved by the hand-coded feature detectors, not by the learning procedure.

The longer-term conclusion is that NNs are only going to be really powerful if we can learn the feature detectors. It's not enough just to learn a weighted sum of feature detectors; we have to learn the feature detectors themselves.

More layers of linear units do not help; the result is still linear.
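A quick check of this point, with arbitrary random matrices: two stacked linear layers compute exactly the same function as one suitably chosen linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
x = rng.normal(size=3)

two_layers = W2 @ (W1 @ x)        # a "deep" network of linear units
one_layer = (W2 @ W1) @ x         # the equivalent single linear layer
print(np.allclose(two_layers, one_layer))   # True
```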

Fixed output non-linearities are not enough.
What we need is multiple layers of adaptive non-linear hidden units, and the problem is how to train such NNs.

Minsky and Papert's "Group Invariance Theorem" says that the part of a Perceptron that learns cannot learn to do this if the transformations form a group.
The "this" in the sentence above refers to pattern recognition tasks in which the patterns can undergo transformations.

Summary
In my view, the hard part of this lecture is understanding the difference between the "feasible region" and the "generously feasible region". Understanding that is what you need in order to decide whether the proof in Lecture 2d holds.
