Welcome to the Parallel World: October 2012

Monday, October 29, 2012

NN Course Lecture 3

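Part 1
Learning the weights of a linear neuron
1. First question
The overview at the start of the lecture says that linear neurons are very similar to perceptrons. The difference is that perceptron learning moves the weights closer to an ideal set of weights with each iteration, whereas linear neuron learning moves the outputs closer to the target outputs.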

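2. Second question
Why can't the perceptron learning algorithm be used for NNs that contain hidden layers?
The reason: in a NN with hidden layers, the set of weight vectors that solve the problem is not necessarily convex.
After giving the reason, the lecture draws two conclusions.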
And to prove that when they're learning something is improving, we don't use the same kind of proof at all.
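First, the proof that learning is "making progress" no longer follows the approach used for perceptrons.
Second, multi-layer NNs do not use the perceptron learning algorithm. There is also no such thing as a multi-layer perceptron: "perceptron" refers only to the simple kind of NN that has just an input layer and an output layer.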

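3. Multi-layer NNs need a method different from the perceptron's to make learning progress.

4. Progress is judged by whether the outputs keep getting closer to the target values. This criterion also works for problems whose solution set is not convex, which is where it differs from perceptron learning. In perceptron learning, the outputs as a whole may move further from the targets even while the weights get closer to an ideal set of weights. This leads to a linear neuron with a squared error measure.

5. In electrical engineering, a linear neuron is also called a linear filter.

6. The output is a real value, computed as a weighted sum of the inputs.

7. The squared distance measures the gap between the output and the target; this is called the error. The error is the sum of the squared differences over all training cases.

8. The learning goal of a linear neuron is to keep reducing this error.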

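9. Third question
Why not solve this learning problem analytically? (Does "analytic" here simply mean solving the equations? In Chinese, 解析法 might be a better translation than 分析法.) (That is, write one equation per input case and then solve the equations.)
1) First answer, the scientific answer: we want a method that real neurons in the brain could actually use. The brain probably does not learn by solving a set of symbolic equations.
2) Second answer, the engineering answer: we want a general method that also solves the learning problem for multi-layer, non-linear NNs. The analytic approach depends on the problem (1) being linear (so the equations are solvable, and easily) and (2) using squared distance as the error measure. An iterative method is somewhat less efficient, but it generalizes much more easily to multi-layer, non-linear and other more complex settings.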
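
To make the contrast between the two answers concrete, here is a minimal Python sketch. The two meals, the item names and the prices 150 and 50 are made-up numbers for illustration, and both function names are hypothetical. The first function solves one equation per case directly; the second reaches the same weights by repeatedly stepping down the squared error (a batch form of the delta rule).

```python
# Hypothetical data: each case is (portions of fish, portions of chips).
# Targets were generated from made-up true prices 150 (fish) and 50 (chips).
cases = [(2.0, 5.0), (3.0, 1.0)]
targets = [550.0, 500.0]


def solve_analytically(cases, targets):
    """Solve the two equations w1*x1 + w2*x2 = t directly (Cramer's rule)."""
    (a, b), (c, d) = cases
    det = a * d - b * c
    w1 = (targets[0] * d - b * targets[1]) / det
    w2 = (a * targets[1] - targets[0] * c) / det
    return [w1, w2]


def solve_iteratively(cases, targets, learning_rate=0.01, steps=5000):
    """Start from zero weights and take many small steps that reduce the squared error."""
    w = [0.0, 0.0]
    for _ in range(steps):
        grad = [0.0, 0.0]
        for x, t in zip(cases, targets):
            y = sum(wi * xi for wi, xi in zip(w, x))  # linear neuron output
            for i, xi in enumerate(x):
                grad[i] += xi * (t - y)               # summed over all cases
        w = [wi + learning_rate * gi for wi, gi in zip(w, grad)]
    return w


analytic_w = solve_analytically(cases, targets)
iterative_w = solve_iteratively(cases, targets)
```

The iterative version needs thousands of passes on this toy problem, but its structure (compute outputs, compare with targets, nudge the weights) is the part that carries over to more complex networks.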

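10. Learning only aims to bring the output closer to the target. It does not guarantee that every individual weight improves: over some iterations a given weight may move further from its ideal (correct) value.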

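11. Fourth question
Does the iterative learning procedure reach the correct solution? There may be no perfect solution. If the learning rate is made small enough (that is, the learning steps are kept very small), the weights can get as close as desired to the best solution.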

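12. Fifth question
Speed of convergence. Learning can be very slow when two input dimensions are highly correlated. For example, if the portions of ketchup and chips in the input are always equal, it is hard to decide how to divide the price between the two.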

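13. Difference between the online delta rule and the perceptron learning rule
In perceptron learning, the weights are updated only when the perceptron makes a mistake. With the online delta rule they change at every step (every iteration). This is also why a learning rate factor is introduced.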
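
The difference in item 13 can be sketched as two small update functions. This is an illustrative sketch, not code from the course; the function names are hypothetical, the perceptron is assumed to use a binary threshold at zero with targets 0 or 1, and the bias is assumed to be folded into the input as a constant component.

```python
def perceptron_update(w, x, t):
    """Perceptron rule: binary threshold output; weights change only on a mistake."""
    y = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0
    if y == t:
        return list(w)              # correct answer: leave the weights alone
    sign = 1 if t == 1 else -1      # add x for a missed 1, subtract x for a false 1
    return [wi + sign * xi for wi, xi in zip(w, x)]


def delta_update(w, x, t, learning_rate):
    """Online delta rule: real-valued output; weights move by lr * x * (t - y) on every case."""
    y = sum(wi * xi for wi, xi in zip(w, x))
    return [wi + learning_rate * xi * (t - y) for wi, xi in zip(w, x)]


unchanged = perceptron_update([1.0, 1.0], [1.0, 2.0], 1)          # already correct
corrected = perceptron_update([-1.0, -1.0], [1.0, 2.0], 1)        # mistake, so add x
nudged = delta_update([1.0, 1.0], [1.0, 2.0], 4.0, learning_rate=0.1)
```

The perceptron step has a fixed size (the whole input vector), while the delta rule step shrinks as the output approaches the target, scaled by the learning rate.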

Part 2
Lecture 3b The error surface for a linear neuron
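1. The error surface of a linear neuron lives in a space obtained by extending weight space with one extra dimension for the error.

2. Picture a 3-dimensional space: the two horizontal axes are the weights w1 and w2, and the vertical axis is E.
A vertical cross-section of the error surface is then a parabola, and a horizontal cross-section is an ellipse.

3. Comparing online learning and batch learning
(The simplest form of) batch learning moves perpendicular to the contour lines of the error surface (steepest descent).
Online learning instead follows a zig-zag path, determined by the constraint lines (which come from the training cases) (zig-zags around the direction of steepest descent).

4. Elongated error ellipses make learning very slow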
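
One way to put a number on "elongated", assuming two weights and the error E = 1/2 * sum (t - y)^2: the shape of the contours is fixed by the 2x2 matrix sum(x x^T) over the training cases, and the ratio of the long axis to the short axis of each ellipse is the square root of the ratio of that matrix's eigenvalues. The sketch below uses made-up inputs and hypothetical function names.

```python
import math


def curvature_matrix(cases):
    """Entries (a, b, d) of the symmetric 2x2 matrix [[a, b], [b, d]] = sum of x x^T."""
    a = sum(x1 * x1 for x1, _ in cases)
    b = sum(x1 * x2 for x1, x2 in cases)
    d = sum(x2 * x2 for _, x2 in cases)
    return a, b, d


def ellipse_axis_ratio(cases):
    """Long axis divided by short axis of the elliptical error contours.

    Assumes the two inputs are not perfectly correlated; otherwise the
    smaller eigenvalue is zero and the ratio is unbounded.
    """
    a, b, d = curvature_matrix(cases)
    mean = (a + d) / 2.0
    spread = math.sqrt(((a - d) / 2.0) ** 2 + b * b)
    lam_max, lam_min = mean + spread, mean - spread   # eigenvalues of the matrix
    return math.sqrt(lam_max / lam_min)


# Made-up inputs: the second set has two almost identical input dimensions.
independent = [(1.0, 0.0), (0.0, 1.0), (1.0, -1.0), (1.0, 1.0)]
correlated = [(2.0, 2.1), (3.0, 2.9), (1.0, 1.1), (4.0, 4.0)]

round_ratio = ellipse_axis_ratio(independent)
long_ratio = ellipse_axis_ratio(correlated)
```

With the nearly identical inputs the contours become a long, thin valley, so steepest descent mostly bounces across the valley instead of moving along it.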

Part 3
Learning the weights of a logistic output neuron
1. Introduces the logistic output neuron (the sigmoid function as the activation function)
2. delta rule x*(t-y)
3. extra term y(1-y), the so-called "slope of the logistic".
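
Putting items 2 and 3 together for a single training case, and assuming the error is E = 1/2 * (t - y)^2 with y = logistic(sum of w_i * x_i), the derivative is dE/dw_i = -x_i * y * (1 - y) * (t - y). The sketch below checks that formula against finite differences; all function names are hypothetical and the bias is assumed to be folded into the inputs.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def output(w, x):
    return logistic(sum(wi * xi for wi, xi in zip(w, x)))


def squared_error(w, x, t):
    return 0.5 * (t - output(w, x)) ** 2


def error_gradient(w, x, t):
    """dE/dw_i = -x_i * y * (1 - y) * (t - y): delta rule term times the logistic slope."""
    y = output(w, x)
    return [-xi * y * (1.0 - y) * (t - y) for xi in x]


def numeric_gradient(w, x, t, eps=1e-6):
    """Central finite differences, used only to sanity-check the formula."""
    grads = []
    for i in range(len(w)):
        up = list(w)
        up[i] += eps
        down = list(w)
        down[i] -= eps
        grads.append((squared_error(up, x, t) - squared_error(down, x, t)) / (2 * eps))
    return grads


w, x, t = [0.5, -0.3], [1.0, 2.0], 1.0
analytic = error_gradient(w, x, t)
numeric = numeric_gradient(w, x, t)
```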

Part 4
The backpropagation algorithm
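This part covers how to train multi-layer networks.

1. Why do we need hidden layers?
Networks without hidden units are very limited in the input-output mappings they can model.

2. Approach: perturbing weights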

3. what's ‘activity’?

Part 5
How to use the derivatives computed by the backpropagation algorithm
1.Converting error derivatives into a learning procedure
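Backpropagation is an efficient way to compute the error derivatives dE/dw.
But using the computed error derivatives well takes some care. There are two issues to consider. One is optimization: How do we use the error derivatives on individual cases to discover a good set of weights? The other is generalization: How do we ensure that the learned weights work well for cases we did not see during training?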

2. The optimization issue has two aspects
(1) How often to update the weights
(2) How much to update
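
Both aspects show up as plain parameters in a training loop. The following is a minimal sketch for a linear neuron with squared error, not the course's code: the function name, the made-up data and the parameter values are all assumptions. Here `batch_size` answers "how often" and `learning_rate` answers "how much".

```python
def train(cases, targets, batch_size, learning_rate, epochs):
    """Train a linear neuron with squared error.

    batch_size sets how often the weights change: 1 is online learning,
    len(cases) is full-batch learning, anything in between is mini-batch.
    learning_rate sets how much they change per update.
    """
    w = [0.0] * len(cases[0])
    for _ in range(epochs):
        for start in range(0, len(cases), batch_size):
            batch = zip(cases[start:start + batch_size],
                        targets[start:start + batch_size])
            grad = [0.0] * len(w)
            for x, t in batch:
                y = sum(wi * xi for wi, xi in zip(w, x))
                for i, xi in enumerate(x):
                    grad[i] += xi * (t - y)
            # one weight update per batch
            w = [wi + learning_rate * gi for wi, gi in zip(w, grad)]
    return w


# Made-up data generated from the weights (150, 50).
cases = [(2.0, 5.0), (3.0, 1.0), (1.0, 1.0), (4.0, 2.0)]
targets = [550.0, 500.0, 200.0, 700.0]

online_w = train(cases, targets, batch_size=1, learning_rate=0.01, epochs=2000)
mini_batch_w = train(cases, targets, batch_size=2, learning_rate=0.01, epochs=2000)
full_batch_w = train(cases, targets, batch_size=4, learning_rate=0.01, epochs=2000)
```

On this small, exactly solvable problem all three settings end at the same weights; they differ in the path taken and in how many updates each pass over the data performs.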

3. For generalization, overfitting must be avoided.
If the model is very flexible it can model the sampling error really well. This is a disaster.

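As the above shows, overfitting comes up in the discussion of generalization: an overfitted model also fits the sampling error very well, which makes the model wrong and in turn unable to generalize.
Ways to avoid overfitting include the following.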
– Weight-decay
– Weight-sharing
– Early stopping
– Model averaging
– Bayesian fitting of neural nets
– Dropout
– Generative pre-training

Friday, October 26, 2012

NN Course Lecture 2

14 billion (140亿) = 14 000 000 000

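Lecture 2
Part 1
An overview of the various kinds of neural networks.
Feed-forward NNs (FFNN) are the most commonly used. This part also introduces what a deep NN is: a FFNN with more than one hidden layer.
Then the focus moves to RNNs. These are probably still a hot research topic (not yet at the stage of large-scale application).
The hard part to understand is "RNN are a very natural way to model sequential data." At first glance I could not see what is natural about it.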

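First, the defining feature of an RNN is that the connection graph between its units may contain cycles. That is, following the connections out of a node step by step may lead back to the same node. This is a clear difference from a FFNN, which can only "look forward".
Next, because RNNs have many complicated dynamics, they are hard to train. What exactly these dynamics are is not explained; the lecture only says that finding an effective way to train RNNs is a hot topic. Finally, it says that this kind of NN is more realistic from a biological point of view. I take that to mean it is closer to how our brains work.
There is also this sentence: Recurrent neural networks with multiple hidden layers are really just a special case of a general recurrent neural network that has some of its hidden to hidden connections missing. My reading is that an RNN with multiple hidden layers can be seen as a special case of an RNN with a single hidden layer (with some hidden to hidden connections left out).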

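Then comes the claim that RNNs are a very natural way to model sequential data. Two reasons are given. One is that an RNN is equivalent to a very deep NN with one hidden layer per time slice. The hidden layer in every time slice uses the same weights, and every time slice receives (new?) input.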
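
The "one hidden layer per time slice, same weights everywhere" picture can be sketched as a loop. This is a toy illustration rather than anything from the lecture: the tanh non-linearity, the names `w_in` and `w_rec`, and the tiny weight values are all assumptions.

```python
import math


def rnn_forward(inputs, w_in, w_rec, h0):
    """Unroll a tiny recurrent net over time and return the hidden state per slice.

    Each time slice acts like one layer of a deep feed-forward net; every
    slice reuses the same w_in and w_rec and receives its own input.
    """
    h = list(h0)
    states = []
    for x in inputs:                          # one iteration = one time slice
        h = [
            math.tanh(
                sum(w_in[j][i] * x[i] for i in range(len(x)))
                + sum(w_rec[j][k] * h[k] for k in range(len(h)))
            )
            for j in range(len(h0))
        ]
        states.append(h)
    return states


# Two hidden units, one input. Unit 0 reads the input; unit 1 reads unit 0's previous value.
w_in = [[1.0], [0.0]]
w_rec = [[0.0, 0.0], [1.0, 0.0]]
states = rnn_forward([[1.0], [0.0], [0.0]], w_in, w_rec, h0=[0.0, 0.0])
```

In this example the input arrives only in the first slice, yet it still affects hidden unit 1 in the second slice, because the hidden state carries it forward through the recurrent weights.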

Next, RNNs have the ability to remember information in their hidden state for a long time. The problem is that it is very hard to train them to use this potential.

Then comes an example of what RNNs can do: Ilya Sutskever trained an RNN to predict the next character in a sequence.

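Finally, symmetrically connected networks (SCNN). They cannot model cycles, but they are easier to analyze than RNNs. Because they obey an energy function, SCNNs are more restricted than RNNs in what they can do. SCNNs without hidden units are called Hopfield nets. SCNNs with hidden units are called Boltzmann machines: more powerful than Hopfield nets, less powerful than RNNs. However, Boltzmann machines have a beautifully simple learning algorithm.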

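Part 2
Perceptrons.

First, the standard paradigm for statistical pattern recognition is compared with the standard perceptron model.
Statistical pattern recognition
1. Convert the raw input vector into a feature vector.
2. Learn how to weight each feature, and obtain a single scalar value.
3. If this scalar is above (or below) some threshold, decide that the input is an example of a particular class.

The perceptron model is very similar to this procedure. It also learns weights, and then uses an activation function to obtain a scalar value.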
The perceptron learning procedure is still widely used today for tasks with enormous feature vectors that contain many millions of features.

Part 3
The concept of a hyperplane
sigma(w*x) = 0, i.e. the sum of w_i*x_i over all components being zero, defines a hyperplane.
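
As a small illustration of the hyperplane equation, the sign of the sum tells which side of the hyperplane a point is on. The expression is symmetric in w and x, so it can be read in input space (fixed w, varying x) or in weight space (fixed x, varying w). The function name below is hypothetical, and any bias is assumed to be folded in as an extra component.

```python
def side_of_hyperplane(w, x):
    """Return +1 or -1 for the two sides of sum(w_i * x_i) = 0, and 0 on the hyperplane itself."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    return (s > 0) - (s < 0)


w = [2.0, -1.0]
above = side_of_hyperplane(w, [1.0, 1.0])   # 2 - 1 = 1, positive side
below = side_of_hyperplane(w, [1.0, 3.0])   # 2 - 3 = -1, negative side
on = side_of_hyperplane(w, [1.0, 2.0])      # 2 - 2 = 0, on the hyperplane
```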

Part 4
why the learning works?

The difference between the generously feasible region and the feasible region
On this question, someone posted the following on the course forum.
“
It's not clear to me why the introduction of a generously feasible solution changes the informal proof in Lecture 2d from incomplete to complete. I understand that without a generously feasible solution and a minimum decrement the weight vector may move away from the feasible weight vector(s) during some iterations, but if a solution exists won't it converge eventually? I can see how a minimum decrement reduces oscillation before convergence, but I don't see why it changes the proof from impossible to possible. I'm assuming decrements are not infinitesimally small.
”
One reply to this question is very good.

“
There is no change in the algorithm, so it is not a change in the behavior of the weight vector, just in whether this proof technique works.

The change is from "D starts at 50. If D gets to 10 we win. D either decreases by 1 or increases by 1 at each step." Then you haven't proved D can't oscillate 40-41-40-41 forever, for example.

Using a generously feasible vector changes it to "D starts at 100. If it gets to 10 we win. D decreases by 1 at each step."
”
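
The "D decreases at each step" version of the argument can be checked numerically on a toy problem. In the sketch below D is the squared distance from the current weights to a generously feasible weight vector, taken here to mean a vector w* with w*·z >= z·z for every sign-flipped input z (so its margin to each constraint plane is at least the length of that input). The data, the starting weights and the function names are made up for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def run_perceptron(signed_inputs, w_start, w_generous, max_epochs=100):
    """Perceptron learning on sign-flipped inputs (z = x for target 1, z = -x for target 0).

    Returns the final weights and, for every update, the pair
    (drop in squared distance to w_generous, squared length of z).
    """
    w = list(w_start)
    drops = []
    for _ in range(max_epochs):
        mistakes = 0
        for z in signed_inputs:
            if dot(w, z) > 0:
                continue                      # this case is already correct
            before = squared_distance(w, w_generous)
            w = [wi + zi for wi, zi in zip(w, z)]
            drops.append((before - squared_distance(w, w_generous), dot(z, z)))
            mistakes += 1
        if mistakes == 0:
            break
    return w, drops


# First component is a constant 1 (bias); the last two cases had target 0 and are negated.
signed_inputs = [(1, 2, 1), (1, 1, 3), (-1, 1, 2), (-1, 2, 1)]
w_generous = (0, 3, 3)      # dot(w_generous, z) >= dot(z, z) holds for every z above
final_w, drops = run_perceptron(signed_inputs, (0, -3, 1), w_generous)
```

Every recorded drop is at least the squared length of the input that caused the update, so D cannot oscillate: it shrinks by a guaranteed minimum each time a mistake is corrected, which bounds the number of updates.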

Part 5
The limitations come from the features the perceptron relies on

still lots to do even if you don't learn features.
e.g. hand define a huge number of features

group invariance theorem

so the tricky part of pattern recognition must be solved by the hand-coded feature detectors, not the learning procedure.

the longer term conclusion is that NNs are only gonna be really powerful if we can learn the feature detectors. It's not enough just to learn the weighted sum of feature detectors, we have to learn the feature detectors themselves.

more layers of linear units do not help. It's still linear.
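
A quick way to see why stacked linear layers stay linear: applying two weight matrices in sequence gives the same output as applying their product once. The matrices and the helper names below are made up for illustration.

```python
def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(a, b):
    """Multiply matrix a by matrix b (both lists of rows)."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


w1 = [[1, 2], [0, 1], [3, -1]]    # first linear layer: 2 inputs -> 3 hidden units
w2 = [[1, -1, 2]]                 # second linear layer: 3 hidden units -> 1 output
x = [4, 5]

two_layer_output = matvec(w2, matvec(w1, x))
single_layer_output = matvec(matmul(w2, w1), x)
```

Both computations give the same value, so the hidden layer adds nothing that a single layer with weights w2·w1 could not already do.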

Fixed output non-linearities are not enough.
what we need is multiple layers of adaptive non-linear hidden units. And the problem is how we can train such NNs.

Minsky and Papert's "Group Invariance Theorem" says that the part of a Perceptron that learns cannot learn to do this if the transformations form a group.
In the sentence above, "this" refers to pattern recognition tasks in which the patterns undergo transformations.

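Summary
For me, the hard part of this lecture is understanding the difference between the "feasible region" and the "generously feasible region". Understanding that is what it takes to see whether the proof in Lecture 2d holds.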