Welcome to the Parallel World: 2012

Saturday, November 24, 2012

Gradient descent for NN (1)

Six variants of gradient descent:

Back propagation
delta-bar-delta (adaptive learning rate)
steepest descent
QuickProp
Gauss-Newton
Levenberg-Marquardt

Gradient descent for NN (2)

find the slope of the error surface
how much to go down the error surface
how the weights should be adjusted after each iteration
when to stop training and what governs the decision
whether the solution is suboptimal
how good the final model is and how to assess its goodness of fit
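
The questions above (slope, step size, weight update, stopping rule, quality of the result) all show up in even the smallest gradient descent loop. The following is a minimal sketch for a single linear neuron with a squared error, not a recipe from any of the listed variants; the function name, the learning rate, and the tolerance are my own assumptions.

```python
def train_linear_neuron(cases, lr=0.05, tol=1e-10, max_iters=10_000):
    """Batch gradient descent on E(w) = 1/2 * sum over cases of (t - w.x)^2.

    All names and default values here are illustrative assumptions.
    """
    n = len(cases[0][0])
    w = [0.0] * n
    prev_error = float("inf")
    for it in range(max_iters):
        # Slope of the error surface: dE/dw_i = -sum over cases of x_i * (t - y)
        grad = [0.0] * n
        error = 0.0
        for x, t in cases:
            y = sum(wi * xi for wi, xi in zip(w, x))
            error += 0.5 * (t - y) ** 2
            for i in range(n):
                grad[i] -= x[i] * (t - y)
        # Stopping rule: quit once the error no longer improves noticeably.
        if prev_error - error < tol:
            break
        prev_error = error
        # Weight update: step against the slope, scaled by the learning rate.
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w, error, it


# Targets generated by the (hypothetical) true weights [2, -1].
cases = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0), ((2.0, 1.0), 3.0)]
weights, final_error, iterations = train_linear_neuron(cases)
print(weights, final_error, iterations)
```

The final error doubles as a crude goodness-of-fit measure on the training cases; judging whether the solution generalizes would need held-out cases, which this sketch does not include.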

Friday, November 23, 2012

Reading notes on 《群体智能与仿生计算》 (Swarm Intelligence and Bio-inspired Computing) (1)

I came across this book by chance in the library. I have not read it in depth yet, so for now I only list the main algorithms it teaches:

Evolutionary Algorithm
Artificial Immune Algorithm
Memetic Algorithm (hybrid genetic algorithm, genetic local optimization)
Particle Swarm Optimization
Shuffled Frog Leaping Algorithm
Cat Swarm Optimization
Bacterial Foraging Optimization
Ant Colony Algorithm
Artificial Bee Colony Algorithm

Saturday, November 17, 2012

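Reading notes on Section 12.2 of 《大规模并行编程实战》

Large virtual and physical address spaces
Unified device address space
Configurable caching and scratchpad memory
Faster atomic operations
Faster global memory access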

Reference

Gelado, I. et al, An asymmetric distributed shared memory model for heterogeneous parallel systems.

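《OpenCL异构计算》 (Heterogeneous Computing with OpenCL)

Reading notes on Chapter 12: OpenCL profiling and debugging

Profiling

Event-based profiling
AMD APP Profiler
AMD APP KernelAnalyzer

Debugging

gDEBugger
AMD printf extension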

《OpenCL异构计算》

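Reading notes on Chapter 11: OpenCL extensions

Overview of the extension mechanism

OpenCL defines three kinds of extensions:

KHR extensions
EXT extensions
Vendor extensions

Device fission

The device fission EXT extension provides an interface for splitting an OpenCL device into multiple sub-devices.

Double precision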

cl_khr_fp64

Enable it with the following directive:

#pragma OPENCL EXTENSION cl_khr_fp64 : enable

Friday, November 16, 2012

《OpenCL异构计算》

Chapter 6: OpenCL implementations on CPU/GPU platforms

Reading notes on Section 3: OpenCL memory performance

Outline:

OpenCL global memory
Local memory: a software-managed cache

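OpenCL global memory

The basic approach to memory analysis is to determine the throughput a kernel actually achieves, which is obtained by computing the kernel's effective memory bandwidth:

EB = (Br + Bw) / T

where:

EB is the effective bandwidth;

Br is the number of bytes read from global memory;

Bw is the number of bytes written to global memory;

T is the kernel execution time, obtained from the OpenCL timing API or from a profiling tool.

Comparing the measured effective bandwidth with the device's peak bandwidth shows how far the kernel is from peak performance. The closer it is to the peak, the better the memory system is being used. If the gap is large, consider changing the memory access pattern to improve bandwidth utilization.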
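
To make the bandwidth calculation concrete, here is a small sketch of the arithmetic. The helper names, the vector-add workload, the 2 ms kernel time, and the 176 GB/s peak figure are all hypothetical; the only facts assumed about OpenCL are that event profiling timestamps are reported in nanoseconds.

```python
def elapsed_seconds(start_ns, end_ns):
    """Convert a pair of nanosecond timestamps (e.g. profiling start/end) to seconds."""
    return (end_ns - start_ns) * 1e-9


def effective_bandwidth_gbps(bytes_read, bytes_written, seconds):
    """Effective bandwidth EB = (Br + Bw) / T, reported in GB/s (1e9 bytes per second)."""
    if seconds <= 0:
        raise ValueError("kernel time must be positive")
    return (bytes_read + bytes_written) / seconds / 1e9


# Hypothetical vector add c = a + b over 2^24 floats: two reads and one write per element.
n = 1 << 24
t = elapsed_seconds(0, 2_000_000)            # assumed 2 ms kernel time
eb = effective_bandwidth_gbps(2 * n * 4, n * 4, t)
peak = 176.0                                 # assumed device peak in GB/s
print(f"EB = {eb:.1f} GB/s, {100 * eb / peak:.0f}% of the assumed peak")
```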

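Local memory: a software-managed cache

Compared with a hardware-controlled cache, local memory (scratchpad memory) has several advantages: it takes less on-chip area, is more energy efficient, and delivers higher performance for a given area. More importantly, it is an effective, low-latency way for the work-items of one work-group to exchange data.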

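Reading notes on Chapter 6 of 《OpenCL异构计算》

This section is the first part of Chapter 6. It describes the OpenCL implementation on the AMD Phenom II X6 processor.

The design goal is for OpenCL code to run in a uniform way on AMD CPUs and GPUs.

It covers four topics:

How are OpenCL kernels mapped onto CPU cores, and where does the host code run?
How is barrier synchronization implemented?
How are vector types and their operations implemented?
How are the various kinds of OpenCL device memory mapped onto the CPU memory hierarchy?

On the host side, the OpenCL runtime runs on the x86 CPU just like the operating system and other applications.

On the device side, the OpenCL runtime's queuing mechanism allows OpenCL C code to be compiled and run on the x86 device.

[Figure: architecture of this CPU]

[Figure: mapping of OpenCL onto this processor]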

Note that the host runs on the same cores that represent the OpenCL device’s compute units.

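The whole CPU is treated as a single device, and each core acts as a separate compute unit. The CPU can, however, be split into several devices with device fission (discussed in Chapter 7).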

The OpenCL CPU runtime creates a thread to execute on each core of the CPU as a work pool to process OpenCL kernels as they are generated. These threads are passed work by a core management thread for each queue that has the role of removing the first entry from the queue and setting up work for the worker threads. Any given OpenCL kernel may comprise thousands of workgroups for which arguments must be appropriately prepared, memory allocated, and, if necessary, initialized and work queues generated.

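The above covers how OpenCL kernels are mapped onto the CPU cores. Next comes synchronization.
OpenCL uses barriers and fences for fine-grained synchronization. On a CPU, however, the operating system manages communication between threads, and the overhead of threads interacting with the OS prevents efficient parallel scaling. In addition, running a single work-group across several cores causes cache-sharing problems.

The solution is as follows:
The OpenCL CPU runtime runs each work-group on a single OS thread. That thread runs the work-items of the work-group one after another, and only when all of them have finished does it move on to the next work-group in the same work queue. Work-items in the same work-group therefore never run in parallel. Where possible, multiple OS threads let multiple work-groups execute in parallel.
Barriers nevertheless require the work-items of a work-group to make progress concurrently: every work-item must reach the barrier before any of them can continue. For performance reasons, implementing barriers with OS thread preemption is not an option. The AMD OpenCL runtime instead implements barriers with setjmp and longjmp (AMD's own versions).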
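
The scheme described above (one thread per work-group, with each work-item suspended at a barrier and resumed later) can be imitated in a few lines. This is only an analogy: Python generators stand in for the setjmp/longjmp context switch, `yield` plays the role of `barrier(CLK_LOCAL_MEM_FENCE)`, and all function names are hypothetical. It is not how the AMD runtime is written.

```python
def run_work_group(kernel, local_size, *args):
    """Run every work-item of one work-group on a single thread.

    Each work-item is a generator; `yield` marks a barrier. The loop runs
    all work-items up to the barrier before any of them continues past it.
    """
    active = [kernel(local_id, local_size, *args) for local_id in range(local_size)]
    while active:
        still_running = []
        for item in active:
            try:
                next(item)                 # run this work-item to its next barrier
                still_running.append(item)
            except StopIteration:          # this work-item ran to completion
                pass
        # Either every work-item hit the barrier or none did; a mix means
        # some work-item skipped a barrier that others are waiting at.
        if still_running and len(still_running) != len(active):
            raise RuntimeError("some work-items skipped a barrier")
        active = still_running


def reverse_kernel(local_id, local_size, src, dst, local_mem):
    """Reverse a tile through 'local memory'; the yield is the barrier."""
    local_mem[local_id] = src[local_id]
    yield
    dst[local_id] = local_mem[local_size - 1 - local_id]


src = [1, 2, 3, 4]
dst = [0] * 4
run_work_group(reverse_kernel, 4, src, dst, [None] * 4)
print(dst)
```

Without the barrier, work-item 0 would read `local_mem[3]` before work-item 3 had written it, which is exactly the ordering problem the barrier exists to prevent.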

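[Figure: execution mapping of a kernel containing a barrier on a multicore CPU]

After barriers, the chapter turns to how the OpenCL C vector types and their operations are implemented. This is done mainly with the SSE instruction set extensions: vector values are held in vector registers, and operations on them are compiled to SSE instructions.

Finally, it discusses how OpenCL device memory is mapped onto the CPU caches.

[Figure: mapping of OpenCL device memory onto the CPU cache hierarchy]

To improve cache locality, local memory is allocated as one array per CPU thread and reused by every work-group that thread executes. Work-item private data held in registers is backed up to the work-item's stack in main memory when setjmp is called. This memory is carefully laid out to behave well in the cache, reducing cache contention and hence conflict misses and improving the utilization of the cache hierarchy. In particular, the work item stack data is staggered in memory to reduce the chance of conflicts, and data is maintained in large pages to ensure contiguous mapping to physical memory and to reduce pressure on the CPU’s translation lookaside buffer.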

Wednesday, November 14, 2012

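OpenCL concurrency and execution model

《OpenCL异构计算》

Reading notes on Chapter 5: the OpenCL concurrency and execution model

Outline

Kernels, work-items, work-groups, and the execution domain
OpenCL synchronization: kernels, fences, and barriers
Queuing and global synchronization
Host-side memory model
Device-side memory model

Kernels, work-items, work-groups, and the execution domain

Basic concepts
A kernel instance (thread) is a work-item.
The NDRange defines the parallel execution space.
A work-item defines one sliver of a large parallel execution space.
A work-item runs on one PE (processing element).
A PE can process many work-items over the course of one kernel execution.
A work-group is a fixed-size collection of work-items.
A work-group runs on one CU (compute unit).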

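Synchronization
Question 1: Is synchronization defined between work-items, and how is it done?
Answer: At run time work-items are independent of one another, and OpenCL defines no synchronization between two arbitrary work-items. This keeps the OpenCL execution model scalable. Local synchronization inside a work-group is allowed.

Communication
Question 1: Is communication allowed inside a work-group?
Answer: To some extent. It is allowed but restricted, to improve scalability. (Through the local store?)

Important built-in functions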

uint get_work_dim(): Returns the number of dimensions in use in the dispatch.
size_t get_global_size(uint dimension): Returns the global number of work items in the dimension requested.
size_t get_global_id(uint dimension): Returns the index of the current work item in the global space and in the dimension requested.
size_t get_local_size(uint dimension): Returns the size of workgroups in this dispatch in the requested dimension.
size_t get_local_id(uint dimension): Returns the index of the current work item as an offset from the beginning of the current workgroup.
size_t get_num_groups(uint dimension): Returns the number of workgroups in the specified dimension of the dispatch. This is get_global_size divided by get_local_size.
size_t get_group_id(uint dimension): Returns the index of the current work group. That is, the global index of the first work-item in the workgroup, divided by the workgroup size.
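
The relationships between these built-ins are easy to check with a small host-side enumeration. The sketch below covers one dimension only and assumes a zero global offset and a global size that is a multiple of the work-group size; `work_item_ids` is a hypothetical helper, not an OpenCL function.

```python
def work_item_ids(global_size, local_size):
    """Enumerate (global_id, group_id, local_id) for a 1-D NDRange.

    Assumes zero global offset and global_size divisible by local_size.
    """
    if global_size % local_size != 0:
        raise ValueError("global size must be a multiple of the work-group size")
    num_groups = global_size // local_size      # get_num_groups(0)
    ids = []
    for group_id in range(num_groups):          # get_group_id(0)
        for local_id in range(local_size):      # get_local_id(0)
            global_id = group_id * local_size + local_id   # get_global_id(0)
            ids.append((global_id, group_id, local_id))
    return num_groups, ids


num_groups, ids = work_item_ids(global_size=8, local_size=4)
for global_id, group_id, local_id in ids:
    print(global_id, group_id, local_id)
```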

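OpenCL synchronization: kernels, fences, and barriers

Key points about how synchronization works in OpenCL:

There is no ordering guarantee between a read by one work-item and a write by another.
OpenCL uses a relaxed synchronization model and a relaxed memory consistency model.
To make relaxed consistency usable, OpenCL explicitly defines synchronization points where the programmer knows with certainty what the state of some part of the system is and can rely on that information to obtain expectations of behavior.
On a GPU, stopping a wavefront does not release the resources it occupies. If a GPU let threads be stopped or scheduled at any time, as an operating system does, a wavefront already scheduled on the device could end up waiting for a semaphore held by a wavefront that has not yet been scheduled, and neither could make progress. To avoid this, global synchronization is defined only at kernel execution boundaries.
No mechanism is defined for ordering the execution of two work-items in different work-groups.
Data sharing (communication) is supported mainly between work-items of the same work-group, through local memory.
OpenCL defines a synchronization operation inside a work-group, the barrier function: barrier(CLK_LOCAL_MEM_FENCE);
If a branch in kernel code calls barrier, every work-item in the work-group must take that branch. If any work-item does not execute the branch containing the barrier call, the behavior of the barrier is undefined; on many devices it deadlocks.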

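Task queues and global synchronization

OpenCL memory consistency
Events
Event callbacks
Command barriers and markers

Queues
Question 1: Which commands can be placed in an OpenCL queue?

Kernel execution commands
Memory operation commands
Synchronization commands

Question 2: How are commands enqueued?

Answer: Kernel execution commands and synchronization commands are both enqueued asynchronously.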

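Global synchronization

Key property: completion of a command is guaranteed only at a synchronization point.

Question 1: What are the synchronization points?

A call to clFinish
Waiting for a specific event to complete
Executing a blocking memory operation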

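Memory consistency

Question 1: What do synchronization commands do?

Answer: They guarantee that, at a synchronization point, work is synchronized and memory presents a consistent view. Consistency between commands, and therefore correct communication, is guaranteed between the commands of an in-order queue and between a command that generates an event and a command that waits on it.

Question 2: How is consistency between the host and the device guaranteed?

Answer: Only through blocking operations. The guarantee in Question 1 is maintained by the runtime on memory objects and is not visible through the host API.

Question 3: How is consistency between devices guaranteed?

Answer: Memory objects belong to a context rather than to a device, so the OpenCL runtime must keep them consistent across devices when data is shared and the corresponding events occur.

Data is moved from one device to another such that if a kernel is to be executed on a second device, any results generated on the first will be available when necessary. The completion of an event on the first data structure is the guarantee that the data is OK to move and no separate buffer copy operation is needed. (How should these two sentences be understood?)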

Events

Question 1: Parameters for creating a command queue

cl_command_queue clCreateCommandQueue(cl_context context,
	cl_device_id device,
	cl_command_queue_properties properties,
	cl_int *errcode_ret)

cl_command_queue_properties can be set to:

Command-Queue Properties	Description
`CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE`	Determines whether the commands queued in the command-queue are executed in-order or out-of-order. If set, the commands in the command-queue are executed out-of-order. Otherwise, commands are executed in-order.
`CL_QUEUE_PROFILING_ENABLE`	Enable or disable profiling of commands in the command-queue. If set, the profiling of commands is enabled. Otherwise profiling of commands is disabled. See clGetEventProfilingInfo for more information.

Using events directly is the third way to synchronize with the host. OpenCL builds a task graph from the dependencies between events. The host can also call clWaitForEvents to wait on specific events.
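
The task-graph idea can be illustrated without any OpenCL at all. In the sketch below each command carries an event wait list, and a command becomes runnable only once every event it waits on has completed. This is a plain topological ordering under my own naming (`schedule`, the command names); a real out-of-order queue is free to pick any order that respects the same dependencies.

```python
from collections import deque


def schedule(commands):
    """Return one valid execution order for commands with event wait lists.

    `commands` maps a command name to the names of the commands whose
    events it waits on. All names are hypothetical.
    """
    pending = {name: set(waits) for name, waits in commands.items()}
    ready = deque(sorted(name for name, waits in pending.items() if not waits))
    order = []
    while ready:
        done = ready.popleft()
        order.append(done)
        del pending[done]
        # Completing `done` signals its event to every command waiting on it.
        for name in sorted(pending):
            waits = pending[name]
            if done in waits:
                waits.discard(done)
                if not waits:
                    ready.append(name)
    if pending:
        raise ValueError("cyclic or unsatisfiable event dependencies")
    return order


commands = {
    "write_a": [],
    "write_b": [],
    "kernel": ["write_a", "write_b"],   # waits on both buffer writes
    "read_c": ["kernel"],               # waits on the kernel's event
}
print(schedule(commands))
```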

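Question 2: How are command queues managed across multiple devices?
Answer: Several queues can be mapped to the same device, so that different commands overlap with one another or with host-device transfers. In a system with several devices, each device needs its own command queue. Within one context that contains several devices, a command queue can be created for each device, and events can be used to synchronize commands across them. If each device has its own context, events cannot be used for synchronization; the only options are clFinish or explicitly copying data between objects.

Question 3: How many execution models are there in the OpenCL heterogeneous programming framework?

Pipelined execution across two or more devices

Independent parallel execution on several devices

Question 4: Besides synchronization, what else are events used for?

Answer: They can be queried for the error status or the profiling data of an OpenCL command, through the getInfo functions.

Question 5: What is a user event, and how is it used?

Answer: A user-defined event, created with clCreateUserEvent. It lets enqueued commands depend on some arbitrary other task. A user event can be passed to enqueue functions like any other event, but its execution status is set explicitly by the user.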

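Event callbacks

Question 1: What are event callbacks for?

Answer: A callback function is registered and is triggered when the event reaches a given state. Callbacks can be used to enqueue new commands or to call host functions, and they should be lightweight. Where clSetEventCallback is called matters: it must come after the call that produces the valid event it needs.

Question 2: What should be kept in mind when using event callbacks?

Answer: They must be used with great care.

When callbacks are registered for several execution states of the same event, they are not guaranteed to be called in the order in which the command's state changes.
A callback must be thread-safe, because it is called asynchronously.
The behavior of a callback is undefined if it makes expensive system calls or blocking OpenCL API calls such as clFinish.

The benefit of this event-based mechanism is that application-level behavior and synchronization are handled uniformly across CPUs and GPUs from different vendors, which confines device-specific tuning to the kernels.

Question 3: Is there an alternative to event callbacks?

Answer: A native kernel, which enqueues an ordinary C function.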

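Command barriers and markers

Question 1: Besides events, how else can different command queues be synchronized?

Answer: There is a method similar to synchronization inside a work-group: the barrier. A marker is similar to a barrier but does not block execution. Markers are useful when an event needs to wait for some commands in the queue to complete without affecting the execution of other commands. Another synchronization primitive is waitForEvents, which behaves in the opposite way to a marker: it waits only for specific tasks to complete rather than for all of them.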

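Host-side memory model

Buffer objects
Image objects

OpenCL memory objects are defined on a context, not on a device. Operations on a buffer usually do not need to refer to a specific hardware device; making sure data reaches the right place at the right time is the runtime's job.

Buffer objects
A buffer object is a one-dimensional array in the CPU sense, similar to memory allocated with malloc in a C program. It is contiguous and randomly accessible. OpenCL also defines sub-buffer objects, which divide a single buffer into smaller buffers that may overlap; sub-buffers are read, written, and copied in the same way as buffers. Overlapping sub-buffers, and the combination of a sub-buffer with its parent buffer, can cause aliasing problems.

Image objects
Images are optimized for image data:

The GPU cache hierarchy and data flow structures are designed to optimize access to image-type data.
GPU drivers optimize the data layout to support efficient hardware access, especially for two-dimensional access patterns.
Image accesses support sophisticated data type conversions, so data can be stored in a wide range of compressed formats.

Unlike buffers, the layout of elements within an image is hidden from the developer. The image data structure is opaque not only to the developer but also to kernel code, which can access it only through dedicated access functions.

An image data format consists of a channel order and a channel type, which supports optimization by the runtime and the hardware. On the device, image objects cannot be accessed directly through pointers, and the same image cannot be both read and written in one kernel. This removes the possibility of aliased data and allows safe caching.

Z-order mapping improves the locality of image data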
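
As a rough illustration of why Z-order helps, the sketch below computes a textbook Morton index by interleaving the bits of the x and y coordinates. The actual layout a GPU driver uses is hidden from the developer, so this is only an assumed stand-in; `z_order_index` and the 16-bit coordinate limit are my own choices.

```python
def z_order_index(x, y, bits=16):
    """Interleave the bits of x and y (x in the even positions, y in the odd ones)."""
    index = 0
    for bit in range(bits):
        index |= ((x >> bit) & 1) << (2 * bit)
        index |= ((y >> bit) & 1) << (2 * bit + 1)
    return index


# The four texels of the 2x2 tile with corner (2, 2) land on consecutive indices,
# whereas in a row-major image of width 1024 the two rows would be 1024 texels apart.
tile = sorted(z_order_index(x, y) for y in (2, 3) for x in (2, 3))
print(tile)
```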

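Device-side memory model

The relaxed memory consistency model on the device
Global memory
Local memory
Constant memory
Private memory

The relaxed memory consistency model on the device

Within a work-item, memory operations have a predictable order: any two reads and writes to the same address are not reordered by the hardware or the compiler.
Between work-items of the same work-group, memory consistency is guaranteed only at a barrier.
Between work-groups, memory consistency is not guaranteed until the kernel has finished executing.

To allow some degree of communication between work-items in the same or in different work-groups, OpenCL defines a set of fence operations. Even with these fences, the execution order of work-items is not guaranteed.

read_mem_fence( cl_mem_fence_flags flags )
write_mem_fence( cl_mem_fence_flags flags )
mem_fence( cl_mem_fence_flags flags )

Atomic operations are another way to make memory operations between work-items correct. They guarantee that a read, a read-modify-write, or another operation completes correctly without interference between work-items. Because atomic operations on floating-point values are complex to design, OpenCL currently defines only integer atomics, such as atomic_add.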

Tuesday, November 6, 2012

(Repost) Explanations of OpenCL concepts from the Intel website

Explanations of some important OpenCL concepts:

Computing unit - An OpenCL* device has one or more compute units. A work-group executes on a single compute unit. A compute unit is composed of one or more processing elements and local memory. A compute unit may also include dedicated texture filter units that can be accessed by its processing elements.

Device - A device is a collection of compute units.

A command-queue is used to queue commands to a device. Examples of commands include executing kernels, or reading and writing memory objects. OpenCL* devices typically correspond to a GPU, a multi-core CPU, and other processors such as DSPs and the Cell/B.E. processor.

A kernel is a function declared in a program and executed on an OpenCL* device. A kernel is identified by the __kernel or kernel qualifier applied to any function defined in a program.

Work item - one of a collection of parallel executions of a kernel invoked on a device by a command. A work-item is executed by one or more processing elements as part of a work-group executing on a compute unit. A work-item is distinguished from other executions within the collection by its global ID and local ID.

Work-group - a collection of related work-items that execute on a single compute unit. The work-items in the group execute the same kernel and share local memory and work-group barriers.

Each work-group has the following properties:

Data sharing between work items via local memory
Synchronization between work items via barriers and memory fences
Special work-group level built-in functions, such as async_work_group_copy.

A multi-core CPU or multiple CPUs (in a multi-socket machine) constitute a single OpenCL* device. Separate cores are compute units. Device Fission extension enables you to control compute unit utilization within a compute device. You can find more information in the ‘Device Fission Extension Support’ section of the Intel® SDK for OpenCL* User’s Guide (see Related Documents).

When launching the kernel for execution, the host code defines the grid dimensions, or the global work size. The host code can also define the partitioning to work-groups, or leave it to the implementation. During the execution, the implementation runs a single work item for each point on the grid. It also groups the execution on compute units according to the work-group size.

The order of execution of work items within a work-group, as well as the order of work-groups, is implementation-specific.

Task-Parallel programming model – the OpenCL* programming model that runs a single work-group with a single work item.

Monday, October 29, 2012

NN Course Lecture 3

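Part 1
Learning the weights of a linear neuron
1. First question
The introduction to the lecture first says that a linear neuron is very similar to a perceptron. The difference is that the perceptron learning procedure brings the weights closer to a good set of weights on every iteration, whereas a linear neuron brings the outputs closer to the target outputs.

2. Second question
Why can't the perceptron learning procedure be used for a NN with hidden layers?
The reason is that in a NN with hidden layers the set of good weight vectors is not necessarily convex.
Having given the reason, the lecture draws two conclusions.
And to prove that when they're learning something is improving, we don't use the same kind of proof at all.
First, the proof that learning makes progress no longer follows the perceptron proof.
Second, multi-layer NNs do not use the perceptron learning procedure, and strictly speaking there is no such thing as a multi-layer perceptron: a perceptron is the simple kind of NN that has only an input layer and an output layer.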

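3. Multi-layer NNs need a different way than the perceptron's to make learning progress.

4. Progress is judged by whether the outputs keep getting closer to the target values. This criterion also works for problems whose solution set is not convex, unlike perceptron learning. In perceptron learning, the outputs as a whole may move further from the targets even while the weights get closer to a good set of weights. This leads to a linear neuron with a squared error measure.

5. In electrical engineering a linear neuron is also called a linear filter.

6. The output is a real value, a weighted sum of the inputs.

7. The gap between output and target is measured by the squared distance, called the error. The error is the sum of squared differences over all training cases.

8. The learning goal of a linear neuron is to keep reducing this error.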

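9. Third question
Why not solve the learning problem analytically, by writing one equation per training case and solving the system?
1) Scientific answer: we want a method that real neurons might use. The brain probably does not learn by solving a set of symbolic equations.
2) Engineering answer: we want a general method that also handles multi-layer, non-linear NNs. The analytic approach works only because the problem (1) is linear, so the equations are solvable and easy to solve, and (2) uses the squared distance as the error measure. Iterative methods are usually less efficient but generalize much more easily to multi-layer, non-linear networks.

10. Learning only ensures that the output gets closer to the target. Individual weights may move away from their correct values during some iterations.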

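11. Fourth question
Does the iterative procedure reach the correct solution? There may be no perfect solution. With a small enough learning rate (that is, small enough steps), it can get as close as desired to the best solution.

12. Fifth question
Speed of convergence. Learning can be very slow if two input dimensions are highly correlated. For example, if the portions of ketchup and chips in the input are always the same, it is hard to decide how to divide the price between them.

13. Difference between the online delta rule and the perceptron learning rule
In perceptron learning the weights are updated only when the perceptron makes a mistake. The online delta rule changes the weights on every step (every iteration), which is also why a learning rate is introduced.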
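
The contrast between the two rules is easiest to see side by side. The sketch below is my own minimal version of both updates; the function names are hypothetical, and the cafeteria numbers (portions 2, 5, 3, total price 850, initial guess 50 each, learning rate 1/35) are as I recall them from the lecture, so treat them as illustrative.

```python
def delta_rule_step(w, x, t, lr):
    """Online delta rule: always update, in proportion to the residual (t - y)."""
    y = sum(wi * xi for wi, xi in zip(w, x))
    return [wi + lr * xi * (t - y) for wi, xi in zip(w, x)]


def perceptron_step(w, x, t):
    """Perceptron rule: update only when the binary output is wrong."""
    y = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0
    if y == t:
        return list(w)                      # correct answer: leave the weights alone
    sign = 1 if t == 1 else -1              # add x for a missed 1, subtract for a missed 0
    return [wi + sign * xi for wi, xi in zip(w, x)]


# One delta-rule step on the cafeteria example: weights are per-portion prices.
print(delta_rule_step([50.0, 50.0, 50.0], [2.0, 5.0, 3.0], 850.0, 1.0 / 35.0))
# The perceptron rule changes nothing when the case is already classified correctly.
print(perceptron_step([1.0, 0.0], [1.0, 0.5], 1))
```

Note that the perceptron update has no step-size factor, while the delta rule needs `lr` to keep the always-on updates from overshooting.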

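Part 2
Lecture 3b The error surface for a linear neuron
1. The error surface of a linear neuron lives in a space that extends weight space with one extra dimension for the error.

2. Picture a 3-D space whose horizontal axes are the two weights w1 and w2 and whose vertical axis is E.
Vertical cross-sections of the error surface are parabolas, and horizontal cross-sections are ellipses.

3. Online learning versus batch learning
The simplest batch learning does steepest descent: it moves perpendicular to the contour lines of the error surface.
Online learning follows a zig-zag path whose steps are determined by the constraint lines of the individual training cases (it zig-zags around the direction of steepest descent).

4. Elongated elliptical contours in weight space make learning very slow.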

Part 3
Learning the weights of a logistic output neuron
1. Introduces the logistic output neuron (with the sigmoid as activation function).
2. The delta rule term is x*(t-y).
3. The extra term y(1-y) is the so-called "slope of the logistic".
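
Putting the two pieces together, the gradient for a logistic neuron with squared error E = 1/2 * sum (t - y)^2 is dE/dw_i = -sum x_i * y(1-y) * (t-y): the delta rule term times the slope of the logistic. The following is a small sketch of that formula; the function names and the bias handling are my own choices.

```python
import math


def logistic(z):
    """The logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-z))


def logistic_neuron_gradient(w, b, cases):
    """Gradient of E = 1/2 * sum (t - y)^2 for a logistic output neuron.

    dE/dw_i = -sum over cases of x_i * y * (1 - y) * (t - y)
    """
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for x, t in cases:
        z = b + sum(wi * xi for wi, xi in zip(w, x))
        y = logistic(z)
        common = -(t - y) * y * (1.0 - y)   # residual times the slope of the logistic
        for i, xi in enumerate(x):
            grad_w[i] += xi * common
        grad_b += common
    return grad_w, grad_b


# With zero weights the output is 0.5, so the slope term y(1-y) is 0.25.
print(logistic_neuron_gradient([0.0, 0.0], 0.0, [((1.0, 2.0), 1.0)]))
```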

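Part 4
The backpropagation algorithm
This part explains how to train multi-layer networks.

1. Why are hidden layers needed?
Networks without hidden units are very limited in the input-output mappings they can model.

2. One candidate method: perturbing weights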

3. what's ‘activity’?

Part 5
How to use the derivatives computed by the backpropagation algorithm
1. Converting error derivatives into a learning procedure
Backpropagation is an efficient way to compute the error derivatives dE/dw.
How to use these derivatives, however, involves further decisions. There are two issues. One is optimization: How do we use the error derivatives on individual cases to discover a good set of weights? The other is generalization: How do we ensure that the learned weights work well for cases we did not see during training?

2. The optimization issue has two aspects
(1) How often to update the weights
(2) How much to update

3. For generalization, overfitting must be avoided.
If the model is very flexible it can model the sampling error really well. This is a disaster.

As this shows, overfitting comes up under generalization: an overfitted model also fits the sampling error well, so the model is wrong and fails to generalize.
Ways to avoid overfitting include:
– Weight-decay
– Weight-sharing
– Early stopping
– Model averaging
– Bayesian fitting of neural nets
– Dropout
– Generative pre-training

Friday, October 26, 2012

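NN Course Lecture 2

14 billion = 14 000 000 000

Lecture 2
Part 1
An overview of the main types of neural network.
Feed-forward NNs (FFNNs) are the most commonly used. This part also defines a deep NN: an FFNN with more than one hidden layer.
The main focus is then on RNNs, which are probably still a hot research topic (not yet at the stage of large-scale application).
One statement that is hard to grasp is that RNNs are a very natural way to model sequential data. At first glance it is not obvious what is natural about it.

An RNN is characterized by possible directed cycles in its connection graph: following the connections from some unit can eventually lead back to that unit. This is a clear difference from an FFNN, which can only "look forward".
Next, because RNNs can have very complicated dynamics, they are hard to train. The lecture does not say what these dynamics are, only that finding an effective way to train RNNs is a hot topic. Finally, RNNs are more biologically realistic, which I take to mean closer to how our brains work.
Another statement: Recurrent neural networks with multiple hidden layers are really just a special case of a general recurrent neural network that has some of its hidden to hidden connections missing. As I understand it, an RNN with several hidden layers can be seen as a special case of an RNN with a single hidden layer in which some hidden-to-hidden connections are left out.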

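Then comes the claim that RNNs model sequential data very naturally. Two reasons are given. The first is that an RNN is equivalent to a very deep NN with one hidden layer per time slice, where the hidden layer uses the same weights at every time slice and every time slice receives (new?) input.

The second is that RNNs can remember information in their hidden state for a long time. The problem is that it is very hard to train them to use this potential.

An example of what RNNs can do: Ilya Sutskever trained an RNN to predict the next character in a sequence.

Finally, symmetrically connected networks (SCNNs). They cannot model cycles, but they are easier to analyze than RNNs. They obey an energy function, which makes them more restricted than RNNs in what they can do. An SCNN without hidden units is called a Hopfield net; one with hidden units is called a Boltzmann machine. Boltzmann machines are more powerful than Hopfield nets and less powerful than RNNs, but they have a beautifully simple learning algorithm.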

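Part 2
Perceptrons.

The lecture first compares the standard paradigm of statistical pattern recognition with the standard perceptron model.
Statistical pattern recognition
1. Convert the raw input vector into a feature vector.
2. Learn how to weight each feature to obtain a single scalar value.
3. If this scalar is above (or below) some threshold, decide that the input is an example of a particular class.

The perceptron model closely follows this process: it also learns weights and then uses an activation function to produce a scalar.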
The perceptron learning procedure is still widely used today for tasks with enormous feature vectors that contain many millions of features.
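
For reference, here is a minimal sketch of the perceptron learning procedure on a tiny, linearly separable problem (logical AND). The bias is folded in as a constant extra input; the function name, the epoch limit, and the choice of data are my own assumptions.

```python
def train_perceptron(cases, max_epochs=100):
    """Perceptron learning procedure with the bias folded in as a constant input."""
    w = [0.0] * (len(cases[0][0]) + 1)
    for epoch in range(max_epochs):
        mistakes = 0
        for x, t in cases:
            xb = list(x) + [1.0]                     # append the bias input
            y = 1 if sum(wi * xi for wi, xi in zip(w, xb)) >= 0 else 0
            if y != t:
                mistakes += 1
                sign = 1.0 if t == 1 else -1.0       # add x for a missed 1, subtract for a missed 0
                w = [wi + sign * xi for wi, xi in zip(w, xb)]
        if mistakes == 0:                            # a full pass with no errors: done
            return w, epoch
    raise RuntimeError("did not converge; the data may not be linearly separable")


def predict(w, x):
    """Binary threshold output of the trained perceptron."""
    return 1 if sum(wi * xi for wi, xi in zip(w, list(x) + [1.0])) >= 0 else 0


and_cases = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, epochs = train_perceptron(and_cases)
print(weights, epochs, [predict(weights, x) for x, _ in and_cases])
```

On data that is not linearly separable, such as XOR, the loop never completes an error-free pass and the sketch gives up after `max_epochs`.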

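Part 3
The concept of a hyperplane
Σ_i w_i x_i = 0 defines a hyperplane.

Part 4
Why does the learning work?

The difference between the generously feasible region and the feasible region
Someone posted the following about this on the course forum.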
“
It's not clear to me why the introduction of a generously feasible solution changes the informal proof in Lecture 2d from incomplete to complete. I understand that without a generously feasible solution and a minimum decrement the weight vector may move away from the feasible weight vector(s) during some iterations, but if a solution exists won't it converge eventually? I can see how a minimum decrement reduces oscillation before convergence, but I don't see why it changes the proof from impossible to possible. I'm assuming decrements are not infinitesimally small.
”
One reply to this question is very good.

“
There is no change in the algorithm, so it is not a change in the behavior of the weight vector, just in whether this proof technique works.

The change is from "D starts at 50. If D gets to 10 we win. D either decreases by 1 or increases by 1 at each step." Then you haven't proved D can't oscillate 40-41-40-41 forever, for example.

Using a generously feasible vector changes it to "D starts at 100. If it gets to 10 we win. D decreases by 1 at each step."
”

Part 5
The limitations come from the features that are relied on

still lots to do even if you don't learn features.
e.g. hand define a huge number of features

group invariance theorem

so the tricky part of pattern recognition must be solved by the hand-coded feature detectors, not the learning procedure.

the longer term conclusion is that NNs are only gonna be really powerful if we can learn the feature detectors. It's not enough just to learn weighted sums of feature detectors, we have to learn the feature detectors themselves.

more layers of linear units do not help. It's still linear.
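
The claim that extra linear layers add nothing can be checked directly: two linear layers applied in sequence give the same result as one layer whose weight matrix is the product of the two. The matrices and helper names below are arbitrary illustrations.

```python
def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def apply_layer(m, v):
    """Output of a linear layer with weight matrix m on input vector v."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


w1 = [[1, 2], [0, 1], [3, -1]]      # first linear layer: 2 inputs -> 3 hidden units
w2 = [[1, 0, 2], [-1, 1, 0]]        # second linear layer: 3 hidden units -> 2 outputs
x = [4, -2]

two_layers = apply_layer(w2, apply_layer(w1, x))
one_layer = apply_layer(matmul(w2, w1), x)   # a single layer with weights w2 * w1
print(two_layers, one_layer)
```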

Fixed output non-linearities are not enough.
what we need is multiple layers of adaptive non-linear hidden units. And the problem is how can we train such NN.

Minsky and Papert's "Group Invariance Theorem" says that the part of a Perceptron that learns cannot learn to do this if the transformations form a group.
In the sentence above, "this" refers to pattern recognition tasks in which the patterns undergo transformations.

Summary
For me, the hard part of this lecture is understanding the difference between the "feasible region" and the "generously feasible region", which is needed to see whether the proof in Lecture 2d holds.
