Welcome to the Parallel World: November 2012

Saturday, November 24, 2012

梯度下降 for NN (1)

梯度下降的 6 种变形：

Back propagation
delta-bar-delta (adaptive learning rate)
steepest descent
QuickProp
Gauss-Newton
Levenberg-Marquardt

梯度下降 for NN (2)

find the slope of the error surface
how much to go down the error surface
how the weights should be adjusted after each iteration
when to stop training and what governs the decision
whether the solution is suboptimal
how good the final model is and how to assess its goodness of fit

Friday, November 23, 2012

《群体智能与仿生计算》阅读笔记（1）

偶然从图书馆看到这本书。还没有深入读，先把讲授的主要算法列出：

演化算法 Evolutionary Algorithm
人工免疫算法 Artificial Immune Algorithm
混合遗传算法、遗传局部优化 Memetic Algorithm
粒子群算法 Particle Swarm Optimization
混合蛙跳算法 Shuffled Frog Leaping Algorithm
猫群算法 Cat Swarm Optimization
细菌觅食算法 Bacterial Foraging Optimization
蚁群算法 Ant Colony Algorithm
人工蜂群算法 Artificial Bee Colony Algorithm

Saturday, November 17, 2012

《大规模并行编程实战》 12.2节读书笔记

大型虚拟和物理地址空间
统一的设备地址空间
可配置的缓存和暂存空间
提高原子操作的速度
提高全局存储器的访问速度

Reference

Gelado, I. et al, An asymmetric distributed shared memory model for heterogeneous parallel systems.

《OpenCL异构计算》

第十二章 OpenCL 性能剖分和调试读书笔记

剖分

基于事件的剖分
AMD APP Profiler
AMD APP KernelAnalyzer

调试

gDEBugger
AMD printf 扩展

《OpenCL异构计算》

第十一章 OpenCL 扩展读书笔记

扩展机制概览

OpenCL 定义了三种不同的扩展：

KHR 扩展
EXT 扩展
厂商扩展

设备拆分

EXT 扩展 - 设备拆分提供了一个用于将 OpenCL 设备拆分成为多个子设备的接口。

双精度

cl_khr_fp64

使用以下制导语句开启：

#pragma cl_khr_fp64 : enable

Friday, November 16, 2012

《OpenCL异构计算》

第六章 OpenCL 在 CPU/GPU 平台上的实现

第3节 OpenCL的内存性能读书笔记

Outline：

OpenCL 全局内存
本地内存 —— 软件管理的 cache

OpenCL 全局内存

内存分析的基本方法是判断 kernel 的实际吞吐量，通过计算 kernel 的内存带宽得到。公式如下：

其中：

EB 是有效带宽；

Br 是从全局内存中读取的字节数

Bw 是写入到全局内存的字节数

T 是 kernel 的执行时间，可由调用OpenCL 计时 API 或剖分工具得到。

将测得的 kernel 实际使用带宽，我们就可以和设备的峰值带宽相比较，以此来确定我们与峰值性能的差距。越接近峰值性能，内存系统的利用率就越高。如果与峰值性能差距较大，就需要考虑调整访存模式以提高带宽利用率。

本地内存 —— 软件管理的 cache

相对于硬件控制的 cache，本地内存（scratchpad 内存）有很多优点，比如占用片上空间更少，能效比更高，在给定区域性能更高。同时也是同一个 work-group 内部的 work-item 之间相互交换数据的一个更重要的，保证访存低延迟的有效方法。

《OpenCL异构计算》 chapter 6 读书笔记

这一节是第6章的第一部分。讲述了 OpenCL 在 AMD Phenom II X6 处理器上的实现。

设计目标是使得 OpenCL 代码以统一的方式在 AMD CPU 和 GPU 执行。

主要讲了 4 个方面的内容：

如何 mapping OpenCL kernel 到 CPU core 上面？ host 代码在哪里执行？
如何实现 barrier 同步功能？
如何实现向量类型及其操作？
如何 mapping 各种 OpenCL device memory 到 CPU memory hierarchy？

在主机端，OpenCL 运行时像操作系统和其他应用程序一样在 X86 CPU 上允许。

在设备端，通过 OpenCL 运行时队列机制，使 OpenCL C 代码可以在 X86 设备上编译和运行。

该CPU 体系结构如下图所示：

OpenCL 到该处理器的映射如下图所示：

Note that the host runs on the same cores that represent the OpenCL device’s compute units.

将整个 CPU 看作一个单独的设备。不同的核作为不同的 Compute Unit 。不过可以通过设备拆分技术将 CPU 拆分成多个设备（在第7章讨论）。

The OpenCL CPU runtime creates a thread to execute on each core of the CPU as a work pool to process OpenCL kernels as they are generated. These threads are passed work by a core management thread for each queue that has the role of removing the first entry from the queue and setting up work for the worker threads. Any given OpenCL kernel may comprise thousands of workgroups for which arguments must be appropriately prepared, memory allocated, and, if necessary, initialized and

work queues generated.

以上是讲了 OpenCL kernel 如何映射到 CPU 的各个 core 上面去。接下来讲，如何实现同步功能。
OpenCL 利用 barrier 和 fence 来实现细粒度的同步。但OS负责管理线程通信，线程与操作系统交互带来的开销阻碍高效并行扩展的实现。除此之外，在多个核上运行一个单独的 work-group 会导致 cache-sharing 问题。

针对以上问题的解决方案是：
OpenCL CPU 运行时在一个系统线程上运行一个 work-group 。 OpenCL 线程轮流运行同一个 work-group 内的每一个 work-item ，当这个 work-group 内的所有 work-item 全部运行完成后，再运行同一个工作队列中的下一个 work-group 。因此，同一个 work-group 内的线程是没有并行性的。如果可能的话，多个系统线程将允许多个 work-group 并行执行。
由于 barrier 同步操作的存在，使得同一个 work-group 中的不同 work-item 可以并发执行。出于性能方面的考虑，通过操作系统的线程抢占方式实现 barrier 操作是不行的。在 AMD OpenCL 运行时中，barrier 操作是通过 setjmp 和 longjmp 函数（AMD 自己的版本）实现。

下图显示了一个包含 barrier 操作的 kernel 在多核 CPU 上的执行映射：

讲完如何实现 barrier 同步后，接着讲如何实现 OpenCL C 的各种向量类型及其操作。主要是通过 SSE 指令扩展来实现。向量类型存储在向量寄存器中，对向量的操作编译成 SSE 指令。

最后谈到了如何把 OpenCL Device Memory 映射到 CPU cache 上。下图进行了说明：

为改进 cache 本地化，本地内存区域以每个 CPU 线程一个数组的方式进行分配，并且可被这个线程执行的所有的 work-group 重用。存储在寄存器中的 work-item 的局部数据会在调用 setjmp 函数时被备份到主存的 work-item 栈中。This memory is carefully laid out to behave well in the cache, reducing cache contention and hence conflict misses and improving the utilization of the cache hierarchy. In particular, the work item stack data is staggered in memory to reduce the chance of conflicts, and data is maintained in large pages to ensure contiguous mapping to physical memory and to reduce pressure on the CPU’s translation lookaside buffer。

Wednesday, November 14, 2012

OpenCL的并发和执行模型

《OpenCL异构计算》

第五章 OpenCL的并发与执行模型读书笔记

Outline

kernel, work_item, workgroup 概念, 执行域的概念
OpenCL 同步： kernel, fence 和 barrier
队列与全局同步
主机端内存模型
设备端内存模型

kernel, work_item, workgroup 概念, 执行域的概念

基本概念
kernel 实例（线程）= work_item
NDRange 定义并行执行空间。
一个 work_item 定义 one sliver of a large parallel execution space。
一个 work_item 运行在一个 PE (Processing Element) 上。
一个 PE 可处理多个 work_item （在一个 kernel 执行过程中）。
work-group 是固定大小数目的 work_item 的集合。
一个 work-group 运行在一个 CU (Compute Unit) 上。

同步问题
问题1 work-item 之间是否定义同步？如何同步？
答：在运行中，work_item 之间相互独立，OpenCL 未定义（任意两个）work_item 之间的同步。这样做，是为了保证 OpenCL 执行模型的可扩展性。允许一个 work-group 内部的本地同步。

通信问题
问题1 work-group 内部是否允许通信？
答：一定程度允许。但被限制，以提高可扩展性。Local Store？？？

重要接口函数

uint get_work_dim(): Returns the number of dimensions in use in the dispatch.
uint get_global_size(uint dimension): Returns the global number of work items in the dimension requested.
uint get_global_id(uint dimension): Returns the index of the current work item in the global space and in the dimension requested.
uint get_local_size(uint dimension): Returns the size of workgroups in this dispatch in the requested dimension.
uint get_local_id(uint dimension): Returns the index of the current work item as an offset from the beginning of the current workgroup.
uint get_num_groups(uint dimension): Returns the number of workgroups in the specified dimension of the dispatch. This is get_global_size divided by get_local_size.
uint get_group_id(uint dimension): Returns the index of the current work group. That is, the global index of the first work-item in the workgroup, dividing by the workgroup size.

OpenCL 同步： kernel, fence 和 barrier

OpenCL 同步的执行方式要点如下：

一个 work-item 的读操作和另一个 work-item 的写操作没有任何顺序保证。
OpenCL 采用宽松的同步模型和内存一致性模型。
为了获得松弛一致性，The solution is that OpenCL explicitly defines synchronization points where the programmer knows with certainty what the state of some part of the system is and can rely on that information to obtain expectations of behavior.
在 GPU 上，停止一个 wavefront 的运行不会释放它所占用的资源。因此，在 GPU 上，如果采用操作系统那种可以随时停止或者调度线程执行的方式，则可能出现这样一种状况：一个已经调度在设备上的 wavefront 需要得到另外一个还没有调度到设备上的 wavefront 释放信号量才能开始运行。为避免这种情况，全局同步只定义在 kernel 执行的边界处。
不同 work-group 的两个 work-item，并没有定义方法来确保它们的执行顺序。
数据共享（通信），主要是实现了同一个work-group 的 work-item 之间，通过 local memory 来进行数据共享。
OpenCL 定义了一个 work-group 内部的同步操作。通过 barrier 函数。 barrier(CLK_LOCAL_MEM_FENCE);
在 kernel 代码中，如果某个分支调用了 barrier 来同步，则必须保证同一个 work-group 中的所有 work-item 都会执行该分支。如果有 work-item 未执行包含该 barrier 调用的分支，则 barrier 的行为是不确定的。在很多设备上，会导致死锁。

任务队列与全局同步

OpenCL内存一致性
事件
事件回调
命令 barrier 与 mareker

队列
问题1 能够加入 OpenCL 队列的命令有哪些？

kernel 执行命令
内存操作命令
同步命令

问题2 入队的方式是？

答：kernel 执行命令和同步命令都是异步入队。

全局同步

特点：一个命令的完成只能在同步点得到保证。

问题1 有哪些同步点？

调用 clFinish
等待一个特定事件的完成
执行一个阻塞访存操作

内存一致性

问题1 同步命令的作用是？

答：保证在同步点上， work 的同步和内存一致的视图。这意味着命令之间的一致性，也即通信的正确性在一个有序队列的各命令间或者在产生事件和等待该事件的命令之间是有保证的。

问题2 主机 host 与设备 device 之间的一致性如何保证？

答：只能通过阻塞操作达到目的。问题1中一致性保证是在运行时保持内存对象的一致性，而对于主机端 API 来说是不可见的。

问题3 设备之间的一致性如何保证？

答：内存对象是与上下文相关而不是与设备相关，所以在数据被共享和发生相应的事件时，OpenCL 运行时要保证这些对象在不同设备之间的一致性。

Data is moved from one device to another such that if a kernel is to be executed on a second device, any results generated on the first will be available when necessary. The completion of an event on the first data structure is the guarantee that the data is OK to move and no separate buffer copy operation is

needed. 这2句话如何理解？

事件

问题1 构建命令队列的参数问题

`cl_command_queue clCreateCommandQueue(`	cl_context `context`,
	cl_device_id `device`,
	cl_command_queue_properties `properties`,
	cl_int `*errcode_ret)`

cl_command_queue_properties 设置可以是

Command-Queue Properties Description

`CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE` Determines whether the commands queued in the command-queue are executed in-order or out-of-order. If set, the commands in the command-queue are executed out-of-order. Otherwise, commands are executed in-order.

`CL_QUEUE_PROFILING_ENABLE` Enable or disable profiling of commands in the command-queue. If set, the profiling of commands is enabled. Otherwise profiling of commands is disabled. See clGetEventProfilingInfo for more information.

Command-Queue Properties	Description
`CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE`	Determines whether the commands queued in the command-queue are executed in-order or out-of-order. If set, the commands in the command-queue are executed out-of-order. Otherwise, commands are executed in-order.
`CL_QUEUE_PROFILING_ENABLE`	Enable or disable profiling of commands in the command-queue. If set, the profiling of commands is enabled. Otherwise profiling of commands is disabled. See clGetEventProfilingInfo for more information.

直接使用事件是与主机同步的第三种方法。OpenCL 通过构建一个任务图来指明各事件之间的依赖关系。也可以调用 clWaitForEvents 来指定一个 wait 事件。

问题2 如何管理多个设备上的命令队列
答：可以将多个队列映射到相同的设备上，实现不同命令之间的重叠执行或者命令与 host-device 通信操作之间的重叠执行。在一个有多个设备的系统中，每个设备都需要自己的命令队列。在关联了多个设备的同一个上下文中，可为每个设备创建自己的命令队列。而且可以使用事件来实现命令间的同步。如果多个设备各自有各自自己的上下文，则不能使用事件进行同步。此时，只能同步依靠 clFinish 实现，或是显式的在对象间拷贝数据。

问题3 OpenCL异构编程框架下有多少中执行模型？

两个或多个流水执行

多个独立并行

问题4 除同步外，事件还有哪些用途？

答：可用于查询 OpenCL 命令执行相关的错误状态或者程序剖分数据。通过 getInfo 函数实现。

问题5 用户事件是什么？如何使用？

答：用户自定义的时间。通过 clCreateUserEvent 创建。用于将依赖于其他任意任务的命令入队。用户事件可以像其它事件一样传给入队函数，但其执行状态是用户显式的给出的。

事件回调

问题1 事件回调有啥用？

答：同样是定义回调函数，在事件到达指定状态时触发。事件回调可 用于将新命令入队 或 用于调用主机端函数。回调函数应该是轻量级的。clSetEventCallback 函数的调用位置很重要。必须位于产生其所需的有效事件的函数调用之后。

问题2 事件回调的使用有啥讲究？

答：必须非常小心。

为同一个事件多个执行状态注册的回调函数，在命令状态改变时，不保证这些回调函数按照正确的顺序调用执行。
回调函数是线程安全的，会被异步调用。
当调用开销大的系统调用或阻塞式的 OpenCL API （如 clFinish）时，回调函数的行为是未定义的。

这种使用事件处理的机制的好处是应用级行为和同步操作可在不同厂商提供的 CPU 和 GPU 上用统一的方式进行处理。这样就将设备相关的调优局限于计算核中。

问题3 事件回调之外，还有啥替代方案？

答：可使用 native kernel。将普通 C 函数入队。

命令 barrier 与 marker

问题1 不同命令队列的同步除事件外，还有什么方法？

答：还有一种与 work-group 内部同步类似的方法，就是使用 barrier。marker 与 barrier 类似，但是不阻塞执行。当有事件在等待命令队列中的一些命令完成时，marker 非常有用，但是不会影响到其他命令的执行。还有一个同步原语是 waitForEvents，其行为与 marker 相反。不等待所有任务完成，只等待特定任务完成。

主机端内存模型

buffer 对象
image 对象

OpenCL 内存对象定义在上下文，而不是设备上。通常 buffer 上的操作是没有必要涉及到具体硬件设备的。保证数据在正确的时间到达正确的位置是运行时的工作。

buffer 对象
从 CPU 的意义来看，buffer 对象是一维数组，类似于 C 程序中 malloc 开辟的内存。是连续的，RMA。OpenCL 还定义了 sub-buffer 对象上下文，允许把一个单独的 buffer 切分成许多可以重叠的较小的 buffer，而这些 sub-buffer 的读写复制与 buffer 的操作是相同的。sub-buffer 的重叠，以及他与父buffer 的合并可能会造成别名问题。

imgae 对象
对图像数据进行了优化。体现在：

GPU 的 cache 层次结构以及数据流结构都是为了优化图像数据类型的访问而设计的。
GPU 驱动程序优化了数据布局以提供对硬件的高效访存的支持，特别是使用二维访存模式时。
image 访存支持复杂的数据类型转换以允许数据以广泛的压缩格式进行存储。

不像 buffer，image 不同内存对象之间的位置关系对开发人员来说是不可见的。image 数据结构不仅对开发人员不可见，对 kernel 代码亦完全不可见。只能通过专门的访问函数来存取。

image 数据格式由通道序列和通道类型组成。支持运行时系统和硬件优化。image 对象在设备端不能通过指针进行直接访问，也不能在同一个 kernel 中进行读操作和写操作。消除了锯齿数据的可能性，可以进行安全缓存。

Z序映射方式增加图像数据局部性

设备端内存模型

设备端宽松的内存一致性模型
全局内存
本地内存
常量内存
私有内存

设备端宽松的内存一致性模型

work-item 内部内存操作有可预测的顺序：即任意两个对同一地址的读和写是不会被硬件或者编译器重新排序的。
在同一个 work-group 的不同 work-item 之间，只在 barrier 操作处保证其内存一致性。
在 work-group 之间，在 kernel 执行完成前，不保证内存一致性。

为了使位于同一个或者不同 work-group 中的 work-item 之间开展某种程度的通信，OpenCL 定义了一组 fence 操作。但是即使使用这些 fence 操作也不能保证 work-item 的执行顺序。

read_mem_fence( cl_mem_fence_flags flags )
write_mem_fence( cl_mem_fence_flags flags )
mem_fence( cl_mem_fence_flags flags )

保证不同 work-item 之间内存操作正确的另一个方法是原子操作。可以在不影响其他 work-item 的情况下，保证数据读或读写以及其他操作的正确性。（浮点数原子操作设计复杂）OpenCL 目前仅定义了整数原子操作，如 atomic_add 等。

Tuesday, November 6, 2012

（转）Intel网站上关于OpenCL概念的解释

一些重要OpenCL概念的解释：

Computing unit - An OpenCL* device has one or more compute units. A work-group executes on a single compute unit. A compute unit is composed of one or more processing elements and local memory. A compute unit may also include dedicated texture filter units that can be accessed by its processing elements.

Device - A device is a collection of compute units.

A command-queue is used to queue commands to a device. Examples of commands include executing kernels, or reading and writing memory objects. OpenCL* devices typically correspond to a GPU, a multi-core CPU, and other processors such as DSPs and the Cell/B.E. processor.

A kernel is a function declared in a program and executed on an OpenCL* device. A kernel is identified by the kernel or kernel qualifier applied to any function defined in a program.

Work item - one of a collection of parallel executions of a kernel invoked on a device by a command. A work-item is executed by one or more processing elements as part of a work-group executing on a compute unit. A work-item is distinguished from other executions within the collection by its global ID and local ID.

Work-group - a collection of related work-items that execute on a single compute unit. The work-items in the group execute the same kernel and share local memory and work-group barriers.

Each work-group has the following properties:

Data sharing between work items via local memory
Synchronization between work items via barriers and memory fences
Special work-group level built-in functions, such as work_group_copy.

A multi-core CPU or multiple CPUs (in a multi-socket machine) constitute a single OpenCL* device. Separate cores are compute units. Device Fission extension enables you to control compute unit utilization within a compute device. You can find more information in the ‘Device Fission Extension Support’ section of the Intel® SDK for OpenCL* User’s Guide (see Related Documents).

When launching the kernel for execution, the host code defines the grid dimensions, or the global work size. The host code can also define the partitioning to work-groups, or leave it to the implementation. During the execution, the implementation runs a single work item for each point on the grid. It also groups the execution on compute units according to the work-group size.

The order of execution of work items within a work-group, as well as the order of work-groups, is implementation-specific.

Task-Parallel programming model – the OpenCL* programming model that runs a single work-group with a single work item.

Saturday, November 24, 2012

Friday, November 23, 2012

Saturday, November 17, 2012

《大规模并行编程实战》 12.2节 读书笔记

《OpenCL异构计算》

第十二章 OpenCL 性能剖分和调试 读书笔记

《OpenCL异构计算》

第十一章 OpenCL 扩展 读书笔记

扩展机制概览

设备拆分

双精度

Friday, November 16, 2012

《OpenCL异构计算》

第六章 OpenCL 在 CPU/GPU 平台上的实现

第3节 OpenCL的内存性能 读书笔记

OpenCL 全局内存

本地内存 —— 软件管理的 cache

OpenCL在CPU上的实现

《OpenCL异构计算》 chapter 6 读书笔记

Wednesday, November 14, 2012

《OpenCL异构计算》

第五章 OpenCL的并发与执行模型 读书笔记

kernel, work_item, workgroup 概念, 执行域的概念

OpenCL 同步： kernel, fence 和 barrier

任务队列与全局同步

任务队列与全局同步

cl_command_queue_properties 设置可以是

主机端内存模型

设备端内存模型

Tuesday, November 6, 2012

一些重要OpenCL概念的解释：

《大规模并行编程实战》 12.2节读书笔记

第十二章 OpenCL 性能剖分和调试读书笔记

第十一章 OpenCL 扩展读书笔记

第3节 OpenCL的内存性能读书笔记

第五章 OpenCL的并发与执行模型读书笔记