Ring Buffer(3)

架构设计:生产者/消费者模式[4]:双缓冲区

“双缓冲区”是一个应用很广的手法。该手法用得最多的地方想必是屏幕绘制相关的领域(主要是为了减少屏幕闪烁)。另外,在设备驱动和工控方面,双缓冲也经常被使用。不过今天要聊的,并不是针对上述的某个具体领域,而是侧重于并发方面的同步/互斥开销。另外提醒一下,双缓冲方式和前面提到的队列缓冲、环形缓冲是可以结合使用滴。★为啥要双缓冲区
记得前几天在介绍队列缓冲区时,提及了普通队列缓冲区的两个性能问题:“内存分配的开销”和“同步/互斥的开销”(健忘的同学,先回去看看那个帖子复习一下)。“内存分配的开销”已经在介绍环形缓冲区的时候解决了,而今天要介绍的双缓冲区,就是冲着同步/互斥的开销来的。
为了防止有人给咱扣上“过度设计”的大帽子,又得来一个事先声明:只有当同步或互斥的开销非常明显的时候,你才应该考虑双缓冲区的使用。否则的话,大伙儿还是老老实实用最基本、最简单的队列缓冲区吧。

双缓冲区的原理
前面说了一通废话,现在开始切入正题,说说具体实现。
所谓“双缓冲区”,故名思义就是要有俩缓冲区(简称A和B)。这俩缓冲区,总是一个用于生产者,另一个用于消费者。当俩缓冲区都操作完,再进行一次切换(先前被生产者写入的转为消费者读出,先前消费者读取的转为生产者写入)。由于生产者和消费者不会同时操作同一个缓冲区(不发生冲突),所以就不需要在读写每一个数据单元的时候都进行同步/互斥操作。顺便提一下,这又一次展现了空间换时间的优化思路。
但是光有俩缓冲区还不够。为了真正做到“不冲突”,还得再搞两个互斥锁(简称La和Lb),分别对应俩缓冲区。生产者或消费者如果要操作某个缓冲区,必须先拥有对应的互斥锁。补充一句:要达到“不冲突”的效果,其实可以有多种搞法,今天只是挑一个简单的来聊。

双缓冲区的几种状态
为了加深某些同学的理解,再描述一下双缓冲区的几种状态。
◇俩缓冲区都在使用的状态(并发读写)
大多数情况下,生产者和消费者都处于并发读写状态。不妨设生产者写入A,消费者读取B。在这种状态下,生产者拥有锁La;同样的,消费者拥有锁Lb。由于俩缓冲区都是处于独占状态,因此每次读写缓冲区中的元素(数据单元)都不需要再进行加锁、解锁操作。这是节约开销的主要来源。
◇单个缓冲区空闲的状态
由于两个并发实体的速度会有差异,必然会出现一个缓冲区已经操作完,而另一个尚未操作完。不妨假设生产者快于消费者。
在这种情况下,当生产者把A写满的时候,生产者要先释放La(表示它已经不再操作A),然后尝试获取Lb。由于B还没有被读空,Lb还被消费者持有,所以生产者进入发呆(Suspend)状态。
◇缓冲区的切换
接着上面的话题。
过了若干时间,消费者终于把B读完。这时候,消费者也要先释放Lb,然后尝试获取La。由于La刚才已经被生产者释放,所以消费者能立即拥有La并开始读取A的数据。而由于Lb被消费者释放,所以刚才发呆的生产者会缓过神来(Resume)并拥有Lb,然后生产者继续往B写入数据。
经过上述几个步骤,俩缓冲区完成了对调,变为:生产者写入B,消费者读取A。

可能的并发问题
本来单个缓冲区的生产者/消费者问题就已经是教科书的经典问题了,现在搞出俩缓冲区,所以就更加耗费脑细胞了。一不小心,就会搞出些并发的Bug,而且并发的Bug还很难调试和测试(这也就是为啥不要轻易使用该玩意儿的原因)。
◇死锁的问题
假如把前面介绍的操作步骤调换一下顺序:生产者或消费者在操作完当前的缓冲区之后,先去获取另一个缓冲区的锁,再来释放当前缓冲区的锁。那会咋样捏?
一旦两个并发实体同时处理完各自缓冲区,然后同时去获取对方拥有的锁,那就会出现典型的死锁(死锁的详细解释参见“这里”)场景。它俩从此陷入万劫不复的境地。

应用场景
介绍完并发问题,按照本系列的惯例,最后再来介绍一下双缓冲区在某些场合的应用。
◇用于并发线程
在线程方式下,首先要考虑的是缓冲区的类型:到底用队列方式还是环形方式。这方面的选择依据在介绍环形缓冲区的时候已经阐述过了,此处不再啰嗦(省去不少口水)。
另一个需要注意的是,某些编程语言或者程序库提供了的线程安全的缓冲区(比如JDK 1.5引入的ArrayBlockingQueue)。由于这种缓冲区会自动为每次的读写进行同步/互斥,所以就把双缓冲的优势抵消掉了。因此,大伙儿在进行缓冲区选型的时候要避开这类缓冲区。
◇用于并发进程
在进程间使用双缓冲,先得考察不同IPC类型的特点。由于今天讨论双缓冲的目的是降低同步/互斥的开销,对于那些已经封装了同步/互斥的IPC类型,就没太大必要再去搞双缓冲了(单凭这条就足以让好多种IPC出局)。剩下的IPC类型中,比较适合用双缓冲的主要是:共享内存和文件。非常凑巧,这两个玩意儿的特点和适用范围在环形缓冲区的帖子里面也已经介绍过了

发表在 Research results | 发表评论

Ring Buffer(2)

架构设计:生产者/消费者模式[3]:环形缓冲区

前一个帖子提及了队列缓冲区可能存在的性能问题及解决方法:环形缓冲区。今天就专门来描述一下这个话题。
为了防止有人给咱扣上“过度设计”的大帽子,事先声明一下:只有当存储空间的分配/释放非常频繁并且确实产生了明显的影响,你才应该考虑环形缓冲区的使用。否则的话,还是老老实实用最基本、最简单的队列缓冲区吧。还有一点需要说明一下:本文所提及的“存储空间”,不仅包括内存,还可能包括诸如硬盘之类的存储介质。★环形缓冲区 vs 队列缓冲区
◇外部接口相似
在介绍环形缓冲区之前,咱们先来回顾一下普通的队列。普通的队列有一个写入端和一个读出端。队列为空的时候,读出端无法读取数据;当队列满(达到最大尺寸)时,写入端无法写入数据。
对于使用者来讲,环形缓冲区和队列缓冲区是一样的。它也有一个写入端(用于push)和一个读出端(用于pop),也有缓冲区“满”和“空”的状态。所以,从队列缓冲区切换到环形缓冲区,对于使用者来说能比较平滑地过渡。
◇内部结构迥异
虽然两者的对外接口差不多,但是内部结构和运作机制有很大差别。队列的内部结构此处就不多啰嗦了。重点介绍一下环形缓冲区的内部结构。
大伙儿可以把环形缓冲区的读出端(以下简称R)和写入端(以下简称W)想象成是两个人在体育场跑道上追逐。当R追上W的时候,就是缓冲区为空;当W追上R的时候(W比R多跑一圈),就是缓冲区满。
为了形象起见,去找来一张图并略作修改,如下:
不见图、请翻墙
从上图可以看出,环形缓冲区所有的push和pop操作都是在一个固定的存储空间内进行。而队列缓冲区在push的时候,可能会分配存储空间用于存储新元素;在pop时,可能会释放废弃元素的存储空间。所以环形方式相比队列方式,少掉了对于缓冲区元素所用存储空间的分配、释放。这是环形缓冲区的一个主要优势。

环形缓冲区的实现
如果你手头已经有现成的环形缓冲区可供使用,并且你对环形缓冲区的内部实现不感兴趣,可以跳过这段。
◇数组方式 vs 链表方式
环形缓冲区的内部实现,即可基于数组(此处的数组,泛指连续存储空间)实现,也可基于链表实现。
数组在物理存储上是一维的连续线性结构,可以在初始化时,把存储空间一次性分配好,这是数组方式的优点。但是要使用数组来模拟环,你必须在逻辑上把数组的头和尾相连。在顺序遍历数组时,对尾部元素(最后一个元素)要作一下特殊处理。访问尾部元素的下一个元素时,要重新回到头部元素(第0个元素)。如下图所示:
不见图、请翻墙
使用链表的方式,正好和数组相反:链表省去了头尾相连的特殊处理。但是链表在初始化的时候比较繁琐,而且在有些场合(比如后面提到的跨进程的IPC)不太方便使用。
◇读写操作
环形缓冲区要维护两个索引,分别对应写入端(W)和读取端(R)。写入(push)的时候,先确保环没满,然后把数据复制到W所对应的元素,最后W指向下一个元素;读取(pop)的时候,先确保环没空,然后返回R对应的元素,最后R指向下一个元素。
◇判断“空”和“满”
上述的操作并不复杂,不过有一个小小的麻烦:空环和满环的时候,R和W都指向同一个位置!这样就无法判断到底是“空”还是“满”。大体上有两种方法可以解决该问题。
办法1:始终保持一个元素不用
当空环的时候,R和W重叠。当W比R跑得快,追到距离R还有一个元素间隔的时候,就认为环已经满。当环内元素占用的存储空间较大的时候,这种办法显得很土(浪费空间)。
办法2:维护额外变量
如果不喜欢上述办法,还可以采用额外的变量来解决。比如可以用一个整数记录当前环中已经保存的元素个数(该整数>=0)。当R和W重叠的时候,通过该变量就可以知道是“空”还是“满”。
◇元素的存储
由于环形缓冲区本身就是要降低存储空间分配的开销,因此缓冲区中元素的类型要选好。尽量存储值类型的数据,而不要存储指针(引用)类型的数据。因为指针类型的数据又会引起存储空间(比如堆内存)的分配和释放,使得环形缓冲区的效果打折扣。

应用场合
刚才介绍了环形缓冲区内部的实现机制。按照前一个帖子的惯例,我们来介绍一下在线程和进程方式下的使用。
如果你所使用的编程语言和开发库中带有现成的、成熟的环形缓冲区,强烈建议使用现成的库,不要重新制造轮子;确实找不到现成的,才考虑自己实现。如果你纯粹是业余时间练练手,那另当别论。
◇用于并发线程
和线程中的队列缓冲区类似,线程中的环形缓冲区也要考虑线程安全的问题。除非你使用的环形缓冲区的库已经帮你实现了线程安全,否则你还是得自己动手搞定。线程方式下的环形缓冲区用得比较多,相关的网上资料也多,下面就大致介绍几个。
对于C++的程序员,强烈推荐使用boost提供的circular_buffer模板,该模板最开始是在boost 1.35版本中引入的。鉴于boost在C++社区中的地位,大伙儿应该可以放心使用该模板。
对于C程序员,可以去看看开源项目circbuf,不过该项目是GPL协议的,不太爽;而且活跃度不太高;而且只有一个开发人员。大伙儿慎用!建议只拿它当参考。
对于C#程序员,可以参考CodeProject上的一个示例
◇用于并发进程
进程间的环形缓冲区,似乎少有现成的库可用。大伙儿只好自己动手、丰衣足食了。
适合进行环形缓冲的IPC类型,常见的有共享内存和文件。在这两种方式上进行环形缓冲,通常都采用数组的方式实现。程序事先分配好一个固定长度的存储空间,然后具体的读写操作、判断“空”和“满”、元素存储等细节就可参照前面所说的来进行。
共享内存方式的性能很好,适用于数据流量很大的场景。但是有些语言(比如Java)对于共享内存不支持。因此,该方式在多语言协同开发的系统中,会有一定的局限性。
而文件方式在编程语言方面支持很好,几乎所有编程语言都支持操作文件。但它可能会受限于磁盘读写(Disk I/O)的性能。所以文件方式不太适合于快速数据传输;但是对于某些“数据单元”很大的场合,文件方式是值得考虑的。
对于进程间的环形缓冲区,同样要考虑好进程间的同步、互斥等问题,限于篇幅,此处就不细说了。

发表在 Research results | 发表评论

Ring buffer(1)

最近在看Xen IPC相关的东西。里面谈到了Xen主要用I/O Ring buffer来进行不同的device driver直接communication。(The Definitive Guide to the Xen Hypervisor, page86)

看到一个blog里的大牛对ring buffer以及buffer相关讲的十分透彻,对理解起了很大帮助。因为只看了ring buffer部分,后面的还没看完,先特意转贴过来    :)

PS: 第一部分讲的比较基础,就没转了,有时间和兴趣的可以去看

http://program-think.blogspot.com/2009/03/producer-consumer-pattern-0-overview.html

http://program-think.blogspot.com/2009/02/multi-process-vs-multi-thread.html

=================================================

经过前面两个帖子的铺垫,今天终于开始聊一些具体的编程技术了。由于不同的缓冲区类型、不同的并发场景对于具体的技术实现有较大的影响。为了深入浅出、便于大伙儿理解,咱们先来介绍最传统、最常见的方式。也就是单个生产者对应单个消费者,当中用队列(FIFO)作缓冲。
关于并发的场景,在之前的帖子“进程还线程?是一个问题!”中,已经专门论述了进程和线程各自的优缺点,两者皆不可偏废。所以,后面对各种缓冲区类型的介绍都会同时提及进程方式和线程方式。

线程方式
先来说一下并发线程中使用队列的例子,以及相关的优缺点。
内存分配的性能
在线程方式下,生产者和消费者各自是一个线程。生产者把数据写入队列头(以下简称push),消费者从队列尾部读出数据(以下简称pop)。当队列为空,消费者就稍息(稍事休息);当队列满(达到最大长度),生产者就稍息。整个流程并不复杂。
那么,上述过程会有什么问题捏?一个主要的问题是关于内存分配的性能开销。对于常见的队列实现:在每次push时,可能涉及到堆内存的分配;在每次pop时,可能涉及堆内存的释放。假如生产者和消费者都很勤快,频繁地push、pop,那内存分配的开销就很可观啦!对于内存分配的开销,用Java的同学可以参见前几天的帖子“Java性能优化[1]”;对于用C/C++的同学,想必对OS底层机制会更清楚,应该知道分配堆内存(new或malloc)会有加锁的开销和用户态/核心态切换的开销。
那该怎么办捏?请听下文分解,关于“生产者/消费者模式[3]:环形缓冲区”。
同步和互斥的性能
另外,由于两个线程共用一个队列,自然就会涉及到线程间诸如同步啊、互斥啊、死锁啊等等劳心费神的事情。好在”操作系统”这门课程对此有详细介绍,学过的同学应该还有点印象吧?对于没学过这门课的同学,也不必难过,网上相关的介绍挺多的(比如”这里“),大伙自己去瞅一瞅。关于这方面的细节,咱今天就不多啰嗦了。
这会儿要细谈的是,同步和互斥的性能开销。在很多场合中,诸如信号量、互斥量等玩意儿的使用也是有不小的开销的(某些情况下,也可能导致用户态/核心态切换)。如果像刚才所说,生产者和消费者都很勤快,那这些开销也不容小觑啊。
这又该咋办捏?请听下文的下文分解,关于“生产者/消费者模式[4]:双缓冲区”。
适用于队列的场合
刚才尽批判了队列的缺点,难道队列方式就一无是处?非也。由于队列是很常见的数据结构,大部分编程语言都内置了队列的支持(具体介绍见”这里“),有些语言甚至提供了线程安全的队列(比如JDK 1.5引入的ArrayBlockingQueue)。因此,开发人员可以捡现成,避免了重新发明轮子。
所以,假如你的数据流量不是很大,采用队列缓冲区的好处还是很明显的:逻辑清晰、代码简单、维护方便。比较符合KISS原则。

进程方式
说完了线程的方式,再来介绍基于进程的并发。
跨进程的生产者/消费者模式,非常依赖于具体的进程间通讯(IPC)方式。而IPC的种类名目繁多,不便于挨个列举(毕竟口水有限)。因此咱们挑选几种跨平台、且编程语言支持较多的IPC方式来说事儿。
匿名管道
感觉管道是最像队列的IPC类型。生产者进程在管道的写端放入数据;消费者进程在管道的读端取出数据。整个的效果和线程中使用队列非常类似,区别在于使用管道就无需操心线程安全、内存分配等琐事(操作系统暗中都帮你搞定了)。
管道又分命名管道匿名管道两种,今天主要聊匿名管道。因为命名管道在不同的操作系统下差异较大(比如Win32和POSIX,在命名管道的API接口和功能实现上都有较大差异;有些平台不支持命名管道,比如Windows CE)。除了操作系统的问题,对于有些编程语言(比如Java)来说,命名管道是无法使用的。所以俺一般不推荐使用这玩意儿。
其实匿名管道在不同平台上的API接口,也是有差异的(比如Win32的CreatePipe和POSIX的pipe,用法就很不一样)。但是我们可以仅使用标准输入和标准输出(以下简称stdio)来进行数据的流入流出。然后利用shell的管道符把生产者进程和消费者进程关联起来(没听说过这种手法的同学,可以看”这里“)。实际上,很多操作系统(尤其是POSIX风格的)自带的命令都充分利用了这个特性来实现数据的传输(比如more、grep等)。
这么干有如下几个好处:
1、基本上所有操作系统都支持在shell方式下使用管道符。因此很容易实现跨平台。
2、大部分编程语言都能够操作stdio,因此跨编程语言也就容易实现。
3、刚才已经提到,管道方式省却了线程安全方面的琐事。有利于降低开发、调试成本。
当然,这种方式也有自身的缺点:
1、生产者进程和消费者进程必须得在同一台主机上,无法跨机器通讯。这个缺点比较明显。
2、在一对一的情况下,这种方式挺合用。但如果要扩展到一对多或者多对一,那就有点棘手了。所以这种方式的扩展性要打个折扣。假如今后要考虑类似的扩展,这个缺点就比较明显。
3、由于管道是shell创建的,对于两边的进程不可见(程序看到的只是stdio)。在某些情况下,导致程序不便于对管道进行操纵(比如调整管道缓冲区尺寸)。这个缺点不太明显。
4、最后,这种方式只能单向传数据。好在大多数情况下,消费者进程不需要传数据给生产者进程。万一你确实需要信息反馈(从消费者到生产者),那就费劲了。可能得考虑换种IPC方式。
顺便补充几个注意事项,大伙儿留意一下:
1、对stdio进行读写操作是以阻塞方式进行。比如管道中没有数据,消费者进程的读操作就会一直停在哪儿,直到管道中重新有数据。
2、由于stdio内部带有自己的缓冲区(这缓冲区和管道缓冲区是两码事),有时会导致一些不太爽的现象(比如生产者进程输出了数据,但消费者进程没有立即读到)。具体的细节,大伙儿可以看”这里“。

SOCKET(TCP方式)
基于TCP方式的SOCKET通讯是又一个类似于队列的IPC方式。它同样保证了数据的顺序到达;同样有缓冲的机制。而且这玩意儿也是跨平台和跨语言的,和刚才介绍的shell管道符方式类似。
SOCKET相比shell管道符的方式,有啥优点捏?请看:
1、SOCKET方式可以跨机器(便于实现分布式)。这是主要优点。
2、SOCKET方式便于将来扩展成为多对一或者一对多。这也是主要优点。
3、SOCKET可以设置阻塞和非阻塞方法,用起来比较灵活。这是次要优点。
4、SOCKET支持双向通讯,有利于消费者反馈信息。
当然有利就有弊。相对于上述shell管道的方式,使用SOCKET在编程上会更复杂一些。好在前人已经做了大量的工作,搞出很多SOCKET通讯库和框架给大伙儿用(比如C++的ACE库、Python的Twisted)。借助于这些第三方的库和框架,SOCKET方式用起来还是比较爽的。由于具体的网络通讯库该怎么用不是本系列的重点,此处就不细说了。
虽然TCP在很多方面比UDP可靠,但鉴于跨机器通讯先天的不可预料性(比如网线可能被某傻X给拔错了,网络的忙闲波动可能很大),在程序设计上我们还是要多留一手。具体该如何做捏?可以在生产者进程和消费者进程内部各自再引入基于线程的”生产者/消费者模式”。这话听着像绕口令,为了便于理解,画张图给大伙儿瞅一瞅。

不见图、请翻墙
这么做的关键点在于把代码分为两部分:生产线程和消费线程属于和业务逻辑相关的代码(但和通讯逻辑无关);发送线程和接收线程属于通讯相关的代码(但和业务逻辑无关)。
这样的好处是很明显的,具体如下:
1、能够应对暂时性的网络故障。并且在网络故障解除后,能够继续工作。
2、网络故障的应对处理方式(比如断开后的尝试重连),只影响发送和接收线程,不会影响生产线程和消费线程(业务逻辑部分)。
3、具体的SOCKET方式(阻塞和非阻塞)只影响发送和接收线程,不影响生产线程和消费线程(业务逻辑部分)。
4、不依赖TCP自身的发送缓冲区和接收缓冲区。(默认的TCP缓冲区的大小可能无法满足实际要求)
5、业务逻辑的变化(比如业务需求变更)不影响发送线程和接收线程。
针对上述的最后一条,再多啰嗦几句。如果整个业务系统中有多个进程是采用上述的模式,那或许可以重构一把:在业务逻辑代码和通讯逻辑代码之间切一刀,把业务逻辑无关的部分封装成一个通讯中间件(说中间件显得比较牛X :-)。如果大伙儿对这玩意儿有兴趣,以后专门开个帖子聊。

发表在 Research results | 发表评论

一些开发总结。

把xen分组工作搞完,记录下这种对原有的开源系统进行改造的一些心得和想法:

如果要增加或改进一个现有系统中的一个机制。你一定要彻底熟悉他现有那套机制的整个流程。可以先从整个系统的大概基本工作原理和机制开始着手,然后从中整理出你需要用的那些部位。另外,熟悉了解的过程一定要通过源代码+各种文档资料一起。这样效率才高,2者脱离任何一个都会浪费很多不必要的时间。

关于你要改进或添加的那个部分里面是哪些代码实现的,他们之间关系,以及相应引发的其他机制。要画流程图或者写清开发步骤文档,做到心中有数。然后再顺着uml图或者开发文档开始代码实现。这些话看起来这些很简单。可自己真真正正弄懂xen的大概机制就弄了好长时间。期间也很迷茫因为对于这种现有系统里面的代码机制修改也不知该从何开始导致瞎折腾了很久,走了不少弯路,也痛苦了很久。

还有,对一些编程基础知识,还是看中文资料理解的快和记的牢。但是涉及到一定理解程度上的知识还是得耐心看英文文档。并且要学会做笔记,抓关键词,要能用自己的话能复述和总结。

暂时就这些,有了这些经验后我想有机会和时间的话我要接下来吧eucalyptus的high available分布式服务器实现了。也算是一个给自己的交差吧。

发表在 Thought | 发表评论

libvirt API

Hypervisor Connections

A connection is the primary or top level object in the libvirt API.

The connection is represented with the virConnectPtr object and identified by a URI.

Guest domains

the connection object provides APIs to enumerate the guest domains, create new guest domains and manage existing domains. A guest domain is represented with virDomainPtr object and has a number of unique identifiers(URI).

Libvirt uses Uniform Resource Identifiers (URIs) to identify hypervisor connections. Both local and remote hypervisor are addresses by libvirt using URIs. The URI scheme and path defines the hypervisor to connect to, while the host part of the URI determines where it is located.

Connect to a remote Xen hypervisor on host node.example.com using ssh tunneled data transport and ssh username root:

xen+ssh://root@node.example.com/

Driver model

Each implementation is refereed to as a driver.

When obtaining a instance of the virConnectPtr(which is connecting to hypervisor)object, the application developer can provide a URI to determine which hypervisor driver is activated.

Hypervisor drivers

*Xen: a single system driver runs in the Dom0 host talking directly to a combination of the hypervisor, xenstored and xend.

Example local URI scheme xen:///

* Remote: to secure RPC service talking to a libvirtd daemon.

Encryption and authentication using a choice of TLS, x509 certificates, SASL(GSSAPI/Kerberos) and SSH tunneling. URIs follow the scheme of the desired driver, but with a hostname filled in, and a data transport name appended to the URI scheme.

Example URI to talk to Xen over a TLS channel xen+tsls://somehostname/

  • virConnectPtr: represent a connection to an hypervisor. 
  • virDomainPtr: represent one domain either active or defined (i.e. Existing as permanent config file and storage but not currently running on that node).
  • The function virConnectListDomains allows to list all the IDs for the domains active on this hypervisor.  
  • virNetWorkPtr: represent one network either active or defined.
    The function virConnectListNetworks allows to list all the virtualization networks actived on this node.
     
  • virStorageVolPtr: represent one storage volume, ususally this is used as a block device available to one of the domains. The function virStorageVolLookupByPath allows to find the object based on its path on the node.

     

  • virStoragePoolPtr: represent a storage pool, i.e. a logical area which can be used to allocate and store storage volumes. The function virStoragePoolLookupByVolume allows to find the storage pool containing a given storage volume.

Libvirt driver

LXC: The native Linux container based virtualization technology, available with Linux kernels since 2.6.25. A single privileged system driver runs in the host talking to the kernel. Example privileged URI scheme lxc:///

Python API bindings

The python binding should be complete and are mostly automatically generated from the formal description of the API in XML.
The bindings are articulated around 2 classes virConnect and virDomain mapping to the C types.
Functions in the C API taking either type as argument then becomes mehods for the classes, their name is just stipped from the virConnect or virDomain(Get) prefix and the first letter gets converted to lower case.
for example the C functions:
int virConnectNumbOfDomains (virConnectPtr conn);
int virDomainSetMaxMemory (virDomainPtr domain, unsigned long memory);
************Become ************
virConn :: numOfDomains(self)
virDomain :: setMaxMemory(self, memory)
virEventRemoveHandle
Security Connection
A connection will be classified secure if it is either encrypted or it is running on a channel which is not vulnerable to eavsdropping.
Event loop integration
the libvirt APIs use a basic request/response architecture that is generally synchronous.
That is, a libvirt application calls a libvirt API(the request) which doesn’t return until the action is complete(the response).
However, a libvirtd server can also generate asynchronous messages and send them to the libvirt application;
a typical usage of these messages is to inform the libvirt client when a domain has undergone a lifecycle change(like shutdown or restart).
The libvirt event loop APIs allow an application to register for these asynchronous events and properly handle them.
The virEventRegisterImpl API registers a set of callbacks that libvirt will call when adding, updating, or removing a handle to watch.
Glossary of Terms
Domain – an instance of an operating system running on a virtualized machine provided by the hypervisor.
Hypervisor – a layer of software allowing virtualization of a node in a set of virtual machines, which may have different configurations to the node itself.
Node – a single physical server. Nodes may be any one of many different type, and are commonly referred to by their primary purpose. Examples are storage nodes, cluster nodes, and database nodes.
Storage Pool – a collection of storage media, such as physical hard drives. A storage pool is sub-divided into smaller containers called Volumes, which may then be allocated to one or more Domains.
Volume – a stroage space, allocated from a Storage Pool. A Volume may be assigned to one or more Domains for use, and are commonly used inside Domains as virtual hard drives.


CGroups

Using CGroups with libvirt

CGroups is a generic mechanism the kernel provides for grouping of processes and applying controls to those groups.

The grouping is done via a virtual file-system called “cgroup”.

Within this file-system, each directory defines a new group.

因此,groups通过创建一个新的子目录去重组成一个任意的嵌套式的层次体系。

Thus groups can be arranged to form an arbitrarily nested hierarchy simply by creating new sub-directories.

Tunables(改变参数) within a cgroup are provide by what the kernel calls ‘controllers, with each controller able to expose one or more tunable or control.

When mounting the cgroups files-system it is possible to indicate what controllers are to be activated.

With each mount point having a different set of(non-overlapping) controllers.

This key idea is that this allows the administrator to construct differing group hierarchies for different sets of controller/tunables.

memory: Memory controller
Allows for setting limits on RAM and swap usage and querying cumulative usage of all processes in the group.
cpuset: CPU set controller
Binding of processes within a group to a set of CPUs and controlling migration between CPUs.
cpuacct: CPU accounting controller
Information about CPU usage for a group of processes.
cpu: CPU schedule controller
Controlling the prioritization of processes in the group.
devices: Devices controller
Access control lists on character and block devices.
freezer: Freezer controller
Pause and resume execution of processes in the group. Think of it as SIGSTOP for the whole group.
net_cls: Network class controller
Control network utilization by associating processes with a ‘tc’ network class.
/libvirt/src/xenxs/xen_xm.c
— Xen XM parsing functions, which basically grab information from config file object. and then turn a config record into a lump of XML describing the domain, suitable for later feeding for virDomainCrateXML
/libvirt/src/xenxs/xen_sxpr.c
— Xen SEXPR(S-Expression rountines, Security model expression routines) parsing functions,
/libvirt/src/util/
— A collection of shared APIs that can be used by any code. this directory is always in the include path for all things built.
authhelper.c: authentication related utility functions
cgroup.c: Tools for managing cgroups.
buf.c: buffers for libvirt
command.c: Child command execution
conf.c: parser for a subset of the Python encoded Xen configuration files
dnsmasq.c && ebtables.c : based on iptables.c
event.c: event loop for monitoring file handles
event_poll.c: event loop for monitoring file handles
hash.c: chained hash tables for domain and domain/connection deallocations
hooks.c: implementation of the synchronous hooks support
interface.c: interface support functions // chgIfaceFlags originated from bridge.c
iohelper.c: Helper program to perform I/O operations on files
json.c: JSON object parsing/formatting
memory.c: safer memory allocation
network.c: network helper APIs for libvirt
sexpr.c : S-Expression routines to communicate with the Xen Daemon

发表在 Research results | 发表评论

Xen Store and Xen Bus

XenStore
XenStore is an information storage space shared between domains. It is meant for configuration and status information rather than large data transfers.
Each domain gets its own path in the store, which is somewhat similar in spirit to procfs. (or the proc files system)
(Under Linux, /proc includes a directory for each running process<include kernel processes> at /proc/PID, containing information about that process)
When values are changed in the store, the appropriate drivers are notified.
XenBus
XenBus provides a bus abstraction for paravirtualized drivers to communicate between domains.
In practice, the bus is used for configuration negotiation, leaving most data transfer to be done via an inter-domain channel composed of a shared page and an event channel.
Bringing Xenbus and Xenstore functionality into the code for driver and tool development.
Using XenStore
1. Reading and Writing data
Xenstore API provides methods for manipulating data in the store that resemble standard C functions:
xenbus_printf(DIR, NODE, FORMAT, ...)
xenbus_scanf(DIR, NODE, FORMAT, ...)
xenbus_rm(DIR, NODE)
xenbus_read(DIR, NODE, &LEN)
These read and write data in the store.
When reading a variable-length string value from a store key, xenbus_read() should be used, which returns a kmalloc’d buffer containing the string. This string should be kfree’d after use.
2. Transactions
To ensure multiple operations on the Xenstore are seen as a single atomic operation. Any time multiple operations must be performed before any changes are seen by watchers, a transaction must be used to encapsulate the changes: For example:
xenbus_transaction_start("mydir");
xenbus_printf("mydir", "command", "%s", "do_something");
xenbus_printf("mydir", "arg", "%s", "14");
xenbus_transaction_end(0);
3. Watches
a “watch” can be placed on a key in the XenStore which causes a callback function to be executed whenever something changes ator below the level where the watch was placed.
This allows drivers or applications to respond immediately to changes in the store.
For example, the ballon driver watches “memory/target” and immediately attempts to balloon the domain’s memory whenever a new target is written to the key:
static struct xenbus_watch xb_watch = {

 .node = "memory",
 .callback = watch_target;
};

ret = register_xenbus_watch(&xb_watch);
 if(IS_ERR(ret)) //IS_ERR()-- 判断指针是否指向错误
 { 
 IPRINTK("Failed to initialize balloon watcher\n");
 } else 
 {
 IPRINTK("Balloon xenbus watcher initialized\n");
 }
Ballon driver:
the balloon driver allocates memory using standarad kernel allocation functions, then gives it back to Xen.
The memory is no longer available to use in the guest at this point(being allocated by balloon driver), and the fact that it was allocated by the balloon driver prevents the rest of the kernel from trying to use it.
When the balloon driver reclaims(收回) memory, it gets it back from Xen and then does a “free”, allowing the kernel memory allocator or give it to other drivers fro normal use.
The balloon driver contains a load of book keeping regarding these operations, certain mechanics to actually make the whole thing work properly and a few other memory operations that are used by Xen drivers.
From the kernel’s perspective, however, that memory is still part of the Xen system and is being used by the ballon drivers.
Interacting with the XenStore
The XenStore interface provides transaction based reads and writes to points in the xenstore hierarchy.
Watches can be set at points in the hierachy and an individual watch will be triggered when anything at or below that point in the hierachy changes.
A watch is registerd with a callback function and a “token”.
The “token” can be a pointer to any piece of data;
The callback function is invoked with the of the changed node and the “token”.
发表在 Research results | 发表评论

Xen-MiniOS Kernel

pyGrub
pyGrub enables Xen to start Linux Dom Us with the kernels that lie in the files-system of the Dom U instead of with a kernel that lies in the files-system of the Dom 0.
pyGrub behaves like Grub and reads the standard Grub menu.list to provide the xen create process with the required parameters and SXP stanzas. (SXP — Simple XML Persistence)
Stub Domains
a stub domain is a helper domain which runs Xen components. It is based on Mini-OS.
It can be use for domain builder, device model, PV-GRUB, etc.
device model stubdom boosts HVM performance by putting ioemu in its own lightweight domain.
Mini-OS Used to be just a small PV kernel serving as a sample of how a PV guest works.
It has been extended up to being able to run the newlib C library and the lwIP stack, thus providing a basic POSIX enviroment, including TCP/IP networking.
This permits to quite easily embed an application in a dedicated Xen domain by just recompiling it against that environment.
Everything gets linked together as a kernel which can then just be started like any PV guest kernel. It is now possible to have the device model and grub running in their own domains.
* Helper domains which run Xen components;
* Based on Mini-OS;
* Domain Builder
* Device Model
* PV-GRUB
… …
Mini-OS Kernel
1. Introduction
Mini-OS is a small OS kernel distributed with the Xen-hyperviosr source.
2. Mini-OS Initialization
Mini-OS boots at the symbol _start in /xen-4.0.1/xen/arch/boot/x86_32.S
it begins by loading SS(stack segment) and ESP(stack pointer) registers with
the address stored at stack_start.
KERNEL_SS is a default segment descriptor provided in the GDT(Global Descriptor Table)by xen. Since the stack grows downwards in x86 processors, ESP points to an address stack +8192 bytes.
The variable stack” is allocated in arch/x86/setup.c as a global array:
/*
 * Just allocate the kernel stack here. SS:ESP is set up to point here
 * in head.S.
 */
char stack [8192]; /* allocated in kernel bss */
The ESI register is then pushed on the stack and start_kernel() is called.
This is main function that sets the ball rolling. It is evident that the ESI register must be pointing to a start_info_t structure made available to the kernel by the domain creator(xm) since start_kernel() accepts it as a parameter.
The definition of this structure is in $XEN_SRC/xen/include/public/xen.h
/*
* Start-of-day memory layout:
* 1. The domain is started within contiguous virtual-memory region.
*
* 2. The contiguous region ends on an aligned 4MB boundary.
*
* 3. This the order of bootstrap elements in the initial virtual region:
* a. relocated kernel image
* b. initial ram disk [mod_start, mod_len]
* c. list of allocated page frames [mfn_list, nr_pages]
* (unless relocated due to XEN_ELFNOTE_INIT_P2M)
* d. start_info_t structure [register ESI (x86)]
* e. bootstrap page tables [pt_base, CR3 (x86)]
* f. bootstrap stack [register ESP (x86)]
*
* 4. Bootstrap elements are packed together, but each is 4kB-aligned.
* 5. The initial ram disk may be omitted.
*
* 6. The list of page frames forms a contiguous ‘pseudo-physical’ memory
* layout for the domain. In particular, the bootstrap virtual-memory
* region is a 1:1 mapping to the first section of the pseudo-physical map.
*
* 7. All bootstrap elements are mapped read-writable for the guest OS. The
* only exception is the bootstrap page table, which is mapped read-only.
*
* 8. There is guaranteed to be at least 512kB padding after the final
* bootstrap element. If necessary, the bootstrap virtual region is
* extended by an extra 4MB to ensure this.
*/
There are 3 different types of address spaces:
* Machine page frames – these are the page frames allocated in the phsycal RAM.
* (Pseudo)Physical page frames – these are contiguous page frames that map to Machine page frames. This is basically virtualized physical RAM for the domU kernel. For example physical page frames 0,1,2,3,4,5,.. map to machine frames 55,23,49,123,32…. (without any specific order).
* domU kernel virtual address space – When the mini-OS kernel is compiled and linked by gcc.
All addresses in the .text section are virtual addresses that need translation to (Pseudo)physical addresses and then to machine addresses.
This translation takes place at run time by initial page tables set up by the domain loader(xm).
The start_info structure provides initial boot time information for the domU kernel.
struct start_info {
/* THE FOLLOWING ARE FILLED IN BOTH ON INITIAL BOOT AND ON RESUME. */
char magic[32]; /* “xen-<version>-<platform>”. */
unsigned long nr_pages; /* Total pages allocated to this domain. */
unsigned long shared_info; /* MACHINE address of shared info struct. */
uint32_t flags; /* SIF_xxx flags. */
xen_pfn_t store_mfn; /* MACHINE page number of shared page. */
uint32_t store_evtchn; /* Event channel for store communication. */
union {
struct {
xen_pfn_t mfn; /* MACHINE page number of console page. */
uint32_t evtchn; /* Event channel for console page. */
} domU; /* Dom U(1,2,3,4,….255) */
struct {
uint32_t info_off; /* Offset of console_info struct. */
uint32_t info_size; /* Size of console_info struct from start.*/
} dom0; /* Dom 0 */
} console;
/* THE FOLLOWING ARE ONLY FILLED IN ON INITIAL BOOT (NOT RESUME) */
unsigned long pt_base; /* VIRTUAL address of page directory. */
unsigned long nr_pt_frames; /* Number of bootstrap p.t. frames. */
unsigned long mfn_list; /* VIRTUAL address of page-frame list. */
unsigned long mod_start; /* VIRTUAL address of pre-loaded module. */
unsigned long mod_len; /* Size (bytes) of pre-loaded module. */
/* Group Number 为了分组而加,设定一个default group */
unsigned int grp_num; /* 每个dom boot都会有一个默认分组信息 */
int8_t cmd_line[MAX_GUEST_CMDLINE];
/* The pfn range here covers both page table and p->m table frames. */
unsigned long first_p2m_pfn;/* 1st pfn forming initial P->M table. */
unsigned long nr_p2m_frames;/* # of pfns forming initial P->M table. */
};


the information in this structure is filled in by the domain loader(xm) when creating a new domain.
The shared_info structure is shared between a DomU kernel and a Dom0 kernel. This structure provides one means of communication between the two:
/*
* Xen/kernel shared data — pointer provided in start_info.
*
* This structure is defined to be both smaller than a page, and the
* only data on the shared page, but may vary in actual size even within
* compatible Xen versions; guests should not rely on the size
* of this structure remaining constant.
*/
struct shared_info {
struct vcpu_info vcpu_info[XEN_LEGACY_MAX_VCPUS];
/*
* A domain can create “event channels” on which it can send and receive
* asynchronous event notifications.
*
* There are three classes of event that are delivered by this mechanism:
*
* 1. Bi-directional inter- and intra-domain connections. Domains must
* arrange out-of-band to set up a connection (usually by allocating
* an unbound ‘listener’ port and avertising that via a storage service
* such as xenstore).
*
* 2. Physical interrupts. A domain with suitable hardware-access
* privileges can bind an event-channel port to a physical interrupt
* source.
* 3. Virtual interrupts (‘events’). A domain can bind an event-channel
* port to a virtual interrupt source, such as the virtual-timer
* device or the emergency console.
*
* Event channels are addressed by a “port index“.
* Each channel is associated with two bits of information:
*
* 1. PENDING — notifies the domain that there is a pending notification
* to be processed. This bit is cleared by the guest.
*
* 2. MASK — if this bit is clear then a 0->1 transition of PENDING
* will cause an asynchronous upcall to be scheduled. This bit is only
* updated by the guest. It is read-only within Xen.
*
* If a channel becomes pending while the channel is masked then the ‘edge’ is lost (i.e., when the channel is unmasked, the guest must manually handle pending notifications as NO upcall will be scheduled by Xen).
*
* To expedite scanning of pending notifications, any 0->1 pending
* transition on an unmasked channel causes a corresponding bit in a
* per-vcpu selector word to be set.
* Each bit in the selector covers a ‘C long’ in the PENDING bitfield array.
*/
unsigned long evtchn_pending[sizeof(unsigned long) * 8];
unsigned long evtchn_mask[sizeof(unsigned long) * 8];
/*
* Wallclock time: updated only by control software. Guests should base
* their gettimeofday() syscall on this wallclock-base value.
*/
uint32_t wc_version; /* Version counter: see vcpu_time_info_t. */
uint32_t wc_sec; /* Secs 00:00:00 UTC, Jan 1, 1970. */
uint32_t wc_nsec; /* Nsecs 00:00:00 UTC, Jan 1, 1970. */
struct arch_shared_info arch;
};


The start_kernel() Function:
Mini-OS kernel/arch/x86/setup.c 开始启动。
this is where Mini-OS sets the ball rolling. It calls a bunch of initialization functions and then sets up 3 kernel threads to run.
The non-preemptive scheduler provided with Mini-OS will then schedule these threads one after another.
The functions called by start_kernel() :
/* 初始化domin 内存信息, 事件机制和vCPU以及相关的Interrupts */
1) arch_init() [arch/x86/setup.c]
* copies the start_info structure into a global area within the kernel image(to start_info_union.start_info)
* stores the physical to machine frame mappings into a global variablephys_to_machine_mapping‘ declared in mm.c
 phys_to_machine_mapping = (unsigned long *)start_info.mfn_list;
 /* mfn_list is part of start_info structure:
 unsigned long mfn_list; /* VIRTUAL address of page-frame list. */
* Maps shared_info page at 0×2000(start of 3rd megabyte of kernel virtual address space – this info is from arch/x86/x86_32.S) this is done using a hypercall – HYPERVISOR_update_va_mapping
HYPERVISOR_update_va_mapping(
 (unsigned long)shared_info, __pte(pa | 7), UVMF_INVLPG)
Only the machine frame number of the shared_info structure is provided by the domain loader.
The domU kernel must map this page into it’s own virtual address space before it can access it’s contents.
A guest OS will wish to modify exactly one PTE rather than a batch, and where PTE is mapped into the current address space. This is crered for by the following:
update_va_mapping(unsigned long va, uint64_t val, unsigned long flags)
update the currently installed PTE(Page Table Entry) that maps virtual address “va” to new value “val”.
As with mmu_update, Xen checks the modification is safe before applying it. The flags determine which kind of TLB(Translation Lookaside Buffer) flush.
* register callback handlers:
 HYPERVISOR_set_callbacks(
 __KERNEL_CS, (unsigned long)hypervisor_callback,
__KERNEL_CS, (unsigned long)failsafe_callback);
a guest OS needs to setup the vCPU it is exectuing on.
This includes installing vectors for the virtual IDT(Interrupt Descrptior Table) so that the guest OS can handle interrupts, page faults, etc.
However the first thing a guest OS must setup is a pair of hypervisor callbacks:
There are the entry points which Xen will use when it wishes to notify the guest OS of an occurrence.
/* 初始化各种处理信号机制 */
2) trap_init() [arch/x86/trap.c]
this function registers a trap hanlder table with Xen using the set_trap_table() hypercall.
void trap_init(void)
{
 HYPERVISOR_set_trap_table(trap_table);
}

The trap table is defined as follows:

/*
* Submit a virtual IDT to the hypervisor. This consists of tuples
 * (interrupt vector, privilege ring, CS:EIP of handler).
 * The 'privilege ring' field specifies the least-privileged ring that
 * can trap to that vector using a software-interrupt instruction (INT).
 */
static trap_info_t trap_table[] = {
 { 0, 0, __KERNEL_CS, (unsigned long)divide_error },
 { 1, 0, __KERNEL_CS, (unsigned long)debug },
 { 3, 3, __KERNEL_CS, (unsigned long)int3 },
 { 4, 3, __KERNEL_CS, (unsigned long)overflow },
 { 5, 3, __KERNEL_CS, (unsigned long)bounds },
 { 6, 0, __KERNEL_CS, (unsigned long)invalid_op },
 { 7, 0, __KERNEL_CS, (unsigned long)device_not_available },
 { 9, 0, __KERNEL_CS, (unsigned long)coprocessor_segment_overrun },
 { 10, 0, __KERNEL_CS, (unsigned long)invalid_TSS },
 { 11, 0, __KERNEL_CS, (unsigned long)segment_not_present },
 { 12, 0, __KERNEL_CS, (unsigned long)stack_segment },
 { 13, 0, __KERNEL_CS, (unsigned long)general_protection },
 { 14, 0, __KERNEL_CS, (unsigned long)page_fault },
 { 15, 0, __KERNEL_CS, (unsigned long)spurious_interrupt_bug },
 { 16, 0, __KERNEL_CS, (unsigned long)coprocessor_error },
 { 17, 0, __KERNEL_CS, (unsigned long)alignment_check },
 { 19, 0, __KERNEL_CS, (unsigned long)simd_coprocessor_error },
 { 0, 0, 0, 0 }
};

These handler entry points(code for these handlers) are mostly defined in arch/x86/x86_32.h

/* 初始化内存管理机制 */
2) trap_init() [arch/x86/trap.c]
This function initializes memory management in mini-OS. Page tables are already setup by the domain loader. The mini-OS kernel’s task is to map the rest of the memory.
http://www.cs.uic.edu/~spopuri/minios.html
发表在 Research results | 发表评论