NUMA Aware Scheduling in Xen

Citrix Xen Sweets (by osde8info)

So, hacking the Xen Open Source hypervisor is what I do for a living (and these are the guys providing me with my monthly paycheck for that: http://www.citrix.com). During the last months, I’ve been concentrating on improving NUMA awareness of the Xen scheduler, and this is an attempt to describe what that is all about…

Background and Motivation

The official Xen blog has already hosted a couple of stories about what is going on in the Xen development community regarding improving Xen NUMA support. Therefore, if you really are interested in some background and motivation, feel free to check them out.

Long story short, they explain how NUMA is becoming more and more common and that, therefore, it is very important to: (1) achieve a good initial placement when creating a new VM; (2) have a solution that is both flexible and effective enough to take advantage of that placement during the whole VM lifetime. The former basically means: <<When starting a new Virtual Machine, which NUMA node should I “associate” it with?>>. The latter is more about: <<How hard should the VM be associated to that NUMA node? Could it, perhaps temporarily, run elsewhere?>>.

NUMA Placement and Scheduling

So, here’s the situation: automatic initial placement has been included in Xen 4.2, inside libxl. This means that, when a VM is created (of course, if that happens through libxl), a set of heuristics decides on which NUMA node its memory has to be allocated, and the vCPUs of the VM are statically pinned to the pCPUs of that node.
On the other hand, NUMA aware scheduling has been under development during the last months, and is going to be included in Xen 4.3. This means that, instead of being statically pinned, the vCPUs of the VM will strongly prefer to run on the pCPUs of the NUMA node, but they can run somewhere else as well… And this is what this status report is all about.
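
To make the difference between static pinning and a "strong preference" a bit more concrete, here is a toy model in Python. It is only a sketch of the idea and not the code of the Xen credit scheduler: the host layout (2 nodes, 4 pCPUs each), the VCpu class and the pick_pcpu() function are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Set

# Hypothetical 2-node host with 4 pCPUs per node (illustrative numbers only).
NODE0: Set[int] = {0, 1, 2, 3}
NODE1: Set[int] = {4, 5, 6, 7}
ALL_PCPUS: Set[int] = NODE0 | NODE1


@dataclass
class VCpu:
    name: str
    hard_affinity: Set[int]  # pCPUs the vCPU is allowed to run on
    node_affinity: Set[int]  # pCPUs of the NUMA node holding the VM's memory


def pick_pcpu(vcpu: VCpu, idle_pcpus: Set[int]) -> Optional[int]:
    """Return an idle pCPU for vcpu, or None if it has to wait."""
    allowed = idle_pcpus & vcpu.hard_affinity
    # Step 1: try to stay on the node where the memory lives.
    preferred = allowed & vcpu.node_affinity
    if preferred:
        return min(preferred)
    # Step 2: fall back to any other pCPU the vCPU may run on.
    if allowed:
        return min(allowed)
    return None


# Static pinning: the vCPU can only ever run on its node.
pinned = VCpu("pinned", hard_affinity=NODE0, node_affinity=NODE0)
# NUMA aware scheduling: the node is a preference, not a constraint.
numa_aware = VCpu("numa-aware", hard_affinity=ALL_PCPUS, node_affinity=NODE0)

if __name__ == "__main__":
    for idle in ({2, 5}, {5, 6}):
        print(sorted(idle), pick_pcpu(pinned, idle), pick_pcpu(numa_aware, idle))
```

When only pCPUs 5 and 6 are idle, the pinned vCPU gets None (it has to wait for a pCPU of its node), while the NUMA aware one runs on pCPU 5: away from its memory, but not sitting idle either.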

NUMA Aware Scheduling Development

The development of this new feature started pretty early in the Xen 4.3 development cycle, and has undergone a couple of major reworks along the way. The very first RFC for it dates back to the Xen 4.2 development cycle, and it showed interesting performance already. However, what was decided at the time was to concentrate only on placement, and leave scheduling for the future. After that, v1, v2 and v3 of a patch series entirely focused on NUMA aware scheduling followed. It has been discussed during XenSummit NA 2012, in a talk about NUMA future development in Xen in general (slides here). While at it, a couple of existing scheduling anomalies of the stock credit scheduler were found and fixed (for instance, the one described here).

Right now, we can say we are almost done. In fact, v3 received positive feedback and is basically what is going to be merged, and so what Xen 4.3 will ship. Actually, there is going to be a v4 (being released on xen-devel right at the same time as this blog post), but it only accommodates very minor changes, and it is 100% functionally equal to v3.

Any Performance Numbers?

Sure thing! Benchmarks similar to the ones already described in the previous blog posts have been performed. More specifically, directly from the cover letter of the v3 of the patch series, here’s what has been done:

I ran the following benchmarks (again):
* SpecJBB is all about throughput, so pinning
  is likely the ideal solution.
* Sysbench-memory is the time it takes for
  writing a fixed amount of memory (and then
  it is the throughput that is measured). What
  we expect is locality to be important, but
  at the same time the potential imbalances
  due to pinning could have a say in it.
* LMBench-proc is the time it takes for a
  process to fork a fixed number of children.
  This is much more about latency than
  throughput, with locality of memory
  accesses playing a smaller role and, again,
  imbalances due to pinning being a potential
  issue.

This all happened on a 2 node host, where 2 to 10 VMs (2 vCPUs and 960 MB of RAM each) were executing the various benchmarks concurrently. Here are the results:

 ----------------------------------------------------
 | SpecJBB2005, throughput (the higher the better)  |
 ----------------------------------------------------
 | #VMs | No affinity |  Pinning  | NUMA scheduling |
 |    2 |  43318.613  | 49715.158 |    49822.545    |
 |    6 |  29587.838  | 33560.944 |    33739.412    |
 |   10 |  19223.962  | 21860.794 |    20089.602    |
 ----------------------------------------------------
 | Sysbench memory, throughput (higher is better)   |
 ----------------------------------------------------
 | #VMs | No affinity |  Pinning  | NUMA scheduling |
 |    2 |  469.37667  | 534.03167 |    555.09500    |
 |    6 |  411.45056  | 437.02333 |    463.53389    |
 |   10 |  292.79400  | 309.63800 |    305.55167    |
 ----------------------------------------------------
 | LMBench proc, latency (the lower the better)     |
 ----------------------------------------------------
 | #VMs | No affinity |  Pinning  | NUMA scheduling |
 ----------------------------------------------------
 |    2 |  788.06613  | 753.78508 |    750.07010    |
 |    6 |  986.44955  | 1076.7447 |    900.21504    |
 |   10 |  1211.2434  | 1371.6014 |    1285.5947    |
 ----------------------------------------------------

Which, reasoning in terms of %-performance increase/decrease, means NUMA aware
scheduling does as follows, as compared to no-affinity at all and to static pinning:

     ----------------------------------
     | SpecJBB2005 (throughput)       |
     ----------------------------------
     | #VMs | No affinity |  Pinning  |
     |    2 |   +13.05%   |  +0.21%   |
     |    6 |   +12.30%   |  +0.53%   |
     |   10 |    +4.31%   |  -8.82%   |
     ----------------------------------
     | Sysbench memory (throughput)   |
     ----------------------------------
     | #VMs | No affinity |  Pinning  |
     |    2 |   +15.44%   |  +3.79%   |
     |    6 |   +11.24%   |  +5.72%   |
     |   10 |    +4.18%   |  -1.34%   |
     ----------------------------------
     | LMBench proc (latency)         |
     | NOTICE: -x.xx% = GOOD here     |
     ----------------------------------
     | #VMs | No affinity |  Pinning  |
     ----------------------------------
     |    2 |    -5.07%   |  -0.50%   |
     |    6 |    -9.58%   | -19.61%   |
     |   10 |    +5.78%   |  -6.69%   |
     ----------------------------------
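
For anyone wanting to redo the math, here is a small Python sketch that derives the percentages from the raw figures. The formula is an assumption inferred from the numbers, not something stated in the cover letter: dividing the difference by the NUMA scheduling figure (rather than by the baseline) is the convention that best matches the published values. The names RESULTS, relative_change() and report() are made up for this example.

```python
# Raw results copied from the tables above, per number of VMs:
# (no affinity, pinning, NUMA scheduling).
RESULTS = {
    "SpecJBB2005": {
        2: (43318.613, 49715.158, 49822.545),
        6: (29587.838, 33560.944, 33739.412),
        10: (19223.962, 21860.794, 20089.602),
    },
    "Sysbench memory": {
        2: (469.37667, 534.03167, 555.09500),
        6: (411.45056, 437.02333, 463.53389),
        10: (292.79400, 309.63800, 305.55167),
    },
    "LMBench proc": {
        2: (788.06613, 753.78508, 750.07010),
        6: (986.44955, 1076.7447, 900.21504),
        10: (1211.2434, 1371.6014, 1285.5947),
    },
}


def relative_change(numa, other):
    """Difference between NUMA scheduling and another setup, in percent.

    Assumption: the difference is expressed relative to the NUMA figure.
    """
    return (numa - other) / numa * 100.0


def report():
    """Return (benchmark, #VMs, % vs no affinity, % vs pinning) rows."""
    rows = []
    for bench, by_vms in RESULTS.items():
        for vms, (no_affinity, pinning, numa) in by_vms.items():
            rows.append((bench, vms,
                         relative_change(numa, no_affinity),
                         relative_change(numa, pinning)))
    return rows


if __name__ == "__main__":
    for bench, vms, vs_no_affinity, vs_pinning in report():
        print(f"{bench:16} {vms:3d} VMs  "
              f"{vs_no_affinity:+7.2f}%  {vs_pinning:+7.2f}%")
```

Keep in mind that for SpecJBB2005 and Sysbench a positive value is an improvement, while for LMBench (a latency) it is the negative values that are good.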

The tables show how, when not in overload (where overload = “more vCPUs than pCPUs”), NUMA scheduling is the absolute best. In fact, not only does it do a lot better than no-pinning on throughput biased benchmarks, as well as a lot better than pinning on latency biased benchmarks (especially with 6 VMs), it also equals or beats both under adverse circumstances (adverse to NUMA scheduling, i.e., it beats/equals pinning in throughput benchmarks, and beats/equals no-affinity on the latency benchmark).

When the system is overloaded, NUMA scheduling scores in the middle, as could have been expected. It must also be noticed that, when it brings benefits, they are not as huge as in the non-overloaded case. However, this only means that there is still room for more optimization, right? In more detail, the current way a pCPU is selected for a vCPU that is waking up couples particularly badly with the new concept of NUMA node affinity. Changing this is not trivial, because it involves rearranging some locks inside the scheduler code, but it is already being worked on.
Anyway, even with what we have right now, we are overloading the test box by 20% here (without counting Dom0 vCPUs!) and still seeing improvements, which is definitely not bad!

What Else Is Going On?

Well, a lot… To the point that it is probably pointless to try to make a list here! I maintain a NUMA roadmap on our Wiki, which I’m trying to keep updated and, more importantly, to honor and fulfill. So, if you are interested in knowing what will come next, go check it out!

Posted in General, Technology, Work, Xen

The Impact of the Meat on Our Plate

Reblogged from A Human Ecologist's View:

FROM Paul McCartney to Lord Stern, more people are promoting the benefits of a meatless society.

Meat production not only contributes to climate change and land degradation but is also a cause of air and water pollution and biodiversity loss. The farming industry accounts for nine per cent of UK total greenhouse gases, half of which come from sheep, cows and goats.

Read more… 750 more words

"An Oxford University study (http://garyhaq.wordpress.com/2011/04/29/the-impact-of-the-meat-on-our-plate/www.foe.co.uk/resource/reports/healthy_planet_eating.pdf) funded by Friends of the Earth showed that more than 45,000 lives a year could be saved if everyone ate meat no more than two or three times a week." Everyone should really think about this!
Posted in General

NVIDIA's PRIME Helpers Are Ready For Linux 3.9

Reblogged from The Linux Site:

Aside from a lot of other exciting DRM driver happenings for the Linux 3.9 kernel, it looks like the DRM "PRIME Helpers" that were conceived by NVIDIA to help them support DMA_BUF in their binary driver will be merged.

NVIDIA can't directly utilize the Linux kernel's DMA_BUF buffer sharing mechanism -- a zero-copy way to share buffers between different kernel drivers whether it be DRM or other sub-systems -- due to 

Read more… 142 more words

At least it looks like things are improving. Perhaps, in the not too distant future, neither all this "external" work nor these hacks will be necessary any longer...
Posted in Linux, Technology

Windows 7 domU on XEN 4 HVM with Debian Squeeze dom0

Reblogged from Marcin's rootprompt blog - /dev/web0:

A couple of not-so-obvious-at-first hints if one day you need to get Windows 7 Professional running as domU on XEN 4 HVM virtualization where your dom0 is Linux Debian Squeeze. And with the assumption that it is to be bridged with eth0/peth0.

Read more… 678 more words

See? Nice and easy! For any other information, check Xen's Wiki at: http://wiki.xen.org/ or Website: http://www.xen.org. For issues related to Xen on Debian, check out the 'Debian' category on the wiki: http://wiki.xen.org/wiki/Debian.
Posted in Debian, Linux, Technology

NVIDIA 310.32 on Kernel 3.7.7-201.fc18

Just very quickly: I wanted to build this version of the NVIDIA proprietary drivers (NVIDIA-Linux-x86_64-310.32.run) for this Fedora kernel: kernel-3.7.7-201.fc18.x86_64.
Continue reading

Posted in Fedora, Linux, Technology

Fedora Optimus

Is this really about Transformers?

It’s still quite hard for me to believe: I’ve got NVIDIA Optimus working on my Fedora 18 laptop!

When I bought this laptop, I had no idea it had this technology on board. To be honest, I didn’t know a thing about Optimus. When it arrived, I discovered that not only did I have no means of using the discrete NVIDIA graphics card, but I’d better find a way to turn it off, or the whole PC would be on fire in a matter of minutes.

Basically, I’m talking about this:

[dario@Abyss ~]$ lspci|grep -i vga
00:02.0 VGA compatible controller: Intel Corporation 2nd Generation Core Processor Family Integrated Graphics Controller (rev 09)
01:00.0 VGA compatible controller: NVIDIA Corporation GF108 [GeForce GT 540M] (rev ff)

And Wikipedia calls it “an optimization technology”. Sure, tell me about it!!
Continue reading

Posted in Fedora, General, Linux, Technology

Local .vimrc

Well, I’m sure it’s something well known, but I just discovered it, so here I am.

Ok, let’s assume you work on projects that would require you to have this in your .vimrc:

set autoindent
set cindent
set shiftwidth=4
set expandtab

But also on other projects that would require this:

set autoindent
set cindent
set shiftwidth=8
set noexpandtab

Continue reading

Posted in Fedora, Linux, Technology

Xen DocDay on Jan 28 2013

Xen Document Day: January 28

Yes, we, in the Xen.org community, do care a lot about code (and project in general) documentation.

That’s why we’re having periodic Xen Document Days!

The first Xen document day of 2013 will be held next Monday, and it is open to everyone who cares about Xen Documentation and wants to improve it.

Everybody is welcome onboard. To jump in, just join the Xen wiki and hang out on the #xendocs IRC channel.
Continue reading

Posted in Events, General, Technology, Xen

Using xen-tools on Fedora

The Xen.org blog has already hosted a very nice post by Ian Jackson, explaining very well how useful xen-tools is for automatically installing Debian (and Debian-derived) VMs. Now, if this all happens on a Debian host, it is nice and easy, as getting xen-tools is just a matter of apt-get install-ing it. But what if your host machine runs something else, for instance, a copy of Fedora? As a matter of fact, starting from Fedora 16, Xen is quite easy to install and use on Fedora, making it interesting to cover this case too.

There is no xen-tools RPM package, thus we need to go the good old way: download the sources, compile and install them. Luckily enough, this is not difficult at all, and this blog post will explain in detail how to achieve it.
Continue reading

Posted in Fedora, Linux, Technology, Xen

No more .rej-s and .orig-s

Why on Earth would one want a blog, if not for writing down that damn command that one always forgets (either entirely, or when it comes to some syntax detail, or parameter ordering, or …)?

So, here we are, the first post of this kind, a.k.a.:

How to remove all those .rej and .orig files, resulting from days of applying and rebasing patches, that are cluttering the output of your grep-s with stale content?

Continue reading

Posted in Linux, Technology