Some new Macintel owners are discovering that their choices for running non-Mac OS X programs on their new Macs are narrower than they thought. There is no Classic environment on OS X86, so you can't run MacOS 9 or any MacOS 9 applications. Also, VirtualPC, the primary solution for running Windows under OS X on PowerPC, doesn't work under Rosetta, Apple's subsystem for running PPC code on x86. (Rosetta, incidentally, is licensed in from Transitive; it's not Apple code. Transitive are doing a deal with Intel to get PowerPC, SPARC and maybe MIPS code running on Itanium, which could be interesting, and might help rescue the sinking Itanic.)
At the moment, if you want to run Windows under OS X86, there are one or two QEMU-based solutions out there, such as OpenOSX.
However, there's an important caveat here. Most people don't understand anything about how virtualisation works, which is fair enough. They might know, though, that running x86 code on a PowerPC will be slow, because they're different chips. This is true.
But running PC emulation/virtualisation software on an x86 chip does not mean that it runs at full native speeds.
Current x86 chips cannot virtualise themselves. They are physically incapable of it. Therefore, running PC virtualisation on a PC, via VMware or VirtualPC, means emulating a PC processor. Just like running x86 on a G5, running x86 on x86 still runs through a software CPU emulator. Not all of it, I grant; some code - "safe" code that runs in unprivileged "user mode" on the x86 chip - can be run natively. However, privileged code, the stuff that forms the kernel of any operating system, can't be run natively and has to be run through an emulator.
Technically, ring 3 (user-mode) code generally runs natively, with a supervisor checking for hardware access and redirecting it, but ring 0 (kernel-mode) code runs through a CPU emulator. It has to - on the classic x86 architecture, ring 0 is the maximum privilege level and can see through any redirection or virtualisation. What VT and Pacifica do is add a new protection level, ring -1, which can control code running in ring 0. Ring -1 would be used to run a "hypervisor" OS, a very small, simple system that just manages VMs in which run the real OSes that you actually use.
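To make that "can see through any redirection" point concrete, here's a minimal C sketch of my own (an illustration, not anyone's shipping code) using one of the classic awkward instructions. SGDT isn't privileged on the classic x86 architecture, so code can read where the real descriptor table lives without causing any trap the virtualiser could intercept; ring 0 code has far more ways than this to observe and alter the real machine state, which is why it has to be pushed through a software emulator or translator rather than run directly.

    /* sgdt_demo.c - illustration only.  SGDT is *not* a privileged
     * instruction on classic x86, so code can read the real GDT base
     * and limit without causing any trap that a virtualiser could
     * intercept.  One small example of why the hardware alone can't
     * virtualise x86.
     *
     * Build with GCC or Clang on an x86 machine:  cc -o sgdt_demo sgdt_demo.c
     */
    #include <stdio.h>
    #include <stdint.h>

    #pragma pack(push, 1)
    struct gdtr {
        uint16_t  limit;   /* size of the GDT, minus one     */
        uintptr_t base;    /* linear address of the real GDT */
    };
    #pragma pack(pop)

    int main(void)
    {
        struct gdtr g = { 0, 0 };

        /* Store the Global Descriptor Table Register to memory.
         * No fault, no trap - the real value just appears here.  */
        __asm__ volatile ("sgdt %0" : "=m" (g));

        printf("GDT base  = %#lx\n", (unsigned long)g.base);
        printf("GDT limit = %#x\n",  (unsigned int)g.limit);
        return 0;
    }

(Much newer CPUs have a feature, UMIP, that makes this particular instruction fault in user mode, but nothing of the sort exists on the chips discussed here.)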
So, x86 emulation on x86 is still slow. Not as slow as on PPC, say - obviously, x86 is a really, really good fit for x86 emulation! But it means that running Windows inside a VM on x86 is still significantly slower than running it on the bare metal.
So if you buy a Macintel and get OpenOSX or something for it, don't expect your copy of XP-in-a-window to run as fast as OS X does. It won't.
How much slowdown there is remains contentious. The virtualisation vendors want to make out that it's insignificant. It's not. I've tried benchmarking it on a casual basis, but it was very hard - for one thing, successive runs differ in their timings, because of smart caching in some of the better virtualisation products. An initial first-run slowdown of anywhere from 30-40% to as bad as 50% or more wasn't unusual. Successive runs improved this by 10-20%.
(A slowdown of 30% means the emulated computer runs at 70% of the speed of the host.)
And, of course, it depends on what you run. Some OSs run entirely in ring 0 (privileged mode); others run relatively little code there. For instance, NT3 was massively better than NT4 in this respect. NT3 ran just the kernel in ring 0; NT4 moved the entire GUI (technically, the GDI) in there too. This improved performance at the expense of system reliability, and was, for me & many other observers, the moment when MS' claims of a newly-engineered clean OS were irrevocably compromised by the marketdroids. NT4 blew a huge hole in NT's good clean design, just to make it look faster.
The slight snag in trying to read up on this is that vanishingly few technical journalists in this day and age understand such things. It was common in the days of Byte but not now. Most have never seen anything but Windows and know little of the technical underpinnings. I have had great fun cross-examining representatives of both Connectix and VMware at press conferences and so on. Connectix responded by flying in Jon Gerber, the chief scientist and founder of the company, from the US, to talk to me.
I've also tried benchmarking it, but it's tricky. Lots of things slow down, though. Sure, your apps will almost all run in ring 3, natively, but then, all apps call the kernel. Every access to the filesystem hits the kernel in ring 0. Every access to the network hits the kernel in ring 0. In NT4 and everything derived from it, every time you move the mouse or the clock in the status bar ticks over, that hits the kernel in ring 0 too, because the pillocks moved the entire GDI - the whole damned GUI - into the kernel. That was around the time the tech world realised that MS were just pissing around at OS design and didn't actually take it seriously.
All kernel-mode code runs in ring 0. Almost all code running on a computer will require kernel activity and thus will be running in a software emulation, not on your actual CPU hardware. Pure user-space CPU-intensive code is relatively rare and isn't generally the sort of thing you run on servers, especially not virtualised ones. If you're doing scientific analysis, say, you don't do it on virtual servers, you do it on a cluster; if you're doing rendering, you use a render farm; and so on.
But if you're trying to benchmark one of these things, a lot of the simpler benchmarking tools are user-space code, and so will yield results that give an entirely false impression of good performance. This is why my benchmarking was casual - I couldn't get reliable enough results to publish, and the results I was getting did not bear out my subjective impression of the performance of the emulated PC, which was much worse than the benchmarks would appear to indicate.
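To give an idea of what I mean, here's a crude sketch of my own of the kind of paired test that does expose the gap: one loop of pure user-space arithmetic, which a VM can run natively, and one loop that makes a genuine system call on every pass, which drags the guest kernel - i.e. the emulated ring 0 path - into it. The iteration counts are arbitrary; run both inside the VM and on the bare metal and compare.

    /* ring3_vs_ring0.c - crude paired microbenchmark sketch.
     * Loop 1: pure user-space arithmetic (a VM can run this natively).
     * Loop 2: one write() system call per pass (every call drops into
     *         the guest kernel, i.e. the emulated ring 0 path).
     * Iteration counts are arbitrary; this is an illustration, not a
     * publishable benchmark.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/time.h>

    static double seconds(void)              /* wall-clock time */
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1e6;
    }

    int main(void)
    {
        /* 1: user-space only - no kernel involvement at all */
        double t0 = seconds();
        volatile unsigned long acc = 0;
        for (unsigned long i = 0; i < 50000000UL; i++)
            acc = acc * 31 + i;
        double user_loop = seconds() - t0;

        /* 2: kernel-bound - a real system call every time round */
        int fd = open("/dev/null", O_WRONLY);
        if (fd < 0) { perror("open"); return EXIT_FAILURE; }

        char byte = 0;
        t0 = seconds();
        for (unsigned long i = 0; i < 500000UL; i++)
            write(fd, &byte, 1);             /* ring 3 -> ring 0 -> ring 3 */
        double kernel_loop = seconds() - t0;
        close(fd);

        printf("user-space loop: %.3f s\n", user_loop);
        printf("syscall loop   : %.3f s\n", kernel_loop);
        return 0;
    }

The user-space loop will look almost as good in the VM as on the host; the syscall loop is where the emulated kernel shows up.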
This debate - over the inefficiency of virtualised x86 code - is why some of us are keenly interested in the question of whether the Yonah CPU in the Macintels has Intel's "Vanderpool" virtualisation technology (VT) integrated and working or not. Proper hardware virtualisation in the CPU is coming, slowly, in the form of Intel's VT (codenamed Vanderpool) and AMD's more capable Pacifica. (Intel VT only virtualises the CPU itself; AMD's as-yet-unreleased virtualisation tech also virtualises I/O devices.)
The word is that the Core Duo (and presumably Core Solo) do have working VT. If they do and your OS X VM solution uses it, then in theory, virtualised code will not be emulated and will run at full speed.
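For what it's worth, the CPU itself will tell you whether it at least claims to have the feature: CPUID leaf 1 reports Intel's VMX (VT) in ECX bit 5, and extended leaf 0x80000001 reports AMD's SVM (Pacifica) in ECX bit 2. Here's a small C sketch of my own using GCC/Clang's cpuid.h header:

    /* vt_check.c - does this x86 CPU advertise hardware virtualisation?
     *   CPUID leaf 1,          ECX bit 5 -> Intel VMX ("Vanderpool" / VT)
     *   CPUID leaf 0x80000001, ECX bit 2 -> AMD SVM   ("Pacifica")
     * The flag only means the silicon has the feature; the firmware may
     * still have it switched off.
     *
     * Build with GCC or Clang on x86:  cc -o vt_check vt_check.c
     */
    #include <stdio.h>
    #include <cpuid.h>

    #define ECX_VMX (1u << 5)   /* CPUID.01H:ECX.VMX       */
    #define ECX_SVM (1u << 2)   /* CPUID.80000001H:ECX.SVM */

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
            printf("Intel VT (VMX): %s\n", (ecx & ECX_VMX) ? "yes" : "no");

        if (__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx))
            printf("AMD-V (SVM)   : %s\n", (ecx & ECX_SVM) ? "yes" : "no");

        return 0;
    }

Bear in mind the flag only says the silicon supports it; the firmware can still have it disabled, and the VM software still has to actually use it. On OS X itself, I believe "sysctl machdep.cpu.features" will list VMX if it's present.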
As it is, all ring 0 code in a VM is running inside a software CPU emulator* - just like in VirtualPC on a PowerPC Mac, for example - and this cripples performance. Sure, modern machines can afford it - they have CPU power to burn - but it's horribly inelegant and I find that extremely distasteful.
If VT is not there, or not working, or the VM software (VPC or whatever) does not use it, virtualised ring 0 code will be emulated, and performance will, relatively speaking, suck.
So it's a question of major interest to anyone who wants to run virtualised Windows under OS X86 and wants half-decent performance!
[Disclaimer: this is a bit patchy and slightly dated, as it's edited together from 3 or more CIX posts from recent weeks. Sorry about that.]
* Except, possibly, Parallels Workstation on an Intel Core CPU. This claims to use Intel VT, I believe.