Next: Getting Started with PTLsim/X
Up: PTLsim/X: Full System SMP/SMT
Previous: PTLsim/X: Full System SMP/SMT
Full system simulation and virtualization have been around since the
dawn of computers. Typically, virtual machine software is used
to run guest operating systems on a physical host system,
such that the guest believes it is running directly on the bare hardware.
Modern full system simulators in the x86 world can be roughly divided
into two groups (this paper does not consider systems for other instruction
sets).
Hypervisors execute most unprivileged instructions on the native
CPU at full speed, but trap privileged instructions used by the operating
system kernel, where they are emulated by hypervisor software so as
to maintain isolation between virtual machines and make the virtual
machine nearly indistinguishable from the real CPU. In some cases
(particularly on x86), additional software techniques are needed to
fully hide the hypervisor from the guest OS.
- Xen [6,7,5,8,9,2]
represents the current state of the art in this field; it will be
described in great detail later on.
- VMware [12] is a very well known commercial product
that allows unmodified x86 operating systems to run inside a virtual
machine. Because the x86 instruction set is not fully virtualizable,
VMware must employ x86-to-x86 binary translation techniques on kernel
code (but not user mode code) to make the virtual CPU indistinguishable
from the real CPU for compatibility reasons. These translations are
typically cached in a hidden portion of the guest address space to
improve performance compared to simply interpreting sensitive x86
instructions. While this approach is sophisticated and effective,
it exacts a heavy performance penalty on I/O intensive workloads [9].
Interestingly, the latest microprocessors from Intel and AMD include
hardware features (Intel VT [15], AMD SVM [16])
to eliminate the binary translation and patching overhead. Xen fully
supports these technologies to allow running Windows and other OSes
at full speed, while VMware has yet to include full support.
VMware comes in two flavors. ESX is a true hypervisor that boots on
the bare hardware underneath the first guest OS. GSX and Workstation
use a userspace frontend process containing all virtual device drivers
and the binary translator, while the vmmon kernel module (open
source in the Linux version) handles memory virtualization and context
switching tasks similar to Xen.
- Several other products, including Virtual PC and Parallels, provide
features similar to VMware using similar technology.
- KVM (Kernel Virtual Machine) is a new hypervisor infrastructure
built into all Linux kernels after 2.6.19. It depends on the hardware
virtualization extensions (Intel VT and AMD SVM) built into modern
x86 chips, whereas Xen and VMware also support running on older processors
without special hardware support. KVM is an attractive foundation
for future virtual machine development since it's built into Linux
(so it requires far less setup work than Xen or VMware) and provides
excellent performance.
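The trap-and-emulate principle shared by the hypervisors above can be
illustrated with a small toy model. This is only a sketch under stated
assumptions: the ToyHypervisor class, the instruction names and the
virtual state fields are hypothetical and are not taken from Xen,
VMware or KVM. Unprivileged instructions are merely counted as having
run natively, while privileged ones are intercepted and applied to a
virtual copy of the CPU state.

```python
# Toy model of trap-and-emulate. The instruction names and the
# ToyHypervisor class are hypothetical, not any real hypervisor's API.

PRIVILEGED = {"write_cr3", "cli", "sti", "out"}


class ToyHypervisor:
    def __init__(self):
        self.native_count = 0      # instructions run directly on the "CPU"
        self.trap_count = 0        # instructions intercepted and emulated
        self.virtual_state = {"cr3": 0, "interrupts_enabled": True}

    def run(self, program):
        for opcode, operand in program:
            if opcode in PRIVILEGED:
                self.trap_count += 1
                self.emulate(opcode, operand)
            else:
                # Unprivileged work runs at full speed; nothing to emulate.
                self.native_count += 1

    def emulate(self, opcode, operand):
        # Update the guest's *virtual* CPU state instead of the real one,
        # which is what keeps guests isolated from each other.
        if opcode == "write_cr3":
            self.virtual_state["cr3"] = operand
        elif opcode == "cli":
            self.virtual_state["interrupts_enabled"] = False
        elif opcode == "sti":
            self.virtual_state["interrupts_enabled"] = True
        # "out" (port I/O) would go to a device model; ignored here.


guest_program = [
    ("add", None),
    ("cli", None),
    ("write_cr3", 0x1000),
    ("mov", None),
    ("sti", None),
]

hv = ToyHypervisor()
hv.run(guest_program)
```

In this model only three of the five instructions cost a trap. The
ratio of trapped to native instructions is what determines hypervisor
overhead in practice.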
Unlike hypervisors, simulators perform cycle accurate execution
of x86 instructions using interpreter software, without running any
guest instructions on the native CPU.
- Bochs [11] is the best known open source x86
simulator; it is considered to be a nearly RTL (register transfer
language) level description of every x86 behavior from legacy 16-bit
features up through modern x86-64 instructions. Bochs is very
useful for the functional validation of real x86 microprocessors,
but it is very slow (around 5-10 MHz equivalent) and is not useful
for implementing cycle accurate models of modern uop-based out of
order x86 processors (for instance, it does not model caches, memory
latency, functional units and so on).
- QEMU [10] is similar in purpose to VMware, but unlike
VMware, it supports multiple CPU host and guest architectures (PowerPC,
SPARC, ARM, etc). QEMU uses binary translation technology similar
to VMware to hide the hypervisor's presence from the guest kernel.
However, due to its cross platform design, both kernel and user code
are passed through x86-to-x86 binary translation (even on x86 platforms)
and stored in a translation cache. Interestingly, Xen uses a substantial
amount of QEMU code to model common hardware devices when running
unmodified operating systems like Windows, but Xen still uses its
own hardware-assisted technology to actually achieve virtualization.
QEMU supports a proprietary hypervisor module to add VMware's and
Xen's ability to run user mode code natively on the CPU to reduce
the performance penalty; hence it is also in the hypervisor category.
- Simics [13] is a commercial simulation suite for
modeling both the functional aspects of various x86 processors (including
vendor specific extensions) as well as user-designed plug-in models
of real hardware devices. It is used extensively in industry for modeling
new hardware and drivers, as well as firmware level debugging. Like
QEMU, Simics uses x86-to-x86 binary translation to instrument code
at a very low level while achieving good performance (though noticeably
slower than a hypervisor provides). Unlike QEMU, Simics is fully extensible
and supports a huge range of real hardware models, but it is not possible
to add cycle accurate simulation features below the x86 instruction
level, making it less useful to microarchitects (because of both technical
considerations and its status as a closed source product).
- SimNow [14] is an AMD simulation tool used during
the design and validation of AMD's x86-64 hardware. Like Simics, it
is a functional simulator only, but it models a variety of AMD-built
hardware devices. SimNow uses x86-to-x86 binary translation technology
similar to Simics and QEMU to achieve good performance. Because SimNow
does not provide cycle accurate timing data, AMD uses its own TSIM
trace-based simulator, derived from the K8 RTL, to do actual validation
and timing studies. SimNow is available for free to the public, albeit
as closed source.
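The translation cache idea mentioned above for QEMU, Simics and SimNow
can be sketched as follows. This is a deliberately simplified
illustration, not how any of these tools is implemented: the
TranslationCache class is hypothetical, the "guest code" is a list of
toy add/mul operations, and "translating" a block just means wrapping
it in a Python closure. The point is only that each block is translated
once, on first execution, and reused afterwards.

```python
# Toy translation cache. All names are hypothetical; real binary
# translators emit host machine code rather than Python closures.

class TranslationCache:
    def __init__(self, guest_blocks):
        self.guest_blocks = guest_blocks   # guest PC -> list of (op, value)
        self.cache = {}                    # guest PC -> translated callable
        self.translations = 0              # how many blocks were translated

    def translate(self, pc):
        ops = self.guest_blocks[pc]
        self.translations += 1

        def translated(acc):
            for op, value in ops:
                if op == "add":
                    acc += value
                elif op == "mul":
                    acc *= value
            return acc

        return translated

    def execute(self, pc, acc):
        block = self.cache.get(pc)
        if block is None:
            # First visit: pay the translation cost and remember the result.
            block = self.translate(pc)
            self.cache[pc] = block
        return block(acc)


blocks = {0x400000: [("add", 2), ("mul", 3)]}
tc = TranslationCache(blocks)
acc = 1
for _ in range(4):
    acc = tc.execute(0x400000, acc)
```

The block at 0x400000 runs four times but is translated only once,
which is why hot loops run much faster under a translator than under a
pure interpreter.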
All of these tools share one common disadvantage: they are unable
to model execution at a level below the granularity of x86 instructions,
making them unsuitable for microarchitects. PTLsim/X seeks to fill
this void by allowing extremely detailed uop-level cycle accurate
simulation of x86 and x86-64 microprocessor cores, while simultaneously
delivering all the performance benefits of true native-mode hypervisors
like Xen, selective binary translation based hypervisors like VMware
and QEMU, and the detailed hardware modeling capabilities of Bochs
and Simics.
Xen [7,6,5,8,9,2]
is an open source x86 virtual machine monitor, also known as a hypervisor.
Each virtual machine is called a ``domain'', where domain 0 is
privileged and accesses all hardware devices using the standard drivers;
it can also create and directly manipulate other domains. Guest domains
typically do not have this access; instead,
they relay requests back to domain 0 using Xen-specific virtual device
drivers. Each guest can have up to 32 VCPUs (virtual CPUs). Xen itself
is loaded into a reserved region of physical memory before loading
a Linux kernel as domain 0; other operating systems can run in guest
domains. Xen is known for having very low overhead thanks to its
design; a normal workstation or server can run under Xen at close to
native performance.
Under Xen's ``paravirtualized'' mode, the guest OS runs on an
architecture nearly identical to x86 or x86-64, but a few small changes
(critical to preserving native performance levels) must be made to
low-level kernel code, similar in scope to adding support for a new
type of system chipset or CPU manufacturer (e.g. instead of an AMD
x86-64 on an nVidia chipset, the kernel would need to support a Xen-extended
x86-64 CPU on a Xen virtual ``chipset''). These changes mostly
concern page tables and the interrupt controller:
- Paging is always enabled, and any physical pages (called ``machine
frame numbers'', or MFNs) used to form a page table must be marked
read-only (a.k.a. ``pinned'') everywhere. Since the processor
can only access a physical page if it's referenced by some page table,
Xen can guarantee memory isolation between domains by forcing the
guest kernel to replace any writes to page table pages with special
mmu_update() hypercalls (a.k.a. system calls into Xen itself).
Xen makes sure each update points to a page owned by the domain before
updating the page table. This approach has essentially zero performance
loss since the guest kernel can read its own page tables without any
further indirections (i.e. the page tables point to the actual physical
addresses), and hypercalls are only needed for batched updates (e.g.
validating a new page table after a fork() requires only a
single hypercall).
- Xen also supports pseudo-physical pages, which are consecutively
numbered from 0 to some maximum (e.g. 65535 for a 256 MB domain with 4 KB pages).
This is required because most kernels (including Linux and Windows)
do not support ``sparse'' (discontiguous) physical memory ranges
very well (remember that every domain can still address every physical
page, including those of other domains - it just can't access all
of them). Xen provides pseudo-to-machine (P2M) and machine-to-pseudo
(M2P) tables to do this mapping. However, the physical page tables
continue to reference physical addresses and are fully visible to the
guest kernel; the pseudo-physical numbering is just a convenience feature.
- Xen can save an entire domain to disk, then restore it later starting
at that checkpoint. Since Xen tracks every read-only page that's part
of some page table, it can restore domains even if the original physical
pages are now used by something else: it automatically remaps all
MFNs in every page table page it knows about (but the guest kernel
must never store machine page numbers outside of page table pages
- it's the same concept as in garbage collection, where pointers must
only reside in the obvious places).
- Xen can migrate running domains between machines by tracking which
physical pages become dirty as the domain executes. Xen uses shadow
page tables for this: it makes copy-on-write duplicates of the domain's
page tables, and presents these internal tables to the CPU, while
the guest kernel still thinks it's using the original page tables.
Once the migration is complete, the shadow page tables are merged
back into the real page tables (as with a save and restore) and the
domain continues as usual.
- The memory allocation of each domain is elastic: the domain can give
any free pages back to Xen via the ``balloon'' mechanism; these
pages can then be re-assigned to other domains that need more memory
(up to a per-domain limit).
- Domains can share some of their pages with other domains using the
grant mechanism. This is used for zero-copy network and disk
I/O between domain 0 and guest domains.
- Interrupts are delivered using an event channel mechanism,
which is functionally identical to the IO-APIC hardware on the bare
CPU (essentially it's a ``Xen APIC'' instead of the Intel and
AMD models already supported by the guest kernel). Xen sets up a shared
info page containing bit vectors for masked and pending interrupts
(just like an APIC's memory mapped registers), and lets the guest
kernel register an event handler function. Xen then does an upcall
to this function whenever a virtual interrupt arrives; the guest kernel
manipulates the mask and pending bits to ensure race-free notifications.
Xen automatically maps physical IRQs on the APIC to event channels
in domain 0, plus it adds its own virtual interrupts (for the usual
timer and a Xen-specific notification port; use cat /proc/interrupts
on a Linux system under Xen to see this). When the guest domain has
multiple VCPUs, interprocessor interrupts (IPIs) are done through
the Xen event controller in a manner identical to hardware IPIs.
- Xen is unique in that PCI devices can be assigned to any domain, so
for instance each guest domain could have its own dedicated PCI network
card and disk controller - there's no need to relay requests back
to domain 0 in this configuration, although it only works with hardware
that supports IOMMU virtualization (otherwise it's a security risk,
since DMA can be used to bypass Xen's page table protections).
- Xen provides the guest with additional timers, so it can be aware
of both ``wall clock'' time as well as execution time (since there
may be gaps in the latter as other domains use the CPU); this lets
it provide a smooth interactive experience in a way systems like VMware
cannot. The timers are delivered as virtual interrupt events.
- All other features of the paravirtualized architecture perfectly match
x86. The guest kernel can still use most x86 privileged instructions,
such as rdmsr, wrmsr, and control register updates (which
Xen transparently intercepts and validates), and in domain 0, it can
access I/O ports, memory mapped I/O, the normal x86 segmentation (GDT
and LDT) and interrupt mechanisms (IDT), etc. This makes it possible
to run a normal Linux distribution, with totally unmodified drivers
and software, at full native speed (we do just this on all our development
workstations and servers). Benchmarks [9] have
shown Xen to have a ~2-3% performance decrease relative
to a traditional Linux kernel, whereas VMware and similar solutions
yield a 20-70% decrease under heavy I/O.
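The page table rules in the list above (pinned page table pages,
ownership checks on every update, batched mmu_update() hypercalls and
the pseudo-physical numbering) can be sketched in a toy model. Note
that everything here is an illustrative assumption: the ToyXen class,
the shape of the update tuples and the build_p2m() helper are
hypothetical and do not match the real Xen hypercall ABI. In
particular, this sketch validates the whole batch before applying any
entry, which is a simplification chosen for clarity.

```python
# Toy model of validated, batched page table updates. The classes and
# the mmu_update() signature are hypothetical, not the real Xen ABI.

class ToyXen:
    def __init__(self):
        self.owner = {}          # MFN -> owning domain id
        self.page_tables = {}    # page table MFN -> {slot: target MFN}
        self.hypercalls = 0

    def assign(self, domid, mfns):
        for mfn in mfns:
            self.owner[mfn] = domid

    def pin_page_table(self, domid, mfn):
        # Pinned pages are read-only to the guest; only Xen writes them.
        if self.owner.get(mfn) != domid:
            raise PermissionError("cannot pin a page the domain does not own")
        self.page_tables[mfn] = {}

    def mmu_update(self, domid, updates):
        """Apply a batch of (table_mfn, slot, target_mfn) updates."""
        self.hypercalls += 1
        # Validate the whole batch first so a bad entry changes nothing.
        for table_mfn, slot, target_mfn in updates:
            if table_mfn not in self.page_tables:
                raise ValueError("not a pinned page table")
            if self.owner.get(table_mfn) != domid:
                raise PermissionError("page table owned by another domain")
            if self.owner.get(target_mfn) != domid:
                raise PermissionError("target page owned by another domain")
        for table_mfn, slot, target_mfn in updates:
            self.page_tables[table_mfn][slot] = target_mfn


def build_p2m(mfns):
    """Number a domain's scattered MFNs consecutively from 0."""
    p2m = dict(enumerate(sorted(mfns)))
    m2p = {mfn: pfn for pfn, mfn in p2m.items()}
    return p2m, m2p


xen = ToyXen()
xen.assign(1, [100, 205, 317])   # domain 1 owns three scattered frames
xen.assign(2, [101])             # domain 2 owns one frame
xen.pin_page_table(1, 100)

# One hypercall carries two updates.
xen.mmu_update(1, [(100, 0, 205), (100, 1, 317)])

# Mapping another domain's frame is rejected.
try:
    xen.mmu_update(1, [(100, 2, 101)])
    rejected = False
except PermissionError:
    rejected = True

p2m, m2p = build_p2m([100, 205, 317])
```

Here domain 1 installs two mappings with a single hypercall, while its
attempt to map a frame owned by domain 2 is refused. The P2M table
gives the guest a contiguous view (0, 1, 2) of frames that are
scattered in machine memory.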
Xen also supports ``HVM'' (hardware virtual machine) mode, which
is equivalent to what VMware [12], QEMU [10], Bochs
[11] and similar systems provide: nearly perfect emulation
of the x86 architecture and some standard peripherals. The advantage
is that an uncooperative guest OS never knows it's running in a virtual
machine: Windows XP and Mac OS X have been successfully run inside
Xen in this mode. Unfortunately, this mode has a well known performance
cost, even when Xen leverages the specialized hardware support for
full virtualization in newer Intel [15] and AMD [16]
chips. The overhead comes from the requirement that the hypervisor
still trap and emulate all sensitive instructions, whereas paravirtualized
guests can intelligently batch together requests in one hypercall
and can avoid virtual device driver overhead.
Matt T Yourst
2007-09-26