PTLsim Code Base
PTLsim is written in C++ with extensive use of x86 and x86-64 inline
assembly code. It must be compiled with gcc on a Linux 2.6 based x86
or x86-64 machine. The C++ variant used by PTLsim is known as Embedded
C++. Essentially, we only use the features found in C, but add templates,
classes and operator overloading. Other C++ features such as hidden
side effects in constructors, exception handling, RTTI, multiple inheritance,
virtual methods (in most cases), thread local storage and so on are
forbidden since they cannot be adequately controlled in the embedded
"bare hardware" environment in which PTLsim runs, and can result
in poor performance. We have our own standard template library, SuperSTL,
that must be used in place of the C++ STL.
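As a rough illustration of this restricted style, the following sketch uses only templates, classes and operator overloading, and reports failures through return values rather than exceptions. FixedStack is a hypothetical type invented for this example; it is not part of PTLsim or SuperSTL.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical example type (not from the PTLsim sources).
// No exceptions, RTTI, virtual methods or heap allocation are used.
template <typename T, std::size_t N>
struct FixedStack {
  T data[N];
  std::size_t count = 0;

  bool full() const { return count == N; }
  bool empty() const { return count == 0; }

  // Failure is reported through the return value instead of a throw.
  bool push(const T& value) {
    if (full()) return false;
    data[count++] = value;
    return true;
  }

  bool pop(T& out) {
    if (empty()) return false;
    out = data[--count];
    return true;
  }

  // Operator overloading is permitted: index from the bottom of the stack.
  const T& operator[](std::size_t i) const {
    assert(i < count);
    return data[i];
  }
};
```

A caller checks the result of push() and pop() directly, so control flow stays explicit and no runtime support for unwinding is needed.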
Even though the PTLsim code base is very large, it is well organized
and structured for extensibility. The following is an overview
of the source files and subsystems in PTLsim:
- PTLsim Core Subsystems:
- ptlsim.cpp and ptlsim.h
are responsible for general top-level PTLsim tasks and starting the
appropriate simulation core code.
- uopimpl.cpp contains implementations of all
uops and their variations. PTLsim implements most ALU and floating
point uops in assembly language so as to leverage the exact semantics
and flags generated by real x86 instructions, since most PTLsim uops
are so similar to the equivalent x86 instructions. When compiled on
a 32-bit system, some of the 64-bit uops must be emulated using slower
C++ code.
- ptlhwdef.cpp and ptlhwdef.h
define the basic uop encodings, flags and registers. The uop tables
are worth reading to see how a modern x86 processor is designed
at the microcode level. The basic format is discussed in Section 5.1;
all uops are documented in Section 27.
- seqcore.cpp implements the sequential in-order
core. This is a strictly functional core, without data caches, branch
prediction and so forth. Its purpose is to provide fast execution
of the raw uop stream and debugging of issues with the decoder, microcode
or virtual hardware rather than a specific core model.
- Decoder, Microcode and Basic Block Cache:
- decode-core.cpp coordinates the
translation from x86 and x86-64 into uops, maintains the basic block
cache and handles self modifying code, invalidation and other x86
specific complexities.
- decode-fast.cpp decodes the common subset of the
x86 instruction set that accounts for roughly 95% of all instructions,
each of which translates into four or fewer uops. It corresponds to
the "fast path" decoder in a hardware microprocessor.
- decode-complex.cpp decodes complex instructions
into microcode, and provides most of the assists (microcode subroutines)
required by x86 machines.
- decode-sse.cpp decodes all SSE, SSE2, SSE3
and MMX instructions.
- decode-x87.cpp decodes x87 floating point
instructions and provides the associated microcode.
- decode.h contains definitions of the above
functions and classes.
- Out Of Order Core:
- ooocore.cpp is the out of order simulator
control logic. The microarchitectural model implemented by this simulator
is the subject of Part IV.
- ooopipe.cpp implements the discrete pipeline
stages (frontend and backend) of the out of order model.
- oooexec.cpp implements all functional units,
load/store units, and the issue queue and replay logic.
- ooocore.h defines most of the configurable
parameters for the out of order core not intrinsic to the PTLsim uop
instruction set itself.
- dcache.cpp and dcache.h
contain the data cache model. At present the full L1/L2/L3/mem hierarchy
is modeled, along with miss buffers, load fill request queues, ITLB/DTLB
and bus interfaces. The cache hierarchy is highly configurable;
it is described further in Section 25.
- branchpred.cpp and branchpred.h
implement the branch predictor. By default, this is set up as a hybrid bimodal
and history based predictor with various customizable parameters.
- Linux Hosted Kernel Interface:
- kernel.cpp and kernel.h
are where all the virtual machine "black magic" takes
place to let PTLsim transparently switch between simulation and native
mode and 32-bit/64-bit mode (or only 32-bit mode on a 32-bit x86 machine).
In general you should not need to touch this since it is very Linux
kernel specific and works at a level below the standard C/C++ libraries.
- lowlevel-64bit.S contains 64-bit startup
and context switching code. PTLsim execution starts here if run on
an x86-64 system.
- lowlevel-32bit.S contains 32-bit startup
and context switching code. PTLsim execution starts here if run on
a 32-bit x86 system.
- injectcode.cpp is compiled into the 32-bit
and 64-bit code injected into the target process to map the ptlsim
binary and pass control to it.
- loader.h is used to pass information to the
injected boot code.
- PTLsim/X Bare Hardware and Xen Interface:
- ptlxen.cpp brings up PTLsim on the bare hardware,
dispatches traps and interrupts, virtualizes Xen hypercalls, communicates
via DMA with the PTLsim monitor process running in the host domain
0 and otherwise serves as the kernel of PTLsim's own mini operating
system.
- ptlxen-memory.cpp is responsible for all
page based memory operations within PTLsim. It manages PTLsim's own
internal page tables and its physical memory map, and services page
table walks, parts of the x86 microcode and memory-related Xen hypercalls.
- ptlxen-events.cpp provides all interrupt
(VIRQ) and event handling, manages PTLsim's time dilation technology,
and provides all time and event related hypercalls.
- ptlxen-common.cpp provides common functions
used by both PTLsim itself and PTLmon.
- ptlxen.h provides inline functions and defines
related to full system PTLsim/X.
- ptlmon.cpp provides the PTLsim monitor process,
which runs in domain 0 and interfaces with the PTLsim hypervisor code
inside the target domain to allow it to communicate with the outside
world. It uses a client/server architecture to forward control commands
to PTLsim using DMA and Xen hypercalls.
- xen-types.h contains Xen-specific type definitions.
- ptlsim-xen-hypervisor.diff and ptlsim-xen-tools.diff
are patches that must be applied to the Xen hypervisor source tree
and the Xen userspace tools, respectively, to allow PTLsim to be injected
into domains.
- ptlxen.lds and ptlmon.lds
are linker scripts used to lay out the memory image of PTLsim and
PTLmon.
- lowlevel-64bit-xen.S contains the PTLsim/X
boot code, interrupt handling and exception handling.
- ptlctl.cpp is a utility used within
a domain under simulation to control PTLsim.
- ptlcalls.h provides a library of functions
used by code within the target domain to control PTLsim.
- Support Subsystems:
- superstl.h, superstl.cpp
and globals.h implement various standard
library functions and classes as an alternative to C++ STL. These
libraries also contain a number of features very useful for bit manipulation.
- logic.h is a library of C++ templates for
implementing synchronous logic structures like associative arrays,
queues, register files, etc. It has some very clever features like
FullyAssociativeArray8bit, which uses x86 SSE vector
instructions to associatively match and process 16
byte-sized tags every cycle. These classes are fully parameterized
and useful for all kinds of simulations.
- mm.cpp is the PTLsim custom memory manager.
It provides extremely fast memory allocation functions based on multi-threaded
slab caching (the same technique used inside Linux itself) and extent
allocation, along with a traditional physical page allocator. The
memory manager also provides PTLsim's garbage collection system, used
to discard unused or least recently used objects when allocations
fail.
- mathlib.cpp and mathlib.h
provide standard floating point functions suitable for embedded systems
use. These are used heavily as part of the x87 microcode.
- klibc.cpp and klibc.h
provide standard libc-like library functions suitable for use on the
bare hardware.
- syscalls.cpp and syscalls.h
declare all Linux system call stubs. This is also used by PTLsim/X,
which emulates some Linux system calls to make porting easier.
- config.cpp and config.h
manage the parsing of configuration options for each user program.
This is a general purpose library used by both PTLsim itself and the
userspace tools (PTLstats, etc.).
- datastore.cpp and datastore.h
manage the PTLsim statistics data store file structure.
- Userspace Tools:
- ptlstats.cpp is a utility for printing and
analyzing the statistics data store files in various human readable
ways.
- dstbuild is a Perl script used to parse stats.h
and generate the datastore template (Section 8).
- makeusage.cpp is used to capture the usage
text (help screen) for linking into PTLsim.
- cpuid.cpp is a utility program to show various
data returned by the x86 cpuid instruction. Run it
under PTLsim for a surprise.
- glibc.cpp contains miscellaneous userspace
functions.
- ptlcalls.c and ptlcalls.h
are optionally compiled into user programs to let them switch into
and out of simulation mode on their own. The ptlcalls.o
file is typically linked with Fortran programs that can't use regular
C header files.
PTLsim includes a number of powerful C++ templates, macros and functions
not found anywhere else. This section attempts to provide an overview
of these structures so that users of PTLsim will use them instead
of trying to duplicate work we've already done.
The file globals.h contains a wide range of very useful
definitions, functions and macros we have accumulated over the years,
including:
- Basic data types used throughout PTLsim (e.g. W64
for 64-bit words, Waddr for words the same
size as pointers, and so on)
- Type safe C++ template based functions, including min,
max, abs, mux,
etc.
- Iterator macros (foreach)
- Template based metaprogramming functions including lengthof
(finds the length of any static array), offsetof (offset of member
in structure), baseof (member to base of structure), and log2 (takes the
base-2 log of any constant at compile time)
- Floor, ceiling and masking functions for integers and powers of two
(floor, trunc, ceil,
mask, floorptr, ceilptr,
maskptr, signext, etc)
- Bit manipulation macros (bit, bitmask,
bits, lowbits, setbit,
clearbit, assignbit).
Note that the bitvec template (see below) should
be used in place of these macros wherever it is more convenient.
- Comparison functions (aligned, strequal,
inrange, clipto)
- Modulo arithmetic (add_index_modulo, modulo_span,
et al)
- Definitions of basic x86 SSE vector functions (e.g. x86_cpu_pcmpeqb et al)
- Definitions of basic x86 assembly language functions (e.g. x86_bsf64
et al)
- A full suite of bit scanning functions (lsbindex,
msbindex, popcount et
al)
- Miscellaneous functions (arraycopy, setzero,
etc)
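To give a feel for the kind of helpers listed above, here is a minimal sketch of a few of them in portable C++. The names mirror the list where possible, but the signatures and bodies are assumptions made for this example, and log2_floor, floor_pow2 and ceil_pow2 are hypothetical names; the actual definitions in globals.h may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

namespace sketch {  // hypothetical stand-ins, not PTLsim's globals.h

typedef std::uint64_t W64;  // 64-bit word

// Mask with the low n bits set (n may be 0..64).
constexpr W64 bitmask(unsigned n) {
  return (n >= 64) ? ~W64(0) : ((W64(1) << n) - 1);
}

// Extract count bits of x starting at bit start (start must be < 64).
constexpr W64 bits(W64 x, unsigned start, unsigned count) {
  return (x >> start) & bitmask(count);
}

// Keep only the low n bits of x.
constexpr W64 lowbits(W64 x, unsigned n) { return x & bitmask(n); }

// Test a single bit (n must be < 64).
constexpr bool bit(W64 x, unsigned n) { return (x >> n) & 1; }

// Return x with bit n set.
inline W64 setbit(W64 x, unsigned n) {
  assert(n < 64);
  return x | (W64(1) << n);
}

// Floor of the base-2 log; usable on constants at compile time.
constexpr unsigned log2_floor(W64 x) {
  return (x <= 1) ? 0 : 1 + log2_floor(x >> 1);
}

// Number of elements in a static array.
template <typename T, std::size_t N>
constexpr std::size_t lengthof(T (&)[N]) { return N; }

// Round x down or up to a multiple of a power-of-two alignment.
constexpr W64 floor_pow2(W64 x, W64 align) { return x & ~(align - 1); }
constexpr W64 ceil_pow2(W64 x, W64 align) {
  return (x + align - 1) & ~(align - 1);
}

}  // namespace sketch

// Both checks are evaluated entirely by the compiler.
static_assert(sketch::log2_floor(4096) == 12, "4096 == 2^12");
static_assert(sketch::bits(0xABCD, 4, 8) == 0xBC, "bits 4..11 of 0xABCD");
```

Because most of these are constant expressions, they can size arrays or compute field widths for simulated structures at compile time.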
The Super Standard Template Library (SuperSTL) is an internal C++
library we use in lieu of the normal C++ STL for various
technical and preferential reasons. While the full documentation is
in the comments of superstl.h and superstl.cpp,
the following is a brief list of its features:
- I/O stream classes familiar from Standard C++, including istream
and ostream. Unique to SuperSTL is how the
comma operator (",") can be used to separate a list of objects
to send to or from a stream, in addition to the usual C++ insertion
operator ("<<").
- To read and write binary data, the idstream and odstream
classes should be used instead.
- String buffer (stringbuf) class for composing
strings in memory the same way they would be written to or read from
an ostream or istream.
- String formatting classes (intstring, hexstring,
padstring, bitstring,
bytemaskstring, floatstring)
provide a wrapper around objects to exercise greater control of how
they are printed.
- Array (array) template class represents a fixed
size array of objects. It is essentially a simple but very fast wrapper
for a C-style array.
- Bit vector (bitvec) is a heavily optimized
and rewritten version of the Standard C++ bitset
class. It supports many additional operations well suited to logic
design purposes and emphasizes extremely fast branch free code.
- Dynamic Array (dynarray) template class provides
for dynamically sized arrays, stacks and other such structures, similar
to the Standard C++ valarray class.
- Linked list node (listlink) template class
forms the basis of doubly linked list structures in which a single
pointer refers to the head of the list.
- Queue list node (queuelink) template class supports
more operations than listlink and can serve as both
a node in a list and a list head/tail header.
- Index reference (indexref) is a smart pointer which
compresses a full pointer into an index into a specific structure
(made unique by the template parameters). This class behaves exactly
like a pointer when referenced, but takes up much less space and may
be faster. The indexrefnull class adds support for
storing null pointers, which indexref lacks.
- Hashtable class is a general purpose chaining
based hash table with user configurable key hashing and management
via add-on template classes.
- SelfHashtable class is an optimized hashtable
for cases where objects contain their own keys. Its use is highly
recommended instead of Hashtable.
- ChunkList class maintains a linked list of
small data items, but packs many of these items into a chunk, then
chains the chunks together. This is the most cache-friendly way of
maintaining variable length lists.
- CRC32 calculation class is useful for hashing.
- CycleTimer is useful for timing intervals with
sub-nanosecond precision using the CPU cycle counter (discussed in
Section 11.5).
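The index reference idea described above can be sketched as follows. This is only an illustration under assumed names: IndexRef, Entry, entries and EntryRef are hypothetical, and the real indexref in superstl.h may identify its backing structure differently. Like the indexref described above, this sketch cannot represent a null pointer.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch: a reference stored as a small index into one
// specific array, which is fixed by the template parameters.
template <typename T, typename Index, T* Base>
struct IndexRef {
  Index index;

  IndexRef() : index(0) {}

  explicit IndexRef(T* p) : index(static_cast<Index>(p - Base)) {
    // The target must lie in the array and be addressable by Index.
    assert(Base + index == p);
  }

  T* get() const { return Base + index; }
  T& operator*() const { return *get(); }
  T* operator->() const { return get(); }
};

struct Entry { int value; };

// Backing array; 256 entries fit exactly in an 8-bit index.
static Entry entries[256];

typedef IndexRef<Entry, std::uint8_t, entries> EntryRef;

// One byte of storage instead of a full pointer.
static_assert(sizeof(EntryRef) == 1, "EntryRef should occupy one byte");
```

An EntryRef is dereferenced like a pointer (for example r->value), while a structure holding many such references shrinks by the difference between the pointer size and the index size.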
The Logic Standard Template Library (LogicSTL) is an internally developed
add-on to SuperSTL which supports a variety of structures useful for
modeling sequential logic. Some of its primitives may look familiar
to Verilog or VHDL programmers. While the full documentation is in
the comments of logic.h, the following is a brief
list of its features:
- latch template class works like any other assignable
variable, but the new value only becomes visible after the clock()
method is called (potentially from a global clock chain).
- Queue template class implements a general purpose
fixed size queue. The queue supports various operations from both
the head and the tail, and is ideal for modeling queues in microprocessors.
- Iterators for Queue objects such as foreach_forward,
foreach_forward_from, foreach_forward_after,
foreach_backward, foreach_backward_from,
foreach_backward_before.
- HistoryBuffer maintains a shift register of
values, which when combined with a hash function is useful for implementing
predictor histories and the like.
- FullyAssociativeTags template class is a general
purpose array of associative tags in which each tag must be unique.
This class uses highly efficient matching logic and supports pseudo-LRU
eviction, associative invalidation and direct indexing. It forms the
basis for most associative structures in PTLsim.
- FullyAssociativeArray pairs a FullyAssociativeTags
object with actual data values to form the basis of a cache.
- AssociativeArray divides a
FullyAssociativeArray into sets. In effect,
this class can provide a complete cache implementation for a processor.
- LockableFullyAssociativeTags, LockableFullyAssociativeArray
and LockableAssociativeArray provide the same
services as the classes above, but support locking lines into the
cache.
- CommitRollbackCache leverages the LockableFullyAssociativeArray
class to provide a cache structure with the ability to roll back all
changes made to memory (not just within this object, but everywhere)
after a checkpoint is made.
- FullyAssociativeTags8bit and FullyAssociativeTags16bit
work just like FullyAssociativeTags, except
that these classes are dramatically faster when using small 8-bit
and 16-bit tags. This is possible through the clever use of x86 SSE
vector instructions to associatively match and process 16 8-bit tags
or 8 16-bit tags every cycle. In addition, these classes support features
like removing an entry from the middle of the array while compacting
entries around it in constant time. These classes should be used in
place of FullyAssociativeTags whenever the tags are
small enough (i.e. almost all tags except for memory addresses).
- FullyAssociativeTagsNbitOneHot is similar to
FullyAssociativeTagsNbit, but the user must
guarantee that all tags are unique. This property is used to perform
extremely fast matching even with long tags (32+ bits). The tag data
is striped across multiple SSE vectors and matched in parallel, then
a clever adaptation of the sum-of-absolute-differences SSE instruction
is used to extract the single matching element (if any) in O(1) time.
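The latch behavior described at the start of this list can be sketched as below. Latch and ShiftReg2 are hypothetical names written for this example; the real latch template in logic.h may have a different interface, and the global clock chain is not modeled here.

```cpp
// Hypothetical sketch of a clocked variable: writes go to a pending
// value that only becomes visible to readers after clock() is called.
template <typename T>
struct Latch {
  T current;  // value readers see during this cycle
  T pending;  // value written this cycle

  Latch() : current(T()), pending(T()) {}
  explicit Latch(const T& init) : current(init), pending(init) {}

  // Assignment updates only the pending value.
  Latch& operator=(const T& value) {
    pending = value;
    return *this;
  }

  // Latch-to-latch assignment samples the other latch's visible value.
  Latch& operator=(const Latch& other) {
    pending = other.current;
    return *this;
  }

  // Reads return the value from the last clock edge.
  operator const T&() const { return current; }

  // Advance one cycle: the pending value becomes visible.
  void clock() { current = pending; }
};

// A two-stage shift register. Because writes are deferred, the order
// of the two assignments inside step() does not matter.
struct ShiftReg2 {
  Latch<int> stage0, stage1;

  void step(int input) {
    stage0 = input;
    stage1 = stage0;  // samples stage0 as it was before this cycle
    stage0.clock();
    stage1.clock();
  }
};
```

After step(10) followed by step(20), stage0 holds 20 and stage1 holds 10, which is the behavior of two flip-flops in series rather than of two ordinary variables assigned in sequence.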
The out of order simulator, ooocore.h, contains several reusable classes,
including:
- IssueQueue template class can be used to implement
all kinds of broadcast based issue queues.
- StateList and ListOfStateLists
are useful for collecting various lists that objects can be on into
one structure.
Matt T Yourst
2007-09-26