%% LyX 1.5.1 created this file.  For more info, see http://www.lyx.org/.
%% Do not edit unless you really know what you are doing.
\documentclass[12pt,english]{report}
\usepackage{mathptmx}
\usepackage{helvet}
\renewcommand{\ttdefault}{cmtt}
\usepackage[T1]{fontenc}
\usepackage[latin9]{inputenc}
\usepackage{geometry}
\geometry{verbose,letterpaper,tmargin=1in,bmargin=1in,lmargin=1in,rmargin=1in,headheight=0in,headsep=0in,footskip=0.25in}
\setcounter{secnumdepth}{3}
\setcounter{tocdepth}{3}
\setlength{\parskip}{\medskipamount}
\setlength{\parindent}{0pt}
\usepackage{array}
\usepackage{fancybox}
\usepackage{calc}
\usepackage{setspace}

\makeatletter

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% LyX specific LaTeX commands.
\newcommand{\lyxline}[1][1pt]{%
  \par\noindent%
  \rule[.5ex]{\linewidth}{#1}\par}
%% Bold symbol macro for standard LaTeX users
\providecommand{\boldsymbol}[1]{\mbox{\boldmath $#1$}}

%% Because html converters don't know tabularnewline
\providecommand{\tabularnewline}{\\}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Textclass specific LaTeX commands.
\newenvironment{lyxcode}
{\begin{list}{}{
\setlength{\rightmargin}{\leftmargin}
\setlength{\listparindent}{0pt}% needed for AMS classes
\raggedright
\setlength{\itemsep}{0pt}
\setlength{\parsep}{0pt}
\normalfont\ttfamily}%
 \item[]}
{\end{list}}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% User specified LaTeX commands.
\usepackage[pdftitle={PTLsim User's Guide and Reference},colorlinks=true,linkcolor=blue,anchorcolor=blue,citecolor=blue,urlcolor=blue]{hyperref}

\usepackage{babel}
\makeatother

\begin{document}
\noindent \begin{center}
\textbf{\huge ~}
\par\end{center}{\huge \par}

\vfill{}


\noindent \begin{center}
\textsf{\textbf{\Huge PTLsim User's Guide and Reference}}
\par\end{center}{\Huge \par}

\noindent \begin{center}
\emph{\huge The Anatomy of an x86-64 Out of Order}\\
\emph{\huge Superscalar Microprocessor}
\par\end{center}{\huge \par}

\bigskip{}


\noindent \begin{center}
{\LARGE Matt T. Yourst}\\
\texttt{\large <yourst@yourst.com>}
\par\end{center}{\large \par}

\noindent \begin{center}
Revision 20070923\\
\textbf{\emph{Second Edition}}
\par\end{center}

\vfill{}


\noindent \begin{center}
The latest version of PTLsim and this document are always available
at:\textsf{\textbf{\LARGE }}\\
\textsf{\textbf{\LARGE }}\\
\textsf{\textbf{\LARGE www.ptlsim.org}}
\par\end{center}{\LARGE \par}

\bigskip{}


\noindent \begin{center}
$\copyright$ 2007 Matt T. Yourst \texttt{\small <yourst@yourst.com>}.
\par\end{center}

\noindent \begin{center}
The PTLsim software and manual are free software;\\
they are licensed under the GNU General Public License version 2.
\par\end{center}

\tableofcontents{}


\part{\label{part:Introduction}PTLsim User's Guide}


\chapter{Introducing PTLsim}


\section{Introducing PTLsim}

\textbf{PTLsim} is a state of the art cycle accurate microprocessor
simulator and virtual machine for the x86 and x86-64 instruction sets.
PTLsim models a modern superscalar out of order x86-64 compatible
processor core at a configurable level of detail ranging from full-speed
native execution on the host CPU all the way down to RTL level models
of all key pipeline structures. In addition, the complete cache hierarchy,
memory subsystem and supporting hardware devices are modeled with
true cycle accuracy. PTLsim supports the full x86-64 instruction set
of the Pentium 4+, Athlon 64 and similar machines with all extensions
(x86-64, SSE/SSE2/SSE3, MMX, x87). It is currently the only tool available
to the public to support true cycle accurate modeling of real x86
microarchitectures.

PTLsim is very different from most cycle accurate simulators. Because
it runs directly on the same platform it is simulating (an x86 or
x86-64 machine, typically running Linux), it is able to switch between
full out of order simulation mode and native x86 or x86-64 mode at
any time, completely transparently to the running user code.
This lets users quickly profile a small section of the user code without
the overhead of emulating the uninteresting parts, and enables automatic
debugging by finding the divergence point between a real reference
machine and the simulation.

PTLsim comes in two flavors. The classic version runs any 32-bit or
64-bit single threaded userspace Linux application. We have successfully
run a wide array of programs under PTLsim, from typical benchmarks
to graphical applications and network servers.

PTLsim/X runs on the bare hardware and integrates with the Xen hypervisor,
allowing it to provide full system x86-64 simulation, multi-processor
and multi-threading support (SMT and multi-core models), checkpoints,
cycle accurate virtual device timing models, deterministic time dilation,
and much more, all without sacrificing the speed and accuracy inherent
in PTLsim's design. PTLsim/X makes it possible to run any Xen-compatible
operating system under simulation; we have successfully booted arbitrary
Linux distributions and industry standard applications and benchmarks
under PTLsim/X.

Compared to competing simulators, PTLsim provides extremely high performance
even when running in full cycle accurate out of order simulation mode.
Through extensive tuning, cache profiling and the use of x86 specific
accelerated vector operations and instructions, PTLsim significantly
cuts simulation time compared to traditional research simulators.
Even with its optimized core, PTLsim still allows a significant amount
of flexibility for easy experimentation through the use of optimized
C++ template classes and libraries suited to synchronous logic design.


\section{History}

PTLsim was designed and developed by Matt T. Yourst \texttt{\footnotesize <yourst@yourst.com>}
with its beginnings dating back to 2001. The main PTLsim code base,
including the out of order processor model, has been in active development
since 2003 and has been used extensively by our processor design research
group at the State University of New York at Binghamton in addition
to hundreds of major universities, industry research labs and several
well known microprocessor vendors.

PTLsim is not derived from any legacy simulator. It is our hope that
PTLsim will help microprocessor researchers move to a contemporary
and widely used instruction set (x86 and x86-64) with readily available
hardware implementations. This will provide a new option for researchers
stuck with simulation tools supporting only the Alpha or MIPS based
instruction sets, both of which have since been discontinued on real
commercially available hardware (making co-simulation impossible)
with an uncertain future in up to date compiler toolchains.

The PTLsim software and this manual are free software, licensed under
the GNU General Public License version 2.


\chapter{Getting Started}


\section{Documentation Map}

This manual has been divided into several parts:

\begin{itemize}
\item Part \ref{part:Introduction} introduces PTLsim, reviews the x86 architecture,
and describes PTLsim's implementation of x86 in terms of uops, microcode
and internal structures.
\item Part \ref{sec:PTLsimClassic} describes the use and implementation
of userspace PTLsim.

\begin{itemize}
\item If you simply want to \emph{use} PTLsim, this part starts with an
easy to follow \textbf{tutorial}.
\end{itemize}
\item Part \ref{sec:PTLsimFullSystem} describes the use and implementation
of full system PTLsim/X.

\begin{itemize}
\item If you simply want to \emph{use} full system PTLsim/X, this part starts
with an easy to follow \textbf{tutorial}.
\end{itemize}
\item Part \ref{part:OutOfOrderModel} details the design and implementation
of the PTLsim out of order superscalar core model.

\begin{itemize}
\item Read this part if you want to understand and modify PTLsim's out of
order core.
\end{itemize}
\item Part \ref{part:Appendices} is a reference manual for the PTLsim internal
uop instruction set, the performance monitoring events the simulator
supports and a variety of other technical information.
\end{itemize}

\section{Additional Resources}

The latest version of PTLsim and this document are always available
at the PTLsim web site:

\begin{quote}
\textsf{\textbf{\large http://www.ptlsim.org}}{\large \par}
\end{quote}

\chapter{PTLsim Architecture}


\chapter{\label{sec:PTLsimCodeBase}PTLsim Code Base}


\section{Code Base Overview}

PTLsim is written in C++ with extensive use of x86 and x86-64 inline
assembly code. It must be compiled with gcc on a Linux 2.6 based x86
or x86-64 machine. The C++ variant used by PTLsim is known as Embedded
C++. Essentially, we only use the features found in C, but add templates,
classes and operator overloading. Other C++ features such as hidden
side effects in constructors, exception handling, RTTI, multiple inheritance,
virtual methods (in most cases), thread local storage and so on are
forbidden since they cannot be adequately controlled in the embedded
{}``bare hardware'' environment in which PTLsim runs, and can result
in poor performance. We have our own standard template library, SuperSTL,
that must be used in place of the C++ STL.

Even though the PTLsim code base is very large, it is well organized
and structured for extensibility. The following list gives an overview
of the source files and subsystems in PTLsim:

\begin{itemize}
\item \textbf{PTLsim Core Subsystems:}

\begin{itemize}
\item \texttt{\textbf{\small ptlsim.cpp}} and \texttt{\textbf{\small ptlsim.h}}
are responsible for general top-level PTLsim tasks and starting the
appropriate simulation core code.
\item \texttt{\textbf{\small uopimpl.cpp}} contains implementations of all
uops and their variations. PTLsim implements most ALU and floating
point uops in assembly language so as to leverage the exact semantics
and flags generated by real x86 instructions, since most PTLsim uops
are so similar to the equivalent x86 instructions. When compiled on
a 32-bit system, some of the 64-bit uops must be emulated using slower
C++ code.
\item \texttt{\textbf{\small ptlhwdef.cpp}} and \texttt{\textbf{\small ptlhwdef.h}}
define the basic uop encodings, flags and registers. The uop tables
are worth reading to see how a modern x86 processor is designed
at the microcode level. The basic format is discussed in Section \ref{sec:UopIntro};
all uops are documented in Section \ref{sec:UopReference}.
\item \texttt{\textbf{\small seqcore.cpp}} implements the sequential in-order
core. This is a strictly functional core, without data caches, branch
prediction and so forth. Its purpose is to provide fast execution
of the raw uop stream and debugging of issues with the decoder, microcode
or virtual hardware rather than a specific core model.
\end{itemize}
\item \textbf{Decoder, Microcode and Basic Block Cache:}

\begin{itemize}
\item \texttt{\textbf{\small decode-core.cpp}} coordinates the
translation from x86 and x86-64 into uops, maintains the basic block
cache and handles self modifying code, invalidation and other x86
specific complexities.
\item \texttt{\textbf{\small decode-fast.cpp}} decodes the subset of the
x86 instruction set that accounts for roughly 95\% of all instructions
encountered, each translating into four or fewer uops. It corresponds
to the {}``fast path'' decoder in a hardware microprocessor.
\item \texttt{\textbf{\small decode-complex.cpp}} decodes complex instructions
into microcode, and provides most of the assists (microcode subroutines)
required by x86 machines.
\item \texttt{\textbf{\small decode-sse.cpp}} decodes all SSE, SSE2, SSE3
and MMX instructions.
\item \texttt{\textbf{\small decode-x87.cpp}} decodes x87 floating point
instructions and provides the associated microcode.
\item \texttt{\textbf{\small decode.h}} contains definitions of the above
functions and classes.
\end{itemize}
\item \textbf{Out Of Order Core:}

\begin{itemize}
\item \texttt{\textbf{\small ooocore.cpp}} is the out of order simulator
control logic. The microarchitectural model implemented by this simulator
is the subject of Part \ref{part:OutOfOrderModel}.
\item \texttt{\textbf{\small ooopipe.cpp}} implements the discrete pipeline
stages (frontend and backend) of the out of order model.
\item \texttt{\textbf{\small oooexec.cpp}} implements all functional units,
load/store units, issue queues and replay logic.
\item \texttt{\textbf{\small ooocore.h}} defines most of the configurable
parameters for the out of order core not intrinsic to the PTLsim uop
instruction set itself.
\item \texttt{\textbf{\small dcache.cpp}} and \texttt{\textbf{\small dcache.h}}
contain the data cache model. At present the full L1/L2/L3/mem hierarchy
is modeled, along with miss buffers, load fill request queues, ITLB/DTLB
and bus interfaces. The cache hierarchy is highly configurable;
it is described further in Section \ref{sec:CacheHierarchy}.
\item \texttt{\textbf{\small branchpred.cpp}} and \texttt{\textbf{\small branchpred.h}}
implement the branch predictor. By default, this is set up as a hybrid
bimodal and history based predictor with various customizable parameters.
\end{itemize}
\item \textbf{Linux Hosted Kernel Interface:}

\begin{itemize}
\item \texttt{\textbf{\small kernel.cpp}} and \texttt{\textbf{\small kernel.h}}
are where all the virtual machine {}``black magic'' takes
place to let PTLsim transparently switch between simulation and native
mode and 32-bit/64-bit mode (or only 32-bit mode on a 32-bit x86 machine).
In general you should not need to touch this since it is very Linux
kernel specific and works at a level below the standard C/C++ libraries.
\item \texttt{\textbf{\small lowlevel-64bit.S}} contains 64-bit startup
and context switching code. PTLsim execution starts here if run on
an x86-64 system.
\item \texttt{\textbf{\small lowlevel-32bit.S}} contains 32-bit startup
and context switching code. PTLsim execution starts here if run on
a 32-bit x86 system.
\item \texttt{\textbf{\small injectcode.cpp}} is compiled into the 32-bit
and 64-bit code injected into the target process to map the \texttt{\small ptlsim}
binary and pass control to it.
\item \texttt{\textbf{\small loader.h}} is used to pass information to the
injected boot code.
\end{itemize}
\item \textbf{PTLsim/X Bare Hardware and Xen Interface:}

\begin{itemize}
\item \texttt{\textbf{\small ptlxen.cpp}} brings up PTLsim on the bare hardware,
dispatches traps and interrupts, virtualizes Xen hypercalls, communicates
via DMA with the PTLsim monitor process running in the host domain
0 and otherwise serves as the kernel of PTLsim's own mini operating
system.
\item \texttt{\textbf{\small ptlxen-memory.cpp}} is responsible for all
page based memory operations within PTLsim. It manages PTLsim's own
internal page tables and its physical memory map, and services page
table walks, parts of the x86 microcode and memory-related Xen hypercalls.
\item \texttt{\textbf{\small ptlxen-events.cpp}} provides all interrupt
(VIRQ) and event handling, manages PTLsim's time dilation technology,
and provides all time and event related hypercalls.
\item \texttt{\textbf{\small ptlxen-common.cpp}} provides common functions
used by both PTLsim itself and PTLmon.
\item \texttt{\textbf{\small ptlxen.h}} provides inline functions and defines
related to full system PTLsim/X.
\item \texttt{\textbf{\small ptlmon.cpp}} provides the PTLsim monitor process,
which runs in domain 0 and interfaces with the PTLsim hypervisor code
inside the target domain to allow it to communicate with the outside
world. It uses a client/server architecture to forward control commands
to PTLsim using DMA and Xen hypercalls.
\item \texttt{\textbf{\small xen-types.h}} contains Xen-specific type definitions.
\item \texttt{\textbf{\small ptlsim-xen-hypervisor.diff}} and \texttt{\textbf{\small ptlsim-xen-tools.diff}}
are patches that must be applied to the Xen hypervisor source tree
and the Xen userspace tools, respectively, to allow PTLsim to be injected
into domains.
\item \texttt{\textbf{\small ptlxen.lds}} and \texttt{\textbf{\small ptlmon.lds}}
are linker scripts used to lay out the memory image of PTLsim and
PTLmon.
\item \texttt{\textbf{\small lowlevel-64bit-xen.S}} contains the PTLsim/X
boot code, interrupt handling and exception handling.
\item \texttt{\textbf{\small ptlctl.cpp}} is a utility used within
a domain under simulation to control PTLsim.
\item \texttt{\textbf{\small ptlcalls.h}} provides a library of functions
used by code within the target domain to control PTLsim.
\end{itemize}
\item \textbf{Support Subsystems:}

\begin{itemize}
\item \texttt{\textbf{\small superstl.h}}, \texttt{\textbf{\small superstl.cpp}}
and \texttt{\textbf{\small globals.h}} implement various standard
library functions and classes as an alternative to C++ STL. These
libraries also contain a number of features very useful for bit manipulation.
\item \texttt{\textbf{\small logic.h}} is a library of C++ templates for
implementing synchronous logic structures like associative arrays,
queues, register files, etc. It has some very clever features like
\texttt{\small FullyAssociativeArray8bit}, which uses x86 SSE vector
instructions to associatively match and process 16 byte-sized tags
every cycle. These classes are fully parameterized
and useful for all kinds of simulations.
\item \texttt{\textbf{\small mm.cpp}} is the PTLsim custom memory manager.
It provides extremely fast memory allocation functions based on multi-threaded
slab caching (the same technique used inside Linux itself) and extent
allocation, along with a traditional physical page allocator. The
memory manager also provides PTLsim's garbage collection system, used
to discard unused or least recently used objects when allocations
fail.
\item \texttt{\textbf{\small mathlib.cpp}} and \texttt{\textbf{\small mathlib.h}}
provide standard floating point functions suitable for embedded systems
use. These are used heavily as part of the x87 microcode.
\item \texttt{\textbf{\small klibc.cpp}} and \texttt{\textbf{\small klibc.h}}
provide standard libc-like library functions suitable for use on the
bare hardware.
\item \texttt{\textbf{\small syscalls.cpp}} and \texttt{\textbf{\small syscalls.h}}
declare all Linux system call stubs. This is also used by PTLsim/X,
which emulates some Linux system calls to make porting easier.
\item \texttt{\textbf{\small config.cpp}} and \texttt{\textbf{\small config.h}}
manage the parsing of configuration options for each user program.
This is a general purpose library used by both PTLsim itself and the
userspace tools (PTLstats, etc).
\item \texttt{\textbf{\small datastore.cpp}} and \texttt{\textbf{\small datastore.h}}
manage the PTLsim statistics data store file structure.
\end{itemize}
\item \textbf{Userspace Tools:}

\begin{itemize}
\item \texttt{\textbf{\small ptlstats.cpp}} is a utility for printing and
analyzing the statistics data store files in various human readable
ways.
\item \texttt{\textbf{\small dstbuild}} is a Perl script used to parse \texttt{\small stats.h}
and generate the datastore template (Section \ref{sec:StatisticsInfrastructure}).
\item \texttt{\textbf{\small makeusage.cpp}} is used to capture the usage
text (help screen) for linking into PTLsim.
\item \texttt{\textbf{\small cpuid.cpp}} is a utility program to show various
data returned by the x86 \texttt{\small cpuid} instruction. Run it
under PTLsim for a surprise.
\item \texttt{\textbf{\small glibc.cpp}} contains miscellaneous userspace
functions.
\item \texttt{\textbf{\small ptlcalls.c}} and \texttt{\textbf{\small ptlcalls.h}}
are optionally compiled into user programs to let them switch into
and out of simulation mode on their own. The \texttt{\textbf{\small ptlcalls.o}}
file is typically linked with Fortran programs that can't use regular
C header files.
\end{itemize}
\end{itemize}

\section{Common Libraries and Logic Design APIs}

PTLsim includes a number of powerful C++ templates, macros and functions
not found anywhere else. This section attempts to provide an overview
of these structures so that users of PTLsim will use them instead
of trying to duplicate work we've already done.


\subsection{General Purpose Macros}

The file \texttt{\small globals.h} contains a wide range of very useful
definitions, functions and macros we have accumulated over the years,
including:

\begin{itemize}
\item Basic data types used throughout PTLsim (e.g. \texttt{\footnotesize W64}
for 64-bit words, \texttt{\footnotesize Waddr} for words the same
size as pointers, and so on)
\item Type safe C++ template based functions, including \texttt{\footnotesize min},
\texttt{\footnotesize max}, \texttt{\footnotesize abs}, \texttt{\footnotesize mux},
etc.
\item Iterator macros (\texttt{\footnotesize foreach}) 
\item Template based metaprogramming functions including \texttt{\footnotesize lengthof}
(finds the length of any static array), \texttt{\footnotesize offsetof}
(offset of member in structure), \texttt{\footnotesize baseof} (member
to base of structure), and \texttt{\footnotesize log2} (takes the
base-2 log of any constant at compile time)
\item Floor, ceiling and masking functions for integers and powers of two
(\texttt{\footnotesize floor}, \texttt{\footnotesize trunc}, \texttt{\footnotesize ceil},
\texttt{\footnotesize mask}, \texttt{\footnotesize floorptr}, \texttt{\footnotesize ceilptr},
\texttt{\footnotesize maskptr}, \texttt{\footnotesize signext}, etc)
\item Bit manipulation macros (\texttt{\footnotesize bit}, \texttt{\footnotesize bitmask},
\texttt{\footnotesize bits}, \texttt{\footnotesize lowbits}, \texttt{\footnotesize setbit},
\texttt{\footnotesize clearbit}, \texttt{\footnotesize assignbit}).
Note that the \texttt{\footnotesize bitvec} template (see below) should
be used in place of these macros wherever it is more convenient.
\item Comparison functions (\texttt{\footnotesize aligned}, \texttt{\footnotesize strequal},
\texttt{\footnotesize inrange}, \texttt{\footnotesize clipto})
\item Modulo arithmetic (\texttt{\footnotesize add\_index\_modulo}, \texttt{\footnotesize modulo\_span},
et al)
\item Definitions of basic x86 SSE vector functions (\texttt{\footnotesize x86\_cpu\_pcmpeqb}
et al)
\item Definitions of basic x86 assembly language functions (\texttt{\footnotesize x86\_bsf64}
et al)
\item A full suite of bit scanning functions (\texttt{\footnotesize lsbindex},
\texttt{\footnotesize msbindex}, \texttt{\footnotesize popcount} et
al)
\item Miscellaneous functions (\texttt{\footnotesize arraycopy}, \texttt{\footnotesize setzero},
etc)
\end{itemize}
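
The bit manipulation and rounding helpers listed above are easiest
to understand by example. The following is only a minimal sketch,
written under assumed semantics: every name in the \texttt{\small sketch}
namespace (in particular \texttt{\small floor\_to}, \texttt{\small ceil\_to}
and \texttt{\small log2\_of}) is a hypothetical stand-in, and the
real definitions in \texttt{\small globals.h} may differ in naming,
argument order and implementation.

\begin{verbatim}
#include <cassert>
#include <stdint.h>

namespace sketch {

typedef uint64_t W64;  // 64-bit word, mirroring the W64 type above

// Mask with the low l bits set (valid for l = 0..64).
constexpr W64 bitmask(unsigned l) {
  return (l >= 64) ? ~W64(0) : ((W64(1) << l) - 1);
}

// Value (0 or 1) of bit i of x (i must be below 64).
constexpr W64 bit(W64 x, unsigned i) { return (x >> i) & 1; }

// Field of l bits starting at bit i of x (i must be below 64).
constexpr W64 bits(W64 x, unsigned i, unsigned l) {
  return (x >> i) & bitmask(l);
}

// Low l bits of x.
constexpr W64 lowbits(W64 x, unsigned l) { return x & bitmask(l); }

// Round x down or up to a multiple of a; a must be a power of two.
constexpr W64 floor_to(W64 x, W64 a) { return x & ~(a - 1); }
constexpr W64 ceil_to(W64 x, W64 a) {
  return (x + (a - 1)) & ~(a - 1);
}

// Base-2 log of a nonzero constant, evaluated at compile time.
template <W64 N> struct log2_of {
  static const int value = 1 + log2_of<(N >> 1)>::value;
};
template <> struct log2_of<1> { static const int value = 0; };

}  // namespace sketch
\end{verbatim}

Under these assumptions, \texttt{\small sketch::bits(0xABCD, 4, 8)}
extracts the byte \texttt{\small 0xBC}, and \texttt{\small sketch::ceil\_to(0x1234, 0x1000)}
rounds an address up to the next 4096-byte page boundary, \texttt{\small 0x2000}.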

\subsection{Super Standard Template Library (SuperSTL)}

The Super Standard Template Library (SuperSTL) is a C++ library
we use internally in lieu of the normal C++ STL for various
technical and preferential reasons. While the full documentation is
in the comments of \texttt{\small superstl.h} and \texttt{\small superstl.cpp},
the following is a brief list of its features:

\begin{itemize}
\item I/O stream classes familiar from Standard C++, including \texttt{\footnotesize istream}
and \texttt{\footnotesize ostream}. Unique to SuperSTL is how the
comma operator ({}``,'') can be used to separate a list of objects
to send to or from a stream, in addition to the usual C++ insertion
operator ({}``<\,{}<'').
\item To read and write binary data, the \texttt{\small idstream} and \texttt{\small odstream}
classes should be used instead.
\item String buffer (\texttt{\footnotesize stringbuf}) class for composing
strings in memory the same way they would be written to or read from
an \texttt{\footnotesize ostream} or \texttt{\footnotesize istream}.
\item String formatting classes (\texttt{\footnotesize intstring}, \texttt{\footnotesize hexstring},
\texttt{\footnotesize padstring}, \texttt{\footnotesize bitstring},
\texttt{\footnotesize bytemaskstring}, \texttt{\footnotesize floatstring})
provide a wrapper around objects to exercise greater control of how
they are printed.
\item Array (\texttt{\footnotesize array}) template class represents a fixed
size array of objects. It is essentially a simple but very fast wrapper
for a C-style array.
\item Bit vector (\texttt{\footnotesize bitvec}) is a heavily optimized
and rewritten version of the Standard C++ \texttt{\footnotesize bitset}
class. It supports many additional operations well suited to logic
design purposes and emphasizes extremely fast branch free code.
\item Dynamic Array (\texttt{\footnotesize dynarray}) template class provides
for dynamically sized arrays, stacks and other such structures, similar
to the Standard C++ \texttt{\small valarray} class.
\item Linked list node (\texttt{\footnotesize listlink}) template class
forms the basis of doubly linked list structures in which a single
pointer refers to the head of the list.
\item Queue list node (\texttt{\small queuelink}) template class supports
more operations than \texttt{\small listlink} and can serve as both
a node in a list and a list head/tail header.
\item Index reference (\texttt{\small indexref}) is a smart pointer which
compresses a full pointer into an index into a specific structure
(made unique by the template parameters). This class behaves exactly
like a pointer when referenced, but takes up much less space and may
be faster. The \texttt{\small indexrefnull} class adds support for
storing null pointers, which \texttt{\small indexref} lacks.
\item \texttt{\footnotesize Hashtable} class is a general purpose chaining
based hash table with user configurable key hashing and management
via add-on template classes.
\item \texttt{\footnotesize SelfHashtable} class is an optimized hashtable
for cases where objects contain their own keys. Its use is highly
recommended instead of \texttt{\footnotesize Hashtable}.
\item \texttt{\footnotesize ChunkList} class maintains a linked list of
small data items, but packs many of these items into a chunk, then
chains the chunks together. This is the most cache-friendly way of
maintaining variable length lists.
\item \texttt{\footnotesize CRC32} calculation class is useful for hashing
\item \texttt{\footnotesize CycleTimer} is useful for timing intervals with
sub-nanosecond precision using the CPU cycle counter (discussed in
Section \ref{sec:Timing}).
\end{itemize}
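
The comma-separated stream syntax mentioned above can be illustrated
with a small sketch. This is not the SuperSTL implementation: \texttt{\small demo\_stream}
and \texttt{\small format\_commit} are hypothetical names, and the
sketch uses \texttt{\small std::string} purely to stay self-contained,
which the real library would not do. It only shows how an overloaded
comma operator can forward each operand to the usual insertion operator.

\begin{verbatim}
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical stand-in for an output stream; appends to a string.
struct demo_stream {
  std::string buf;
};

inline demo_stream& operator<<(demo_stream& os, const char* s) {
  os.buf += s;
  return os;
}

inline demo_stream& operator<<(demo_stream& os, long v) {
  char tmp[32];
  std::snprintf(tmp, sizeof(tmp), "%ld", v);
  os.buf += tmp;
  return os;
}

inline demo_stream& operator<<(demo_stream& os, int v) {
  return os << static_cast<long>(v);
}

// The comma operator forwards its right operand to operator<<.
template <typename T>
inline demo_stream& operator,(demo_stream& os, const T& v) {
  return os << v;
}

// Example: the first item uses <<, the rest are comma separated.
inline std::string format_commit(int uops, long cycle) {
  demo_stream os;
  os << "committed ", uops, " uops at cycle ", cycle;
  return os.buf;
}
\end{verbatim}

Because the insertion operator binds more tightly than the comma,
the first item is inserted normally and every following comma then
chains onto the stream reference it returns.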

\subsection{Logic Standard Template Library (LogicSTL)}

The Logic Standard Template Library (LogicSTL) is an internally developed
add-on to SuperSTL which supports a variety of structures useful for
modeling sequential logic. Some of its primitives may look familiar
to Verilog or VHDL programmers. While the full documentation is in
the comments of \texttt{\small logic.h}, the following is a brief
list of its features:

\begin{itemize}
\item \texttt{\footnotesize latch} template class works like any other assignable
variable, but the new value only becomes visible after the \texttt{\footnotesize clock()}
method is called (potentially from a global clock chain).
\item \texttt{\footnotesize Queue} template class implements a general purpose
fixed size queue. The queue supports various operations from both
the head and the tail, and is ideal for modeling queues in microprocessors.
\item Iterators for \texttt{\footnotesize Queue} objects such as \texttt{\footnotesize foreach\_forward},
\texttt{\footnotesize foreach\_forward\_from}, \texttt{\footnotesize foreach\_forward\_after},
\texttt{\footnotesize foreach\_backward}, \texttt{\footnotesize foreach\_backward\_from},
\texttt{\footnotesize foreach\_backward\_before}.
\item \texttt{\footnotesize HistoryBuffer} maintains a shift register of
values, which when combined with a hash function is useful for implementing
predictor histories and the like.
\item \texttt{\footnotesize FullyAssociativeTags} template class is a general
purpose array of associative tags in which each tag must be unique.
This class uses highly efficient matching logic and supports pseudo-LRU
eviction, associative invalidation and direct indexing. It forms the
basis for most associative structures in PTLsim.
\item \texttt{\footnotesize FullyAssociativeArray} pairs a \texttt{\footnotesize FullyAssociativeTags}
object with actual data values to form the basis of a cache.
\item \texttt{\footnotesize AssociativeArray} divides a
\texttt{\footnotesize FullyAssociativeArray} into sets. In effect,
this class can provide a complete cache implementation for a processor.
\item \texttt{\footnotesize LockableFullyAssociativeTags}, \texttt{\footnotesize LockableFullyAssociativeArray}
and \texttt{\footnotesize LockableAssociativeArray} provide the same
services as the classes above, but support locking lines into the
cache.
\item \texttt{\footnotesize CommitRollbackCache} leverages the \texttt{\footnotesize LockableFullyAssociativeArray}
class to provide a cache structure with the ability to roll back all
changes made to memory (not just within this object, but everywhere)
after a checkpoint is made.
\item \texttt{\footnotesize FullyAssociativeTags8bit} and \texttt{\footnotesize FullyAssociativeTags16bit}
work just like \texttt{\footnotesize FullyAssociativeTags}, except
that these classes are dramatically faster when using small 8-bit
and 16-bit tags. This is possible through the clever use of x86 SSE
vector instructions to associatively match and process 16 8-bit tags
or 8 16-bit tags every cycle. In addition, these classes support features
like removing an entry from the middle of the array while compacting
entries around it in constant time. These classes should be used in
place of \texttt{\small FullyAssociativeTags} whenever the tags are
small enough (i.e. almost all tags except for memory addresses).
\item \texttt{\footnotesize FullyAssociativeTagsNbitOneHot} is similar to
\texttt{\footnotesize FullyAssociativeTagsNbit}, but the user must
guarantee that all tags are unique. This property is used to perform
extremely fast matching even with long tags (32+ bits). The tag data
is striped across multiple SSE vectors and matched in parallel, then
a clever adaptation of the sum-of-absolute-differences SSE instruction
is used to extract the single matching element (if any) in O(1) time.
\end{itemize}
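
To make the SSE based tag matching described above more concrete,
the following sketch compares one 8-bit tag against 16 stored tags
in a single vector comparison. It is an illustration under our own
assumptions, not the code in \texttt{\small logic.h}: \texttt{\small DemoTags16x8}
is a hypothetical name, and a scalar fallback is included only so
the sketch also compiles where SSE2 is unavailable.

\begin{verbatim}
#include <cassert>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Hypothetical 16-entry array of 8-bit tags.
struct DemoTags16x8 {
  uint8_t tags[16];

  // Returns a 16-bit mask; bit i is set when tags[i] == target.
  unsigned match(uint8_t target) const {
#if defined(__SSE2__)
    // Load all 16 tags and replicate the target into every byte.
    __m128i all =
      _mm_loadu_si128(reinterpret_cast<const __m128i*>(tags));
    __m128i key = _mm_set1_epi8(static_cast<char>(target));
    // Byte-wise compare, then gather one result bit per byte.
    __m128i eq = _mm_cmpeq_epi8(all, key);
    return static_cast<unsigned>(_mm_movemask_epi8(eq));
#else
    unsigned mask = 0;
    for (int i = 0; i < 16; i++)
      mask |= static_cast<unsigned>(tags[i] == target) << i;
    return mask;
#endif
  }
};
\end{verbatim}

The resulting bit mask has one bit per slot, so a bit scan (such
as the \texttt{\small lsbindex} function listed earlier) would
then yield the index of the matching entry.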

\subsection{Miscellaneous Code}

The out of order simulator header, \texttt{\small ooocore.h}, contains
several reusable classes, including:

\begin{itemize}
\item \texttt{\footnotesize IssueQueue} template class can be used to implement
all kinds of broadcast based issue queues
\item \texttt{\footnotesize StateList} and \texttt{\footnotesize ListOfStateLists}
are useful for collecting the various lists an object can be on into
one structure.
\end{itemize}
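
For readers unfamiliar with the term, a broadcast based issue queue
can be sketched as follows. This is a deliberately simplified illustration
and not the \texttt{\small IssueQueue} template from \texttt{\small ooocore.h}:
the class name \texttt{\small DemoIssueQueue}, its methods and its
oldest-slot-first selection policy are all assumptions made for the
example.

\begin{verbatim}
#include <cassert>

// Hypothetical sketch of a broadcast based issue queue.
template <int SIZE, int OPERANDS>
struct DemoIssueQueue {
  struct Slot {
    bool valid;
    int id;                // caller-chosen id of the queued uop
    int tag[OPERANDS];     // register tag each operand waits on
    bool ready[OPERANDS];  // true once that operand is available
  };
  Slot slots[SIZE];

  DemoIssueQueue() {
    for (int i = 0; i < SIZE; i++) slots[i].valid = false;
  }

  // Insert a uop. A negative tag means the operand is already
  // ready. Returns the slot index, or -1 if the queue is full.
  int insert(int id, const int (&tags)[OPERANDS]) {
    for (int i = 0; i < SIZE; i++) {
      if (slots[i].valid) continue;
      slots[i].valid = true;
      slots[i].id = id;
      for (int j = 0; j < OPERANDS; j++) {
        slots[i].tag[j] = tags[j];
        slots[i].ready[j] = (tags[j] < 0);
      }
      return i;
    }
    return -1;
  }

  // Broadcast a completed tag: wake every operand waiting on it.
  void broadcast(int tag) {
    for (int i = 0; i < SIZE; i++) {
      if (!slots[i].valid) continue;
      for (int j = 0; j < OPERANDS; j++)
        if (slots[i].tag[j] == tag) slots[i].ready[j] = true;
    }
  }

  // Remove and return the id of the first slot whose operands
  // are all ready, or -1 if nothing can issue this cycle.
  int issue() {
    for (int i = 0; i < SIZE; i++) {
      if (!slots[i].valid) continue;
      bool all = true;
      for (int j = 0; j < OPERANDS; j++) all = all && slots[i].ready[j];
      if (all) { slots[i].valid = false; return slots[i].id; }
    }
    return -1;
  }
};
\end{verbatim}

Each completing uop broadcasts its destination tag to every slot;
a waiting uop becomes eligible to issue only once all of its operand
tags have been seen.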

\chapter{\label{part:x86andUops}x86 Instructions and Micro-Ops (uops)}


\section{\label{sec:UopIntro}Micro-Ops (uops) and TransOps}

PTLsim presents to the target code a full implementation of the x86
and x86-64 instruction sets (both 32-bit and 64-bit modes), including
most user and kernel level instructions supported by the Intel Pentium
4 and AMD K8 microprocessors (i.e. all standard instructions, SSE/SSE2,
x86-64 and most of x87 FP). At the present stage of development, the
vast majority of all userspace and 32-bit/64-bit privileged instructions
are supported.

The x86 instruction set is based on the two-operand CISC concept of
load-and-compute and load-compute-store. However, no modern x86 processor
(PTLsim included) executes complex x86 instructions directly.
Instead, these processors translate each x86 instruction into a series
of micro-operations (\emph{uops}) very similar to classical load-store
RISC instructions. Unlike x86 instructions, uops can be executed very
efficiently on an out of order core. In PTLsim, uops have three
source registers and one destination register. They may generate a
64-bit result and various x86 status flags, or may be loads, stores
or branches.
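
As an illustration of this translation, the sketch below breaks a
load-compute-store instruction of the form \texttt{\small add {[}rbx{]},rax}
into three RISC-like uops and executes them on a toy register file.
Everything here is a simplified assumption made for the example:
the \texttt{\small DemoUop} structure, the opcode and register names
and the operand roles of the store are hypothetical and do not reproduce
PTLsim's actual uop encoding.

\begin{verbatim}
#include <cassert>
#include <stdint.h>

enum DemoOpcode { OP_LD, OP_ADD, OP_ST };
enum { REG_RAX, REG_RBX, REG_TEMP0, REG_ZERO, REG_COUNT };

// One destination and three sources, as described in the text.
struct DemoUop {
  DemoOpcode op;
  int rd;          // destination register
  int ra, rb, rc;  // sources; unused ones are REG_ZERO
};

// add [base],src  becomes:
//   ld  temp0 = [base]
//   add temp0 = temp0, src
//   st  [base] = temp0
inline int translate_add_mem_reg(int base, int src, DemoUop out[3]) {
  DemoUop ld = { OP_LD, REG_TEMP0, base, REG_ZERO, REG_ZERO };
  DemoUop add = { OP_ADD, REG_TEMP0, REG_TEMP0, src, REG_ZERO };
  DemoUop st = { OP_ST, REG_ZERO, base, REG_ZERO, REG_TEMP0 };
  out[0] = ld;
  out[1] = add;
  out[2] = st;
  return 3;
}

// Toy executor: memory is an array of words indexed by address.
inline void execute(const DemoUop* uops, int n,
                    uint64_t* regs, uint64_t* mem) {
  for (int i = 0; i < n; i++) {
    const DemoUop& u = uops[i];
    switch (u.op) {
      case OP_LD:  regs[u.rd] = mem[regs[u.ra]]; break;
      case OP_ADD: regs[u.rd] = regs[u.ra] + regs[u.rb]; break;
      case OP_ST:  mem[regs[u.ra]] = regs[u.rc]; break;
    }
  }
}
\end{verbatim}

The temporary register is invisible to x86 code; it exists only to
carry the loaded value between the three uops.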

The x86 instruction decoding process initially generates translated
uops (\emph{transops}), which have a slightly different structure
from the true uops used in the processor core. Specifically, sources
and destinations are represented as un-renamed architectural registers
(or special temporary register numbers), and each transop carries
additional information needed only during renaming and retirement.
TransOps (represented by the \texttt{\small TransOp} structure) consist
of the following fields:

\begin{itemize}
\item \texttt{\footnotesize som}: Start of Macro-Op. Since x86 instructions
may consist of multiple transops, the first transop in the sequence
has its \texttt{\small som} bit set to indicate this.
\item \texttt{\footnotesize eom}: End of Macro-Op. This bit is set for the
last transop in a given x86 instruction (which may also be the first
uop for single-uop instructions)
\item \texttt{\footnotesize bytes}: Number of bytes in the corresponding
x86 instruction (1-15). The same \texttt{\footnotesize bytes} field
value is present in all uops comprising an x86 instruction.
\item \texttt{\footnotesize opcode}: the uop (not x86) opcode
\item \texttt{\footnotesize size}: the effective operation size (0-3, for
1/2/4/8 bytes)
\item \texttt{\footnotesize cond}\texttt{\small :} the x86 condition code
for branches, selects, sets, etc. For loads and stores, this field
is reused to specify unaligned access information as described later.
\item \texttt{\footnotesize setflags}: subset of the x86 flags set by this
uop (see Section \ref{sub:FlagsManagement})
\item \texttt{\footnotesize internal}: set for certain microcode operations.
For instance, loads and stores marked internal access on-chip registers
or buffers invisible to x86 code (e.g. machine state registers, segmentation
caches, floating point constant tables, etc).
\item \texttt{\footnotesize rd}, \texttt{\footnotesize ra}, \texttt{\footnotesize rb},
\texttt{\footnotesize rc}: the architectural source and destination
registers (see Section \ref{sub:RegisterRenaming})
\item \texttt{\footnotesize extshift}: shift amount (0-3 bits) used for
shifted adds (x86 memory addressing and LEA). The \texttt{\footnotesize rc}
operand is shifted left by this amount.
\item \texttt{\footnotesize cachelevel}: used for prefetching and non-temporal
loads and stores
\item \texttt{\footnotesize rbimm} and \texttt{\footnotesize rcimm}: signed
64-bit immediates for the rb and rc operands. These are selected by
specifying the special constant \texttt{\footnotesize REG\_imm} in
the \texttt{\footnotesize rb} and \texttt{\footnotesize rc} fields,
respectively.
\item \texttt{\footnotesize riptaken}: for branches only, the 64-bit target
RIP of the branch if it were taken.
\item \texttt{\footnotesize ripseq}: for branches only, the 64-bit sequential
RIP of the branch if it were not taken.
\end{itemize}
Appendix \ref{sec:UopReference} describes the semantics and encoding
of all uops supported by the PTLsim processor model. The following
is an overview of the common features of these uops and how they are
used to synthesize specific x86 instructions.
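
As a minimal sketch of how the \texttt{\footnotesize som}, \texttt{\footnotesize eom}
and \texttt{\footnotesize bytes} fields delimit x86 instructions within
a transop sequence, consider the following. The \texttt{\footnotesize MiniTransOp}
structure and both helper functions are hypothetical and trimmed down
for illustration; they are not PTLsim's actual \texttt{\footnotesize TransOp}
definition.

\begin{verbatim}
```cpp
#include <cstdint>
#include <vector>

// Hypothetical, trimmed-down stand-in for a transop: only the fields
// needed to show how som/eom/bytes delimit x86 instructions.
struct MiniTransOp {
  bool som;       // set on the first transop of an x86 instruction
  bool eom;       // set on the last transop of an x86 instruction
  uint8_t bytes;  // x86 instruction length, repeated in all of its transops
};

// Count complete x86 instructions: exactly one EOM transop per instruction.
inline unsigned x86_insn_count(const std::vector<MiniTransOp>& ops) {
  unsigned n = 0;
  for (const MiniTransOp& op : ops)
    if (op.eom) ++n;
  return n;
}

// Total x86 code bytes covered, counting each instruction once at its SOM.
inline unsigned x86_bytes_covered(const std::vector<MiniTransOp>& ops) {
  unsigned total = 0;
  for (const MiniTransOp& op : ops)
    if (op.som) total += op.bytes;
  return total;
}
```
\end{verbatim}

A three-uop instruction of 3 bytes followed by a single-uop instruction
of 1 byte therefore counts as two instructions covering 4 bytes.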


\section{Load-Execute-Store Operations}

Simple integer and floating point operations are fairly straightforward
to decode into loads, stores and ALU operations; a typical load-op-store
ALU operation will consist of a load to fetch one operand, the ALU
operation itself, and a store to write the result. The instruction
set also implements a number of important but complex instructions
with bizarre semantics; typically the translator will synthesize and
inject into the uop stream up to 8 uops for more complex instructions.


\section{\label{sec:OperationSizes}Operation Sizes}

Most x86-64 instructions can operate on 8, 16, 32 or 64 bits of a
given register. For 8-bit and 16-bit operations, only the low 8 or
16 bits of the destination register are actually updated; 32-bit and
64-bit operations are zero extended as with RISC architectures. As
a result, a dependency on the old destination register may be introduced
so merging can be performed. Fortunately, since x86 features destructive
overwrites of the destination register (i.e. the \texttt{\footnotesize rd}
and \texttt{\footnotesize ra} operands are the same), the \texttt{\small ra}
operand is generally already a dependency. Thus, the PTLsim uop encoding
reserves 2 bits to specify the operation size; the low bits of the
new result are automatically merged with the old destination value
(in \texttt{\footnotesize ra}) as part of the ALU logic. This applies
to the \texttt{\small mov} uop as well, allowing operations like {}``\texttt{\footnotesize mov
al,bl}'' in one uop. Loads do not support this mode, so loads into
8-bit and 16-bit registers must be followed by a separate \texttt{\footnotesize mov}
uop to truncate and merge the loaded value into the old destination
properly. Fortunately this is not necessary when the load-execute
form is used with 8-bit and 16-bit operations.
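
The merging rule can be sketched as follows. This is an illustration
of the semantics described above, not PTLsim's actual ALU code; the
function name \texttt{\footnotesize merge\_result} is hypothetical,
and the \texttt{\footnotesize size} argument follows the 0-3 encoding
for 1/2/4/8 bytes.

\begin{verbatim}
```cpp
#include <cstdint>

// Merge a new ALU result into the old destination value according to
// the 2-bit operation size (0..3 for 1/2/4/8 bytes).
inline uint64_t merge_result(uint64_t old_dest, uint64_t result, unsigned size) {
  switch (size) {
    case 0:  // 8-bit: only the low byte is replaced
      return (old_dest & ~0xffULL) | (result & 0xffULL);
    case 1:  // 16-bit: only the low word is replaced
      return (old_dest & ~0xffffULL) | (result & 0xffffULL);
    case 2:  // 32-bit: result is zero extended, old value is ignored
      return result & 0xffffffffULL;
    default: // 64-bit: result replaces the whole register
      return result;
  }
}
```
\end{verbatim}

Only sizes 0 and 1 actually depend on the old destination value, which
is why the extra dependency matters only for 8-bit and 16-bit operations.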

The x86 ISA defines some bizarre byte operations as a carryover from
the ancient 8086 architecture; for instance, it is possible to address
the second byte of many integer registers as a separate register (i.e.
as \texttt{\footnotesize ah}, \texttt{\footnotesize bh}, \texttt{\footnotesize ch},
\texttt{\footnotesize dh}). The \texttt{\footnotesize mask} uop is
used for handling this rare but important set of operations.


\section{\label{sub:FlagsManagement}Flags Management and Register Renaming}

Many x86 arithmetic instructions modify some or all of the processor's
numerous status and condition flag bits, but only 5 are relevant to
normal execution: Zero, Parity, Sign, Overflow, Carry. In accordance
with the well-known {}``ZAPS rule'', any instruction that updates
any of the Z/P/S flags updates all three flags, so in reality only
three flag entities need to be tracked: ZAPS, C, O ({}``ZAPS'' also
includes an Auxiliary flag not accessible by most modern user instructions;
it is irrelevant to the discussion below).

The x86 flag update semantics can hamper out of order execution, so
we use a simple and well known solution. The 5 flag bits are attached
to each result and physical register (along with \emph{invalid} and
\emph{waiting} bits used by some cores); these bits are then consumed
along with the actual result value by any consumers that also need
to access the flags. It should be noted that not all uops generate
all the flags as well as a 64-bit result, and some uops only generate
flags and no result data. 

The register renaming mechanism is aware of these semantics, and tracks
the latest x86 instruction in program order to update each set of
flags (ZAPS, C, O); this allows branches and other flag consumers
to directly access the result with the most recent program-ordered
flag updates yet still allows full out of order scheduling. To do
this, x86 processors maintain three separate rename table entries
for the ZAPS, CF, OF flags in addition to the register rename table
entry, any or all of which may be updated when uops are renamed. The
\texttt{\small TransOp} structure for each uop has a 3-bit \texttt{\small setflags}
field filled out during decoding in accordance with x86 semantics;
the \texttt{\small SETFLAG\_ZF}, \texttt{\small SETFLAG\_CF}, \texttt{\small SETFLAG\_OF}
bits in this field are used to determine which of the ZAPS, C, O flag
subsets to rename.

As mentioned above, any consumer of the flags needs to consult at
most three distinct sources: the last ZAPS producer, the Carry producer
and the Overflow producer. This conveniently fits into PTLsim's three-operand
uop semantics. Various special uops access the flags associated with
an operand rather than the 64-bit operand data itself. Branches always
take two flag sources, since in x86 this is enough to evaluate any
possible condition code combination (the \texttt{\footnotesize cond\_code\_to\_flag\_regs}
array provides this mapping).

Various ALU instructions consume only the flags part of a source physical
register; these include \texttt{\footnotesize addc} (add with carry),
\texttt{\footnotesize rcl}/\texttt{\footnotesize rcr}
(rotate through carry), \texttt{\footnotesize sel.}\texttt{\emph{\footnotesize cc}}
(select for conditional moves) and so on. Finally, the \texttt{\footnotesize collcc}
uop takes three operands (the latest producer of the ZAPS, CF and
OF flags) and merges the flag components of each operand into a single
flag set as its result.
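
The flag merge performed by \texttt{\footnotesize collcc} can be sketched
as below. The bit positions are the architectural x86 EFLAGS positions;
the constant names and the function \texttt{\footnotesize collcc\_flags}
are hypothetical and are not taken from the PTLsim sources.

\begin{verbatim}
```cpp
#include <cstdint>

// Architectural x86 EFLAGS bit positions.
constexpr uint16_t FLAG_CF = 1u << 0;
constexpr uint16_t FLAG_PF = 1u << 2;
constexpr uint16_t FLAG_AF = 1u << 4;
constexpr uint16_t FLAG_ZF = 1u << 6;
constexpr uint16_t FLAG_SF = 1u << 7;
constexpr uint16_t FLAG_OF = 1u << 11;
constexpr uint16_t FLAG_ZAPS = FLAG_ZF | FLAG_AF | FLAG_PF | FLAG_SF;

// Take the ZAPS group from the first source, CF from the second and
// OF from the third, and combine them into one flag set.
inline uint16_t collcc_flags(uint16_t zaps_src, uint16_t cf_src, uint16_t of_src) {
  return static_cast<uint16_t>((zaps_src & FLAG_ZAPS) |
                               (cf_src & FLAG_CF) |
                               (of_src & FLAG_OF));
}
```
\end{verbatim}

Each source contributes only its own flag group; any other flag bits
attached to that source are ignored.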

PTLsim also provides compound compare-and-branch uops (\texttt{\footnotesize br.sub.cc}
and \texttt{\footnotesize br.and.cc}); these are currently used mostly
in microcode, but a core could dynamically merge \texttt{\footnotesize CMP}
or \texttt{\footnotesize TEST} and \texttt{\footnotesize Jcc} instructions
into these uops; this is exactly what the Intel Core 2 and a few research
processors already do.


\section{x86-64}

The 64-bit x86-64 instruction set is a fairly straightforward extension
of the 32-bit IA-32 (x86) instruction set. The x86-64 ISA was announced
by AMD in 2000 and first implemented in its K8 microarchitecture; the same instructions
were subsequently adopted by Intel under a different name ({}``EM64T'')
several years later. In addition to extending all integer registers
and ALU datapaths to 64 bits, x86-64 also provides a total of 16 integer
general purpose registers and 16 SSE (vector floating and fixed point)
registers. It also introduced several 64-bit address space simplifications,
including RIP-relative addressing and corresponding new addressing
modes, and eliminated a number of legacy features from 64-bit mode,
including segmentation, BCD arithmetic, some byte register manipulation,
etc. Limited forms of segmentation are still present to allow thread
local storage and mark code segments as 64-bit. In general, the encodings
of x86-64 and x86 are very similar, with 64-bit mode adding a one
byte REX prefix to specify additional bits for source and destination
register indexes and the 64-bit operand size. As a result, both variants
can be decoded by similar decoding logic into a common set of uops.


\section{\label{sub:UnalignedLoadsAndStores}Unaligned Loads and Stores}

Compared to RISC architectures, the x86 architecture is infamous for
its relatively widespread use of unaligned memory operations; any
implementation must efficiently handle this scenario. Fortunately,
analysis shows that unaligned accesses are rarely in the performance
intensive parts of a modern program (with the exception of certain
media processing algorithms). Once a given load or store is known
to frequently have an unaligned address, it can be preemptively split
into two aligned loads or stores at decode time. PTLsim does this
by initially causing all unaligned loads and stores to raise an \texttt{\footnotesize UnalignedAccess}
internal exception, forcing a pipeline flush. At this point, the special
\texttt{\footnotesize unaligned} bit is set for the problem load or
store uop in its translated basic block representation. The next time
the offending uop is encountered, it will be split into two parts
very early in the pipeline.

PTLsim includes special uops to handle loads and stores split into
two in this manner. The \texttt{\footnotesize ld.lo} uop rounds its
effective address $A$ down to the nearest 64-bit boundary, $\left\lfloor A\right\rfloor $,
and performs the load. The \texttt{\footnotesize ld.hi}
uop loads the following 64-bit block at $\left\lfloor A+8\right\rfloor $,
and takes as its third (\texttt{\footnotesize rc}) operand the first (\texttt{\footnotesize ld.lo})
load's result. The two loads are concatenated into a 128-bit word
and the final unaligned data is extracted. Stores are handled in a
similar manner, with \texttt{\footnotesize st.lo} and \texttt{\footnotesize st.hi}
rounding down and up to store parts of the unaligned value in adjacent
64-bit blocks. Depending on the core model, these unaligned load or
store pairs access separate store buffers for each half as if they
were independent.
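
A minimal sketch of the split load, assuming a little-endian byte array
as memory and a full 64-bit access (smaller operation sizes are ignored
here). The function names mirror the uop names but are otherwise hypothetical;
this is not PTLsim's implementation.

\begin{verbatim}
```cpp
#include <cstdint>

// Read one aligned 64-bit block, assembling bytes in little-endian order
// so the result does not depend on the host byte order.
inline uint64_t load_block(const uint8_t* mem, uint64_t aligned_addr) {
  uint64_t v = 0;
  for (int i = 7; i >= 0; --i) v = (v << 8) | mem[aligned_addr + i];
  return v;
}

// ld.lo: load the 64-bit block containing the first byte of the access.
inline uint64_t ld_lo(const uint8_t* mem, uint64_t addr) {
  return load_block(mem, addr & ~7ULL);
}

// ld.hi: load the following block, then extract the unaligned 64-bit
// value from the 128-bit concatenation of both blocks.
inline uint64_t ld_hi(const uint8_t* mem, uint64_t addr, uint64_t lo) {
  uint64_t hi = load_block(mem, (addr & ~7ULL) + 8);
  unsigned shift = static_cast<unsigned>(8 * (addr & 7));
  if (shift == 0) return lo;  // aligned: the low block is the whole value
  return (lo >> shift) | (hi << (64 - shift));
}
```
\end{verbatim}

The explicit test for a zero shift avoids an undefined 64-bit shift
by 64 in C++; hardware would simply select the low block in that case.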


\section{Repeated String Operations}

The x86 architecture allows for repeated string operations, including
block moves, stores, compares and scans. The iteration count of these
repeated operations depends on a combination of the \texttt{\footnotesize rcx}
register and the flags set by the repeated operation (e.g. compare).
To translate these instructions, PTLsim treats the \texttt{\footnotesize rep
xxx} instruction as a single basic block; any basic block in progress
before the repeat instruction is terminated and the repeat is decoded
as a separate basic block. To handle the unusual case where the repeat
count is zero, a check uop (see below) is inserted at the top of the
loop; if the check fires, PTLsim simply bypasses the entire
block.


\section{Checks and SkipBlocks}

PTLsim includes special uops (\texttt{\footnotesize chk.and.cc}, \texttt{\footnotesize chk.sub.cc})
that compare two values or condition codes and cause a special internal
exception if the result is true. The \texttt{\footnotesize SkipBlock}
internal exception generated by these uops tells the core to literally
annul all uops in this instruction, dynamically turning it into a
nop. As described above, this is useful for string operations where
a zero count causes all of the instruction's side effects to be annulled.
Similarly, the \texttt{\footnotesize AssistCheck} internal exception
dynamically turns the instruction into an assist, for those cases
where certain rare conditions may require microcode intervention more
complex than can be inlined into the decoded instruction stream.
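
The zero-count case for string operations can be illustrated with the
following sketch of a \texttt{\footnotesize rep stosb}. It is a functional
model only: the \texttt{\footnotesize Outcome} and \texttt{\footnotesize RepState}
types are hypothetical, the direction flag is assumed to be clear, and
the whole repeat is modeled as one C++ loop rather than as a basic block
that branches back to itself.

\begin{verbatim}
```cpp
#include <cstdint>

enum class Outcome { Executed, SkipBlock };

// Architectural registers used by rep stosb (direction flag assumed clear).
struct RepState {
  uint64_t rcx;  // remaining iteration count
  uint64_t rdi;  // destination index into mem
};

// Model of "rep stosb" guarded by a check at the top of the block.
inline Outcome rep_stosb(RepState& st, uint8_t* mem, uint8_t al) {
  // Check uop: a zero count raises SkipBlock, annulling every uop in the
  // instruction so that no architectural state changes.
  if (st.rcx == 0) return Outcome::SkipBlock;
  while (st.rcx != 0) {
    mem[st.rdi++] = al;
    --st.rcx;
  }
  return Outcome::Executed;
}
```
\end{verbatim}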


\section{\label{sec:ShiftRotateProblems}Shifts and Rotates}

The shift and rotate instructions have some of the most bizarre semantics
in the entire x86 instruction set: they may or may not modify a subset
of the flags depending on the rotation count operand, which we may
not even know until the instruction issues. For fixed shifts and rotates,
these semantics can be preserved by the uops generated; however, variable
rotations are more complex. The \texttt{\footnotesize collcc} uop
is put to use here to collect all flags; the collected result is then
fed into the shift or rotate uop as its \texttt{\footnotesize rc}
operand; the uop then replicates the precise x86 behavior (including
rotates using the carry flag) according to its input operands.
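
To see why the old flags must be an input, consider this simplified
model of a 64-bit \texttt{\footnotesize SHL} that tracks only the carry
flag. The names \texttt{\footnotesize ShiftResult} and \texttt{\footnotesize shl64}
are hypothetical; the masking of the count to 6 bits and the rule that
a zero count leaves the flags untouched follow the x86 architecture.

\begin{verbatim}
```cpp
#include <cstdint>

struct ShiftResult {
  uint64_t value;
  bool cf;  // carry flag after the shift
};

// Simplified 64-bit SHL: only the carry flag is modeled.
inline ShiftResult shl64(uint64_t x, unsigned count, bool cf_in) {
  count &= 63;                         // x86 masks the count in 64-bit mode
  if (count == 0) return {x, cf_in};   // zero count: flags pass through unchanged
  bool cf = (x >> (64 - count)) & 1;   // last bit shifted out of the top
  return {x << count, cf};
}
```
\end{verbatim}

Because the outgoing carry depends on the incoming carry whenever the
masked count is zero, the uop needs the previous flags as a source operand
even though it is nominally a flag producer.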


\section{SSE Support}

PTLsim provides full support for SSE and SSE2 vector floating point
and fixed point, in both scalar and vector mode. As is done in the
AMD K8 and Pentium 4, each SSE operation on a 128-bit vector is split
into two 64-bit halves; each half (possibly consisting of a 64-bit
load and one or more FPU operations) is scheduled independently. Because
SSE instructions do not set flags like x86 integer instructions, architectural
state management can be restricted to the 16 128-bit SSE registers
(represented as 32 paired 64-bit registers). The \texttt{\footnotesize mxcsr}
(media extensions control and status register) is represented as an
internal register that is only read and written by serializing microcode;
since the exception and status bits are {}``sticky'' (i.e. only
set, never cleared by hardware), this has no effect on out of order
execution. The processor's floating point units can operate in either
64-bit IEEE double precision mode or on two parallel 32-bit single
precision values.

PTLsim also includes a variety of vector integer uops used to construct
SSE2/MMX operations, including packed arithmetic and shuffles.


\section{\label{sub:x87-Floating-Point}x87 Floating Point}

The legacy x87 floating point architecture is the bane of all x86
processor vendors' existence, largely because its stack based nature
makes out of order processing so difficult. While there are certainly
ways of translating stack based instruction sets into flat addressing
for scheduling purposes, we do not do this. Fortunately, following
the Pentium III and AMD Athlon's introduction, x87 is rapidly headed
for planned obsolescence; most major applications released within
the last few years now use SSE instructions for their floating point
needs either exclusively or in all performance critical parts. To
this end, even Intel has relegated x87 support on the Pentium 4 and
Core 2 to a separate low performance legacy unit, and AMD has restricted
x87 use in 64-bit mode. For this reason, PTLsim translates legacy
x87 instructions into a serialized, program ordered and emulated form;
the hardware does not contain any x87-style 80-bit floating point
registers (all floating point hardware is 32-bit and 64-bit IEEE compliant).
We have noticed little to no performance problem from this approach
when examining typical binaries, which rarely if ever still use x87
instructions in compute-intensive code.


\section{Floating Point Unavailable Exceptions}

The x86 architecture specifies a mode in which all floating point
operations (SSE and x87) will trigger a Floating Point Unavailable
exception (\texttt{\footnotesize EXCEPTION\_x86\_fpu\_not\_avail},
vector 0x7) if the \texttt{\footnotesize TS} (task switched) bit in
control register \texttt{\footnotesize CR0} is set. This allows the
kernel to defer saving the floating point registers and state of the
previously scheduled thread until that state is actually modified,
thus speeding up context switches. PTLsim supports this feature by
requiring any commits to the floating point state (SSE XMM registers,
x87 registers or any floating point related control or status registers)
to check the \texttt{\footnotesize uop.is\_sse} and \texttt{\footnotesize uop.is\_x87}
bits in the uop. If either of these is set while \texttt{\footnotesize CR0.TS}
is set, the pipeline must be flushed
and redirected into the kernel so it can save the FPU state.
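
The commit-time test amounts to the following predicate. The \texttt{\footnotesize UopInfo}
structure and the function name are hypothetical; only the \texttt{\footnotesize is\_sse}
and \texttt{\footnotesize is\_x87} bit names come from the description
above.

\begin{verbatim}
```cpp
// Hypothetical view of the two uop bits consulted at commit.
struct UopInfo {
  bool is_sse;
  bool is_x87;
};

// True when committing this uop must instead raise the Floating Point
// Unavailable exception (vector 0x7) because CR0.TS is set.
inline bool fpu_not_avail_at_commit(const UopInfo& uop, bool cr0_ts) {
  return cr0_ts && (uop.is_sse || uop.is_x87);
}
```
\end{verbatim}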


\section{Assists}

Some operations are too complex to inline directly into the uop stream.
To perform these instructions, a special uop (\texttt{\footnotesize brp}:
branch private) is executed to branch to an \emph{assist} function
implemented in microcode. In PTLsim, some assist functions are implemented
as regular C/C++ or assembly language code when they interact with
the rest of the virtual machine. Examples of instructions requiring
assists include system calls, interrupts, some forms of integer division,
handling of rare floating point conditions, CPUID, MSR reads/writes,
various x87 operations, any serializing instructions, etc. These are
listed in the \texttt{\footnotesize ASSIST\_xxx} enum found in \texttt{\footnotesize decode.h}.

Prior to entering an assist, uops are generated to load the \texttt{\footnotesize REG\_selfrip}
and \texttt{\footnotesize REG\_nextrip} internal registers with the
RIP of the instruction itself and the RIP after its last byte, respectively.
This lets the assist microcode correctly update RIP before returning,
or signal a fault on the instruction if needed. Several other assist
related registers, including \texttt{\footnotesize REG\_ar1}, \texttt{\footnotesize REG\_ar2},
\texttt{\footnotesize REG\_ar3}, are used to store parameters passed
to the assist. These registers are not architecturally visible, but
must be renamed and separately maintained by the core as if they were
part of the user-visible state.

While the exact behavior depends on the core model (out of order,
SMT, sequential, etc), generally when the processor fetches an assist
(\texttt{\footnotesize brp} uop), the frontend pipeline is stalled
and execution waits until the \texttt{\footnotesize brp} commits,
at which point an assist function within PTLsim is called. This is
necessary because assists are not subject to the out of order execution
mechanism; they directly update the architectural registers on their
own. In a real processor there are slightly more efficient ways of
doing this without flushing the pipeline; however, in PTLsim assists
are sufficiently rare that the performance impact is negligible and
this approach significantly reduces complexity. For the out of order
core, the exact mechanism used is described in Section \ref{sec:PipelineFlushesAndBarriers}.


\chapter{\label{sec:BasicBlockCache}Decoder Architecture and Basic Block
Cache}


\section{Basic Block Cache}

As described in Section \ref{sec:UopIntro}, x86 instructions are
decoded into transops prior to actual execution by the core. To achieve
high performance, PTLsim maintains a \emph{basic block cache} (BB
cache) containing the program ordered translated uop (\emph{transop})
sequence for previously decoded basic blocks in the program. Each
basic block (\texttt{\footnotesize BasicBlock} structure) consists
of up to 64 uops and is terminated by either a control flow operation
(conditional, unconditional, indirect branch) or a barrier operation,
i.e. a microcode assist (including system calls and serializing instructions).


\section{\label{sec:RIPVirtPhys}Identifying Basic Blocks}

In a userspace only simulator, the RIP of a basic block's entry point
(plus a few other attributes described below) serves to uniquely identify
that basic block, and can be used as a key in accessing the basic
block cache. In a full system simulator, the BB cache must be indexed
by much more than just the virtual address, because of potential virtual
page aliasing and the need to persistently cache translations across
context switches. The following fields, in the \emph{RIPVirtPhys}
structure, are required to correctly access the BB cache in any full
system simulator or binary translation system (128 bits total):

\begin{itemize}
\item \texttt{\footnotesize rip:} Virtual address of first instruction in
BB (48 bits), since embedded RIP-relative constants and branch encodings
depend on this. Modern operating systems map shared libraries and binaries at the
same addresses every time, so translation caching remains effective
across runs.
\item \texttt{\footnotesize mfnlo:} MFN (Machine Frame Number, i.e. physical
page frame number) of first byte in BB (28 bits), since we need to
handle self modifying code invalidations based on physical addresses
(because of possible virtual page aliasing in multiple page tables)
\item \texttt{\footnotesize mfnhi:} MFN of last byte in BB (28 bits), since
a single x86 basic block can span up to two pages. In pathological
cases, it is possible to create two page tables that both map the
same MFN X at virtual address V, but map different MFNs at virtual
address V+4096. If an instruction crosses this page boundary, the
meaning of the instruction bytes on the second page will be different;
hence we must take into account both physical pages to look up the
correct translation.
\item Context info (up to 24 bits), since the uops generated depend on the
current CPU mode and CS descriptor settings

\begin{itemize}
\item \texttt{\footnotesize use64:} 32-bit or 64-bit mode? (encoding differences)
\item \texttt{\footnotesize kernel:} Kernel or user mode?
\item \texttt{\footnotesize df:} EFLAGS status (direction flag, etc)
\item Other info (e.g. segmentation assumptions, etc.)
\end{itemize}
\end{itemize}
The basic block cache is always indexed using an \texttt{\footnotesize RIPVirtPhys}
structure instead of a simple RIP. To do this, the \texttt{\footnotesize RIPVirtPhys.rip}
field is set to the desired RIP, then \texttt{\footnotesize RIPVirtPhys.update(ctx)}
is called to translate the virtual address onto the two physical page
MFNs it could potentially span (assuming the basic block crosses two
pages).

Notice that the other attribute bits (\texttt{\footnotesize use64},
\texttt{\footnotesize kernel}, \texttt{\footnotesize df}) mean that
two distinct basic blocks may be decoded from the exact same RIP on
the same physical page(s), yet the uops in each translated basic block
will be different because the two basic blocks were translated in
a different context (relative to these attribute bits). This is especially
important for x86 string move/compare/store/load/scan instructions
(\texttt{\footnotesize MOVSB}, \texttt{\footnotesize CMPSB}, \texttt{\footnotesize STOSB},
\texttt{\footnotesize LODSB}, \texttt{\footnotesize SCASB}), since
the correct increment constants depend on the state of the direction
flag in the context in which the BB was used. Similarly, if user mode
code contains a supervisor-only opcode, the decoder produces code to call the
general protection fault handler instead of the real
uops, which are generated only in kernel mode.
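
The effect of the context bits on lookup can be sketched with an ordinary
ordered map. The \texttt{\footnotesize BBKey} structure below is a hypothetical
stand-in that uses plain fields instead of the packed 128-bit layout,
and the mapped value is just an illustrative basic block id.

\begin{verbatim}
```cpp
#include <cstdint>
#include <map>
#include <tuple>

// Hypothetical, unpacked stand-in for the lookup key described above.
struct BBKey {
  uint64_t rip;    // virtual address of the first instruction
  uint32_t mfnlo;  // MFN of the first byte of the basic block
  uint32_t mfnhi;  // MFN of the last byte of the basic block
  bool use64;      // decoded in 64-bit mode?
  bool kernel;     // decoded in kernel mode?
  bool df;         // direction flag at decode time

  bool operator<(const BBKey& o) const {
    return std::tie(rip, mfnlo, mfnhi, use64, kernel, df) <
           std::tie(o.rip, o.mfnlo, o.mfnhi, o.use64, o.kernel, o.df);
  }
};

// Maps a key to an illustrative basic block id.
using BBCache = std::map<BBKey, int>;
```
\end{verbatim}

Two keys with the same RIP and MFNs but different \texttt{\footnotesize df}
values compare as distinct, so each context gets its own translation.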


\section{\label{sec:InvalidTranslations}Invalid Translations}

The \texttt{\footnotesize BasicBlockCache.translate(ctx, rvp)} function
\emph{always} returns a \texttt{\footnotesize BasicBlock} object,
even if the specified RIP was on an invalid page or some of the instruction
bytes were invalid. When decoding cannot continue for some reason,
the decoder simply outputs a microcode branch to one of the following
assists:

\begin{itemize}
\item \texttt{\footnotesize ASSIST\_INVALID\_OPCODE} when the opcode or
instruction operands are invalid relative to the current context.
\item \texttt{\footnotesize ASSIST\_EXEC\_PAGE\_FAULT} when the specified
RIP falls on an invalid page. This means a page is marked as not present
in the current page table at the time of decoding, or the page is
present but has its NX (no execute) bit set in the page table entry.
The \texttt{\footnotesize EXEC\_PAGE\_FAULT} assist is also generated
when the page containing the RIP itself is valid, but part of an instruction
extends beyond that page onto an invalid page. The decoder tries to
decode as many instruction bytes as possible, but will insert an \texttt{\footnotesize EXEC\_PAGE\_FAULT}
assist whenever it determines, based on the bytes already decoded,
that the remainder of the instruction would fall on the invalid page.
\item \texttt{\footnotesize ASSIST\_GP\_FAULT} when attempting to decode
a restricted kernel-only opcode while running in user mode.
\end{itemize}
Before redirecting execution to the kernel's exception handler, the
\texttt{\footnotesize EXEC\_PAGE\_FAULT} microcode verifies that the
page in question is still invalid. This avoids a spurious page fault
in the case where an instruction was originally decoded on an invalid
page, but the page tables were updated after the translation was first
made such that the page is now valid. When this is the case, all bogus
basic blocks on the page (which were decoded into a call to \texttt{\footnotesize EXEC\_PAGE\_FAULT})
must be invalidated, allowing a correct translation to be made now
that the page is valid. The page at the virtual address after the
page in question may also need to be invalidated in the case where
some instruction bytes cross the page boundary.


\section{\label{sec:SelfModifyingCode}Self Modifying Code}

In x86 processors, the translation process is considerably more complex,
because of self modifying code (SMC) and its variants. Specifically,
the instruction bytes of basic blocks that have already been translated
and cached may be overwritten; these old translations must be discarded.
The x86 architecture guarantees that all code modifications will be
visible immediately after the instruction making the modification;
unlike other architectures, no {}``instruction cache flush'' operation
is provided. Several kinds of SMC must be handled correctly:

\begin{itemize}
\item Classical SMC: stores currently in the pipeline overwrite other instructions
that have already been fetched into the pipeline and even speculatively
executed out of order;
\item Indirect SMC: stores write to a page on which previously translated
code used to reside, but that page is now being reused for unrelated
data or new code. This case frequently arises in operating system
kernels when pages are swapped in and out from disk.
\item Cross-modifying SMC: in a multiprocessor system, one processor overwrites
instructions that are currently in the pipeline on some other core.
The x86 standard is ambiguous here; technically no pipeline flush
and invalidate is required; instead, the cache coherence mechanism
and software mutexes are expected to prevent this case.
\item External SMC: an external device uses direct memory access (DMA) to
overwrite the physical DRAM page containing previously translated
code. In theory, this can happen while the affected instructions are
in the pipeline, but in practice no operating system would ever allow
this. However, we still must invalidate any translations on the target
page to prevent them from being looked up far in the future.
\end{itemize}
To deal with all these forms of SMC, PTLsim associates a {}``dirty''
bit with every physical page (this is unrelated to the {}``dirty''
bit in user-visible page table entries). Whenever the first uop in
an x86 instruction (i.e. the {}``SOM'', start-of-macro-op uop) commits,
the current context is used to translate its RIP into the physical
page MFN on which it resides, as described in Section \ref{sec:RIPVirtPhys}.
If the instruction's length in bytes causes it to overlap onto a second
page, that high MFN is also looked up (using the virtual address \emph{rip}
+ 4096). If the dirty bits for either the low or high MFN are set,
this means the instruction bytes may have been modified sometime after
the time they were last translated and added to the basic block cache.
In this case, the pipeline must be flushed, and all basic blocks on
the target MFN (and possibly the overlapping high MFN) must be invalidated
before clearing the dirty bit. Technically the RIP-to-physical translation
would be done in the instruction fetch stage in most core models,
then simply stored as an \texttt{\footnotesize RIPVirtPhys} structure
inside the uop until commit time.

The dirty bit can be set by several events. Obviously any store uops
will set the dirty bit (thus handling the classical, indirect and
cross-modifying cases), but notice that this bit is not checked again
until the first uop in the \emph{next} x86 instruction. This behavior
is required because it is perfectly legal for an x86 store to overwrite
its own instruction bytes, but this does not become visible until
the same instruction executes a second time (otherwise an infinite
loop of invalidations would occur). Microcoded x86 instructions implemented
by PTLsim itself set dirty bits when their constituent internal stores
commit. Finally, DMA transfers and external writes also set the dirty
bit of any pages touched by the DMA operation.

The dirty bit is only cleared when all translated basic blocks are
invalidated on a given page, and it remains clear until the first
write to that page. However, no action is taken when additional basic
blocks are decoded from a page already marked as dirty. This may seem
counterintuitive, but it is necessary to avoid deadlock: if the page
were invalidated and retranslated at fetch time, future stages in
a long pipeline could potentially still have references to unrelated
basic blocks on the page being invalidated. Hence, all invalidations
are checked and processed only at commit time.
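
The dirty bit protocol can be summarized by the following sketch. The
\texttt{\footnotesize SmcTracker} class and its members are hypothetical,
basic blocks are represented by plain integer ids, and the pipeline flush
itself is reduced to a boolean return value.

\begin{verbatim}
```cpp
#include <cstdint>
#include <initializer_list>
#include <unordered_map>
#include <unordered_set>
#include <vector>

struct SmcTracker {
  std::unordered_set<uint64_t> dirty;                          // MFNs with dirty bit set
  std::unordered_map<uint64_t, std::vector<int>> bbs_on_mfn;   // cached BB ids per MFN

  // Any committed store (or DMA write) marks the target page dirty.
  void on_store(uint64_t mfn) { dirty.insert(mfn); }

  // Called when a SOM uop commits. Returns true if the pipeline must be
  // flushed because the instruction bytes may have been modified.
  bool on_som_commit(uint64_t mfnlo, uint64_t mfnhi) {
    bool flush = false;
    for (uint64_t mfn : {mfnlo, mfnhi}) {
      if (dirty.erase(mfn)) {      // dirty bit was set: clear it...
        bbs_on_mfn.erase(mfn);     // ...after invalidating all BBs on the page
        flush = true;
      }
    }
    return flush;
  }
};
```
\end{verbatim}

Note that \texttt{\footnotesize on\_store()} never invalidates anything
by itself; all invalidation work is deferred to the next SOM commit,
matching the commit-time rule described above.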

Other binary translation based software and hardware \cite{TransmetaPatent.TBit,VMware,QEMU,Simics,SimNow}
have special mechanisms for write protecting physical pages, such
that when a page with translations is first written by stores or DMA,
the system immediately invalidates all translations on that page.
Unfortunately, this scheme has a number of disadvantages. First, patents
cover its implementation \cite{TransmetaPatent.SelfRevalTrans,TransmetaPatent.SubPageTBit,TransmetaPatent.TBit},
which we would like to avoid. In addition, our design eliminates forced
invalidations when the kernel frees up a page containing code that's
immediately overwritten with normal user data (a very common pattern
according to our studies). If that page is never executed again, any
translations from it will be discarded in the background by the LRU
mechanism, rather than interrupting execution to invalidate translations
that will never be used again anyway. Fortunately, true classical
SMC is very rare in modern x86 code, in large part because major microprocessors
have slapped a huge penalty on its use (particularly in the case of
the Pentium 4 and Transmeta processors, both of which store translated
uops in a cache similar to PTLsim's basic block cache).


\section{\label{sec:BasicBlockReclaim}Memory Management of the Basic Block
Cache}

The PTLsim memory manager (in \texttt{\footnotesize mm.cpp}, see Section
\ref{sec:MemoryManager} for details) implements a reclaim mechanism
in which other subsystems register functions that get called when
an allocation fails. The basic block cache registers a callback, \texttt{\footnotesize bbcache\_reclaim()},
which in turn calls \texttt{\footnotesize BasicBlockCache::reclaim()} to invalidate
and free basic blocks when PTLsim runs out of memory.

The algorithm used to do this is a pseudo-LRU design. Every basic
block has a \texttt{\footnotesize lastused} field that gets updated
with the current cycle number whenever \texttt{\footnotesize BasicBlock::use(sim\_cycle)}
is called (for instance, in the fetch stage of a core model). The
reclaim algorithm first goes through all basic blocks and calculates the
oldest, average and newest \texttt{\footnotesize lastused} cycles.
A second pass then invalidates any basic blocks whose \texttt{\footnotesize lastused}
value falls below this average; typically around half of all basic blocks fall
in the least recently used category. This strategy has proven very
effective in freeing up a large amount of space without discarding
currently hot basic blocks.
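
The two passes described above can be sketched as follows. This is a
minimal illustration, not the code in PTLsim itself: the \texttt{\footnotesize Block}
structure, its \texttt{\footnotesize valid} flag and the function name are
hypothetical stand-ins, and only the average (not the oldest and newest
cycle) is computed.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a basic block; only the fields needed here.
struct Block {
  uint64_t lastused;  // cycle number of the most recent use
  bool valid;         // false once the block has been invalidated
};

// Pass 1 computes the average lastused cycle over all valid blocks.
// Pass 2 invalidates every valid block strictly older than that average.
// Returns the number of blocks invalidated.
inline int reclaim_below_average(std::vector<Block>& blocks) {
  uint64_t sum = 0;
  uint64_t count = 0;
  for (const Block& b : blocks) {
    if (b.valid) { sum += b.lastused; count++; }
  }
  if (count == 0) return 0;
  uint64_t average = sum / count;

  int freed = 0;
  for (Block& b : blocks) {
    if (b.valid && b.lastused < average) { b.valid = false; freed++; }
  }
  return freed;
}
```

With four blocks last used at cycles 100, 200, 300 and 400, the average
is 250, so the two oldest blocks are invalidated and the two most recently
used blocks survive.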

Each basic block also has a reference counter, \texttt{\footnotesize refcount},
to record how many pointers or references to that basic block currently
exist anywhere inside PTLsim (especially in the pipelines of core
models). The \texttt{\footnotesize BasicBlock::acquire()} and \texttt{\footnotesize release()}
methods adjust this counter. Core models should acquire a basic block
once for every uop in the pipeline within that basic block; the basic
block is released as uops commit or are annulled. Since basic blocks
may be speculatively translated in the fetch stage of core models,
this guarantees that live basic blocks currently in flight are never
freed until they actually leave the pipeline.
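
The reference counting discipline can be illustrated with a small sketch.
The structure below is a hypothetical stand-in for the relevant part
of \texttt{\footnotesize BasicBlock}; in particular, the return value of
\texttt{\footnotesize release()} and the \texttt{\footnotesize reclaimable()}
helper are assumptions made for the example.

```cpp
#include <cassert>

// Hypothetical stand-in for the reference-counting part of a basic block.
struct RefCountedBlock {
  int refcount = 0;

  // Called once for every uop from this block that enters the pipeline.
  void acquire() { refcount++; }

  // Called as each uop commits or is annulled.
  // Returns true when the last outstanding reference has been dropped.
  bool release() {
    assert(refcount > 0);
    return --refcount == 0;
  }

  // A reclaim pass would skip any block for which this is false.
  bool reclaimable() const { return refcount == 0; }
};
```

A block with three uops in flight only becomes eligible for reclaim after
the third \texttt{\footnotesize release()} call.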


\chapter{PTLsim Support Subsystems}


\section{\label{sec:UopImplementations}Uop Implementations}

PTLsim provides implementations for all uops in the \texttt{\footnotesize uopimpl.cpp}
file. C++ templates are combined with gcc's smart inline assembler
type selection constraints to translate all possible permutations
(sizes, condition codes, etc) of each uop into highly optimized code.
In many cases, a real x86 instruction is used at the core of each
corresponding uop's implementation; code after the instruction just
captures the generated x86 condition code flags, rather than having
to manually emulate the same condition codes ourselves. The code implementing
each uop is then called from elsewhere in the simulator whenever that
uop must be executed. Note that loads and stores are implemented elsewhere,
since they are too dependent on the specific core model to be expressed
in this generic manner.

An additional optimization, called \emph{synthesis}, is also used
whenever basic blocks are translated. Each uop in the basic block
is mapped to the address of a native PTLsim function in \texttt{\footnotesize uopimpl.cpp}
implementing the semantics of that uop; this function pointer is stored
in the \texttt{\footnotesize synthops{[}]} array of the \texttt{\footnotesize BasicBlock}
structure. This saves us from having to use a large jump table later
on, and can map uops to pre-compiled templates that avoid nearly all
further decoding of the uop during execution.
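
The idea behind synthesis can be sketched as follows. The opcode set,
the handler signature and the function names are all hypothetical and
far simpler than the real templates in \texttt{\footnotesize uopimpl.cpp};
the point is only that the opcode-to-function lookup happens once, at
translation time, rather than on every execution.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical uop encoding and handler signature, for illustration only.
enum Opcode { OP_ADD, OP_SUB, OP_AND, OP_COUNT };
struct Uop { Opcode opcode; };
typedef uint64_t (*uopimpl_func_t)(uint64_t ra, uint64_t rb);

inline uint64_t impl_add(uint64_t ra, uint64_t rb) { return ra + rb; }
inline uint64_t impl_sub(uint64_t ra, uint64_t rb) { return ra - rb; }
inline uint64_t impl_and(uint64_t ra, uint64_t rb) { return ra & rb; }

// Done once when the basic block is translated: resolve each uop to the
// address of its handler and store the pointers alongside the block.
inline std::vector<uopimpl_func_t> synthesize(const std::vector<Uop>& uops) {
  static const uopimpl_func_t table[OP_COUNT] = { impl_add, impl_sub, impl_and };
  std::vector<uopimpl_func_t> synthops;
  synthops.reserve(uops.size());
  for (const Uop& u : uops) synthops.push_back(table[u.opcode]);
  return synthops;
}
```

At execution time the core simply calls \texttt{\footnotesize synthops{[}i](ra, rb)}
for the i-th uop, with no further dispatch on the opcode.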


\section{Configuration Parser}

PTLsim supports a wide array of command line or scriptable configuration
options, described in Section \ref{sec:ConfigurationOptions}. The
configuration parser engine (used by both PTLsim itself and utilities
like PTLstats) is in \texttt{\footnotesize config.cpp} and \texttt{\footnotesize config.h}.
For PTLsim itself, each option is declared in three places:

\begin{itemize}
\item \texttt{\footnotesize ptlsim.h} declares the \texttt{\footnotesize PTLsimConfig}
structure, which is available from anywhere as the \texttt{\footnotesize config}
global variable. The fields in this structure must be of one of the
following types: \texttt{\footnotesize W64} (64-bit integer), \texttt{\footnotesize double}
(floating point), \texttt{\footnotesize bool} (on/off boolean), or
\texttt{\footnotesize stringbuf} (for text parameters).
\item \texttt{\footnotesize ptlsim.cpp} declares the \texttt{\footnotesize PTLsimConfig::reset()}
function, which sets each option to its default value.
\item \texttt{\footnotesize ptlsim.cpp} declares the \texttt{\footnotesize ConfigurationParser<PTLsimConfig>::setup()}
template function, which registers all options with the configuration
parser.
\end{itemize}
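
The three-part pattern above (structure, defaults, registration) can
be sketched in miniature. Everything in this example is hypothetical:
the option names, the \texttt{\footnotesize ExampleConfig} structure and
the toy \texttt{\footnotesize ExampleParser} are stand-ins, and the real
\texttt{\footnotesize ConfigurationParser} interface in \texttt{\footnotesize config.h}
may look quite different.

```cpp
#include <cstdint>
#include <map>
#include <string>

typedef uint64_t W64;

// (1) Option storage. Field names are hypothetical examples.
struct ExampleConfig {
  W64 example_width;
  bool example_verbose;
  void reset();
};

// (2) Defaults for every option.
inline void ExampleConfig::reset() {
  example_width = 4;
  example_verbose = false;
}

// (3) Registration: bind option names to fields. This toy parser is a
// stand-in for illustration, not PTLsim's ConfigurationParser.
struct ExampleParser {
  std::map<std::string, W64*> ints;
  std::map<std::string, bool*> flags;

  void setup(ExampleConfig& c) {
    ints["example-width"] = &c.example_width;
    flags["example-verbose"] = &c.example_verbose;
  }

  // Returns false for option names that were never registered.
  bool set(const std::string& name, const std::string& value) {
    if (ints.count(name)) { *ints[name] = std::stoull(value); return true; }
    if (flags.count(name)) { *flags[name] = (value != "0"); return true; }
    return false;
  }
};
```

Forgetting any one of the three steps shows up quickly in this arrangement:
an unregistered option is rejected by the parser, and a field without
a default is left uninitialized.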

\section{\label{sec:MemoryManager}Memory Manager}


\subsection{Memory Pools}

PTLsim uses its own custom memory manager for all allocations, given
its specialized constraints (particularly for PTLsim/X, which runs
on the bare hardware). The PTLsim memory manager (in \texttt{\footnotesize mm.cpp})
uses three key structures.

The \emph{page allocator} allocates spans of one or more virtually
contiguous pages. In userspace-only PTLsim, the page allocator doesn't
really exist: it simply calls \texttt{\footnotesize mmap()} and \texttt{\footnotesize munmap()},
letting the host kernel do the actual allocation. In the full system
PTLsim/X, the page allocator actually works with physical pages and
is based on the extent allocator (see below). The \texttt{\footnotesize ptl\_alloc\_private\_pages()}
and \texttt{\footnotesize ptl\_free\_private\_pages()} functions should
be used to directly allocate page-aligned memory (or individual pages)
from this pool.

The \emph{general allocator} uses the \texttt{\footnotesize ExtentAllocator}
template class to allocate large objects (greater than page sized)
from a pool of free extents. This allocator automatically merges free
extents and can find a matching free block in O(1) time for any allocation
size. The general allocator obtains large chunks of memory (typically
64 KB at once) from the page allocator, then sub-divides these extents
into individual allocations.

The \emph{slab allocator} maintains a pool of page-sized {}``slabs''
from which fixed size objects are allocated. Each page only contains
objects of one size; a separate slab allocator handles each size from
16 bytes up to 1024 bytes, in 16-byte increments. The allocator provides
extremely fast allocation performance for object oriented programs
in which many objects of a given size are allocated. The slab allocator
also allocates one page at a time from the global page allocator.
However, it maintains a pool of empty pages to quickly satisfy requests.
This is the same architecture used by the Linux kernel to satisfy
memory requests.

The \texttt{\footnotesize ptl\_mm\_alloc()} function intelligently
decides from which of the two allocators (general or slab) to allocate
a given sized object, based on the size in bytes, object type and
caller. The standard \texttt{\footnotesize new} operator and
\texttt{\footnotesize malloc()} both use this function. Similarly, the \texttt{\footnotesize ptl\_mm\_free()}
function frees memory. PTLsim uses a special bitmap to track which
pages are slab allocator pages; if a pointer falls within a slab,
the slab deallocator is used; otherwise the general allocator is used
to free the extent.
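
As an illustration of the slab size classes described above (16 to 1024
bytes in 16-byte increments), the following sketch maps a request size
to a class index. The function and constant names are hypothetical, and
the sketch considers only the size; the real \texttt{\footnotesize ptl\_mm\_alloc()}
also takes the object type and caller into account.

```cpp
#include <cstddef>

const size_t SLAB_GRANULARITY = 16;   // from the text: 16-byte increments
const size_t SLAB_MAX_SIZE = 1024;    // from the text: largest slab object

// Returns the slab class index (0 for 1..16 bytes, 1 for 17..32, ...),
// or -1 if the request cannot be served by any slab class.
inline int slab_class_for_size(size_t bytes) {
  if (bytes == 0 || bytes > SLAB_MAX_SIZE) return -1;
  return static_cast<int>((bytes + SLAB_GRANULARITY - 1) / SLAB_GRANULARITY) - 1;
}

// Object size actually handed out by a given slab class.
inline size_t slab_object_size(int slab_class) {
  return (static_cast<size_t>(slab_class) + 1) * SLAB_GRANULARITY;
}
```

Under these assumptions there are 64 classes, and a 17-byte request is
rounded up to the 32-byte class.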


\subsection{Garbage Collection and Reclaim Mechanism}

The memory manager implements a garbage collection mechanism with
which other subsystems register reclaim functions that get called
when an allocation fails. The \texttt{\footnotesize ptl\_mm\_register\_reclaim\_handler()}
function serves this role. Whenever an allocation fails, the reclaim
handlers are called in sequence, followed by an extent cleanup pass,
before retrying the allocation. This process repeats until the allocation
succeeds or an abort threshold is reached.

The reclaim function gets passed two parameters: the size in bytes
of the failed allocation, and an \emph{urgency} parameter. If \emph{urgency}
is 0, the subsystem registering the callback should do everything
in its power to free all memory it owns. Otherwise, the subsystem
should progressively trim more and more unused memory with each call
(and increasing urgency). \emph{Under no circumstances} is a reclaim
handler allowed to allocate \emph{any} additional memory! Doing so
will create an infinite loop; the memory manager will detect this
and shut down PTLsim if it is attempted.
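
The retry loop can be sketched as follows. This is a toy model under
stated assumptions: \texttt{\footnotesize MiniAllocator}, its byte counter,
the \texttt{\footnotesize max\_rounds} abort threshold and the urgency values
passed to the handlers are all hypothetical, and the extent cleanup pass
is omitted.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Handler signature mirrors the two parameters described in the text.
typedef std::function<void(size_t bytes, int urgency)> ReclaimHandler;

// Hypothetical miniature of the reclaim mechanism.
struct MiniAllocator {
  size_t free_bytes = 0;
  std::vector<ReclaimHandler> handlers;

  void register_reclaim_handler(ReclaimHandler h) { handlers.push_back(h); }

  // Try to allocate; on failure call every handler in registration order,
  // then retry. Gives up (returns false) after max_rounds reclaim rounds.
  bool alloc(size_t bytes, int max_rounds = 4) {
    for (int round = 0; ; round++) {
      if (free_bytes >= bytes) { free_bytes -= bytes; return true; }
      if (round == max_rounds) return false;
      // Assumption: urgency starts at 1 and grows with each failed round.
      for (ReclaimHandler& h : handlers) h(bytes, round + 1);
    }
  }
};
```

Note that the handlers in such a scheme only return memory to the pool;
they never call back into \texttt{\footnotesize alloc()}, in keeping with
the rule above.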


\chapter{\label{sec:StatisticsInfrastructure}Statistics Collection and Analysis}


\section{PTLsim Statistics Data Store}


\subsection{Introduction}

PTLsim maintains a huge number of statistical counters and data points
during the simulation process, and can optionally save this data to
a file by using the {}``\texttt{\footnotesize -stats} \emph{filename}''
configuration option. The data store is a binary file format used
to efficiently capture large quantities of statistical information
for later analysis. This file format supports storing multiple regular
or triggered snapshots of all counters. Snapshots can be subtracted,
averaged and extensively manipulated, as will be described later on.

PTLsim makes it trivial to add new performance counters to the statistics
data tree. All counters are defined in \texttt{\footnotesize stats.h}
as a tree of nested structures; the top-level \texttt{\footnotesize PTLsimStats}
structure is mapped to the global variable \texttt{\footnotesize stats},
so counters can be directly updated from within the code by simple
increments, e.g. \texttt{\footnotesize stats.xxx.yyy.zzz.countername++}.
Every node in the tree can be either a \texttt{\footnotesize struct},
\texttt{\footnotesize W64}{\footnotesize{} }(64-bit integer), \texttt{\footnotesize double}
(floating point) or \texttt{\footnotesize char} (string) type; arrays
of these types are also supported. In addition, various attributes,
described below, can be attached to each node or counter to specify
more complex semantics, including histograms, labeled arrays, summable
nodes and so on.

PTLsim comes with a special script, \texttt{\footnotesize dstbuild}
({}``data store template builder'') that parses \texttt{\footnotesize stats.h}
and constructs a binary representation (a {}``template'') describing
the structure; this template data is then compiled into PTLsim. Every
time PTLsim creates a statistics file, it first writes this template,
followed by the raw \texttt{\footnotesize PTLsimStats} records and
an index of those records by name. In this way, the complete data
store tree can be reconstructed at a later time even if the original
\texttt{\footnotesize stats.h} or PTLsim version that created the
file is unavailable. This scheme is analogous to the separation of
XML schemas (the template) from the actual XML data (the stats records),
but in our case the template and data are stored in binary format for
efficient parsing.

We suggest using the data store mechanism to store \emph{all} statistics
generated by your additions to PTLsim, since this system has built-in
support for snapshots, checkpointing and structured, easy to parse
data (unlike simply writing values to a text file). It is further
suggested that only raw values be saved, rather than doing computations
in the simulator itself; leave the analysis to PTLstats after gathering
the raw data. If some limited computations do need to be done before
writing each statistics record, PTLsim will call the \texttt{\footnotesize PTLsimMachine::update\_stats()}
virtual method to allow your model a chance to do so before writing
the counters.


\subsection{\label{sec:StatisticsNodeAttributes}Node Attributes}

After each node or counter is declared, one of several special C++-style
{}``//'' comments can be used to specify \emph{attributes} for that
node:

\begin{itemize}
\item \texttt{\small struct Name \{ // rootnode:}{\small \par}


The node is at the root of the statistics tree (typically this only
applies to the PTLsimStats structure itself).

\item \texttt{\small struct Name \{ // node: summable}{\small \par}


All subnodes and counters under this node are assumed to total 100\%
of whatever quantity is being measured. This attribute tells PTLstats
to print percentages next to the raw values in this subtree for easier
viewing.

\item \texttt{\small W64 name{[}arraysize]; // histo:}{\small{} }\texttt{\emph{\small min,}}{\small{}
}\texttt{\emph{\small max,}}{\small{} }\texttt{\emph{\small stride}}{\small \par}


Specifies that the array of counters forms a \emph{histogram}, i.e.
each slot in the array represents the number of occurrences of one
event out of a mutually exclusive set of events. The \emph{min} parameter
specifies the meaning of the first slot (array element 0), while the
\emph{max} parameter specifies the meaning of the last slot (array
element \emph{arraysize}-1). The \emph{stride} parameter specifies
how many events are counted into every slot (typically this is 1).

For example, let's say you want to measure the frequency distribution
of the number of consumers of each instruction's result, where the
maximum number of possible consumers is 256. You could specify this
as:

\begin{quote}
\texttt{\small W64 consumers{[}64+1]; // histo: 0, 256, 4}{\small \par}
\end{quote}
This histogram has a logical range of 0 to 256, but is divided into
65 slots. Because the \emph{stride} parameter is 4, any consumer counts
from 0 to 3 increment slot 0, counts from 4 to 7 increment slot 1,
and so on. When you update this counter array from inside the model,
you should do so as follows:

\begin{quote}
\texttt{\small stats.xxx.yyy.consumers{[}min(n / 4, 64)]++;}{\small \par}
\end{quote}
\item \texttt{\small W64 name{[}arraysize]; // label: namearray}{\small \par}


Specifies that the array of counters is a histogram of named, mutually
exclusive events, rather than simply raw numbers (as with the \emph{histo}
attribute). The \emph{namearray} must be the name of an array of \emph{arraysize}
strings, with one entry per event.

For example, let's say you want to measure the frequency distribution
of uop types PTLsim is executing. If there are \texttt{\small OPCLASS\_COUNT} uop classes, you
could declare the following:

\begin{quote}
\texttt{\small W64 opclass{[}OPCLASS\_COUNT]; // label: opclass\_names}{\small \par}
\end{quote}
In some header file included by \texttt{\small stats.h}, you need
to declare the actual array of slot labels:

\begin{quote}
\texttt{\small static const char{*} opclass\_names{[}OPCLASS\_COUNT]
= \{\char`\"{}logic\char`\"{}, \char`\"{}addsub\char`\"{}, \char`\"{}addsubc\char`\"{}, ...\};}{\small \par}
\end{quote}
\end{itemize}
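
The \emph{histo} example above can be written out as a small self-contained
sketch. The \texttt{\small ExampleStats} structure and the helper function
are hypothetical; the declaration and the update expression follow the
\texttt{\small consumers} example in the list.

```cpp
#include <algorithm>
#include <cstdint>

typedef uint64_t W64;

const int CONSUMERS_STRIDE = 4;        // stride parameter of the histogram
const int CONSUMERS_SLOTS = 64 + 1;    // slots covering 0..256 in steps of 4

// Hypothetical statistics node holding just the one histogram.
struct ExampleStats {
  W64 consumers[CONSUMERS_SLOTS]; // histo: 0, 256, 4
};

// Record one instruction whose result had n consumers. Counts beyond the
// logical maximum are clamped into the last slot.
inline void count_consumers(ExampleStats& s, int n) {
  int slot = std::min(n / CONSUMERS_STRIDE, CONSUMERS_SLOTS - 1);
  s.consumers[slot]++;
}
```

Here consumer counts 0 through 3 land in slot 0, count 4 lands in slot
1, and a count of 256 (or anything larger) lands in slot 64.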

\subsection{\label{sec:StatisticsOptions}Configuration Options}

PTLsim supports several options related to the statistics data store:

\begin{itemize}
\item \texttt{\small -stats}{\small{} }\texttt{\emph{\small filename}}{\small \par}


Specify the filename to which statistics data is written. In reality,
two files are created: \emph{filename} contains the template and snapshot
index, while \emph{filename.data} contains the raw data.

\item \texttt{\small -snapshot-cycles}{\small{} }\texttt{\emph{\small N}}{\small \par}


Creates a snapshot every N simulation cycles, numbered consecutively
starting from 0. Without this option, only one snapshot, named \texttt{\small final},
is created at the end of the simulation run.

\item \texttt{\small -snapshot-now}{\small{} }\texttt{\emph{\small name}}{\small \par}


Creates a snapshot named \emph{name} at the current point in the simulation.
This can be used to asynchronously take a look at a simulation in
progress. \emph{This option is only available in PTLsim/X.}

\end{itemize}

\section{PTLstats: Statistics Analysis and Graphing Tools}

The \textbf{\emph{PTLstats}} program is used to analyze the statistics
data store files produced by PTLsim. PTLstats will first extract the
template stored in all data store files, and will then parse the statistics
records into a flexible tree format that can be manipulated by the
user. The following is an example of one node in the statistics tree,
as printed by PTLstats:

\begin{lyxcode}
{\small dcache~\{}{\small \par}

{\small{}~~store~\{}{\small \par}

{\small{}~~~~issue~(total~68161716)~\{}{\small \par}

{\small{}~~~~{[}~29.7\%~]~replay~(total~20218780)~\{}{\small \par}

{\small{}~~~~{[}~~0.0\%~]~sfr\_addr\_not\_ready~=~0;}{\small \par}

{\small{}~~~~{[}~16.8\%~]~sfr\_data\_and\_data\_to\_store\_not\_ready~=~3405878;}{\small \par}

{\small{}~~~~{[}~11.8\%~]~sfr\_data\_not\_ready~=~2379338;}{\small \par}

{\small{}~~~~{[}~23.4\%~]~sfr\_addr\_and\_data\_to\_store\_not\_ready~=~4740838;}{\small \par}

{\small{}~~~~{[}~24.5\%~]~sfr\_addr\_and\_data\_not\_ready~=~4951888;}{\small \par}

{\small{}~~~~{[}~23.4\%~]~sfr\_addr\_and\_data\_and\_data\_to\_store\_not\_ready~=~4740838;}{\small \par}

{\small{}~~\}}{\small \par}

{\small{}~~{[}~~0.0\%~]~exception~=~30429;}{\small \par}

{\small{}~~{[}~~7.9\%~]~ordering~=~5404592;}{\small \par}

{\small{}~~{[}~62.4\%~]~complete~=~42507854;}{\small \par}

{\small{}~~{[}~~0.0\%~]~unaligned~=~61;}{\small \par}

{\small \}}{\small \par}
\end{lyxcode}
Notice how PTLstats will automatically sum up all entries in certain
branches of the tree to provide the user with a breakdown by percentages
of the total for that subtree in addition to the raw values. This
is achieved using the {}``\texttt{\small // node: summable}'' attribute
as described in Section \ref{sec:StatisticsNodeAttributes}.

Here is an example of a labeled histogram, produced using the {}``\texttt{\small //
label: xxx}'' attribute described in Section \ref{sec:StatisticsNodeAttributes}:

\begin{lyxcode}
{\small size{[}4]~=~\{}{\small \par}

{\small{}~~ValRange:~3209623~90432573}{\small \par}

{\small{}~~Total:~~~107190122}{\small \par}

{\small{}~~Thresh:~~~~~10720}{\small \par}

{\small{}~~{[}~~6.2\%~]~~~~~~~~0~~6686971~1~(byte)}{\small \par}

{\small{}~~{[}~~6.4\%~]~~~~~~~~1~~6860955~2~(word)}{\small \par}

{\small{}~~{[}~84.4\%~]~~~~~~~~2~90432573~4~(dword)}{\small \par}

{\small{}~~{[}~~3.0\%~]~~~~~~~~3~~3209623~8~(qword)}{\small \par}

{\small \};}{\small \par}
\end{lyxcode}

\section{Snapshot Selection}

The basic syntax of the PTLstats command is {}``\texttt{\small ptlstats
-}\emph{options} \emph{filename}''. If no options are specified,
PTLstats prints out the entire statistics tree from its root, relative
to the \texttt{\small final} snapshot.

To select a specific snapshot, use the following option:

\begin{lyxcode}
{\small ptlstats~}\textbf{\small -snapshot}{\small{}~}\emph{\small name-or-number}{\small{}~...}{\small \par}
\end{lyxcode}
Snapshots may be specified by name or number.

It may be desirable to examine the difference in statistics \emph{between}
two snapshots, for instance to subtract out the counters at the starting
point of a given run or after a warmup period. The \texttt{\small -subtract}
option provides this facility, for example:

\begin{lyxcode}
{\small ptlstats~-snapshot~}\emph{\small final}{\small{}~}\textbf{\small -subtract}{\small{}~}\emph{\small startpoint}{\small{}~...}{\small \par}
\end{lyxcode}

\section{Working with Statistics Trees: Collection, Averaging and Summing}

To select a specific subtree of interest, use the syntax of the following
example:

\begin{lyxcode}
{\small ptlstats~}\textbf{\small -snapshot}{\small{}~final~}\textbf{\small -collect}{\small{}~/ooocore/dcache/load~example1.stats~example2.stats~...}{\small \par}
\end{lyxcode}
This will print out the subtree \texttt{\small /ooocore/dcache/load}
in the snapshot named \texttt{\small final} (the default snapshot)
for each of the named statistics files \texttt{\small example1.stats},
\texttt{\small example2.stats} and so on. Multiple files are generally
used to examine a specific subnode across several benchmarks.

Subtrees or individual statistics can also be summed and averaged
across many files, using the \texttt{\textbf{\small -collectsum}}
or \texttt{\textbf{\small -collectaverage}} commands in place of \texttt{\small -collect}.


\section{Traversal and Printing Options}

The \texttt{\textbf{\small -maxdepth}} option is useful for limiting
the depth (in nodes) PTLstats will descend into the specified subtree.
This is appropriate when you want to summarize certain classes of
statistics printed as percentages of the whole, yet don't want a breakdown
of every sub-statistic.

The \texttt{\textbf{\small -percent-of-toplevel}} option changes the
way percentages are displayed. By default, percentages are calculated
by dividing the total value of each node by the total of its immediate
parent node. When \texttt{\small -percent-of-toplevel} is enabled,
the divisor becomes the total of the entire subtree, possibly going
back several levels (i.e. back to the highest level node marked with
the \emph{summable} attribute), rather than each node's immediate
parent.


\section{Table Generation}

PTLstats provides a facility to easily generate R-row by C-column
data tables from a set of R benchmarks run with C different sets of
parameters. Tables can be output in a variety of formats, including
plain text with tab or space delimiters (suitable for import into
a spreadsheet), \LaTeX{} (for direct insertion into research reports)
or HTML. To generate a table, use the following syntax:

\begin{lyxcode}
{\small ptlstats~}\textbf{\small -table}{\small{}~/final/summary/cycles~-rows~gzip,gcc,perlbmk,mesa~-cols~small,large,huge~-table-pattern~\char`\"{}\%row/ptlsim.stats.\%col\char`\"{}}{\small \par}
\end{lyxcode}
In this example, the benchmarks ({}``gzip'', {}``gcc'', {}``perlbmk'',
{}``mesa'') will form the rows of the table, while three trials
done for each benchmark ({}``small'', {}``large'', {}``huge'')
will be listed in the columns. The row and column names will be combined
using the pattern {}``\texttt{\small \%row/ptlsim.stats.\%col}''
to generate statistics data store filenames like {}``\texttt{\small gzip/ptlsim.stats.small}''.
PTLstats will then load the data store for each benchmark and trial
combination to create the table.
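
The pattern expansion amounts to a simple string substitution over every
row and column combination. The sketch below shows one way this could
be done; the function names are hypothetical and this is not the code
PTLstats itself uses.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Replace every occurrence of 'key' in 's' with 'value'.
inline std::string replace_all(std::string s, const std::string& key,
                               const std::string& value) {
  if (key.empty()) return s;
  size_t pos = 0;
  while ((pos = s.find(key, pos)) != std::string::npos) {
    s.replace(pos, key.size(), value);
    pos += value.size();  // continue after the text just inserted
  }
  return s;
}

// Produce one filename per (row, column) cell, in row-major order.
inline std::vector<std::string> expand_pattern(
    const std::string& pattern,
    const std::vector<std::string>& rows,
    const std::vector<std::string>& cols) {
  std::vector<std::string> out;
  for (const std::string& r : rows)
    for (const std::string& c : cols)
      out.push_back(replace_all(replace_all(pattern, "%row", r), "%col", c));
  return out;
}
```

For two rows and three columns this yields six filenames, the first being
\texttt{\small gzip/ptlsim.stats.small} when the first row is {}``gzip''
and the first column is {}``small''.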

Notice that you must create your own scripts, or manually run each
benchmark and trial with the desired PTLsim options, plus {}``\texttt{\small -stats
ptlsim.stats.}\texttt{\emph{\small trialname}}''. PTLstats will only
report these results in table form; it will not actually run any benchmarks.

The \texttt{\textbf{\small -tabletype}} option specifies the data
format of the table: {}``\texttt{\small text}'' (plain text with
space delimiters, suitable for import into a spreadsheet), {}``\texttt{\small latex}''
(\LaTeX{} format, useful for directly inserting into research reports),
or {}``\texttt{\small html}'' (HTML format for web pages).

The {}``\texttt{\textbf{\small -scale-relative-to-col}}{\small{} }\texttt{\emph{\small N}}''
option forces PTLstats to compute the percentage of increase or decrease
for each cell relative to the corresponding row in some other reference
column \emph{N}. This is useful when running a {}``baseline'' case,
to be displayed as a raw value (usually the cycle count, \texttt{\small /final/summary/cycles})
in column 0, while all other experimental cases are displayed as a
percentage increase (fewer cycles, for a positive percentage) or percentage
decrease (negative value) relative to this first column (\emph{N}
= 0).


\subsection{Bargraph Generation}

In addition to creating tables, PTLstats can directly create colorful
graphs (in Scalable Vector Graphics (SVG) format) from a set of benchmarks
(specified by the \texttt{\small -rows} option) and trials of each
benchmark (specified by the \texttt{\small -cols} option). For instance,
to plot the total number of cycles taken over a set of benchmarks,
each run under different PTLsim configurations, use the following
example:

\begin{lyxcode}
{\small ptlstats~}\textbf{\small -bargraph}{\small{}~/final/summary/cycles~-rows~gzip,gcc,perlbmk,mesa~-cols~small,large,huge~-table-pattern~\char`\"{}\%row/ptlsim.stats.\%col\char`\"{}}{\small \par}
\end{lyxcode}
In this case, groups of three bars (for the trials {}``small'',
{}``large'', {}``huge'') appear for each benchmark.

The graph's layout can be extensively customized using the options
\texttt{\small -title}, \texttt{\small -width}, \texttt{\small -height}.

Inkscape (http://www.inkscape.org) is an excellent vector graphics
system for editing and formatting SVG files generated by PTLstats.


\section{Histogram Generation}

Certain array nodes in the statistics tree can be tagged as {}``histogram''
nodes by using the \texttt{\small histo:} or \texttt{\small label:}
attributes, as described in Section \ref{sec:StatisticsNodeAttributes}.
For instance, the \texttt{\small ooocore/frontend/consumer-count}
node in the out-of-order core is a histogram node. PTLstats can directly
create graphs (in Scalable Vector Graphics (SVG) format) for these
special nodes, using the \texttt{\textbf{\small -histogram}} option:

\begin{lyxcode}
{\small ptlstats~}\textbf{\small -histogram}{\small{}~/ooocore/frontend/consumer-count~>~example.svg}{\small \par}
\end{lyxcode}
The histogram's layout can be extensively customized using the options
\texttt{\small -title}, \texttt{\small -width}, \texttt{\small -height}.
In addition, the \texttt{\small -percentile} option is useful for
controlling the displayed data range by excluding data under the Nth
percentile. The \texttt{\small -logscale} and \texttt{\small -logk}
options can be used to apply a log scale (instead of a linear scale)
to the histogram bars. The syntax of these options can be obtained
by running \texttt{\small ptlstats} without arguments.


\chapter{Benchmarking Techniques}


\section{\label{sec:TriggerMode}Trigger Mode and other PTLsim Calls From
User Code}

PTLsim optionally allows user code to control the simulator mode through
the \texttt{\small ptlcall\_xxx()} family of functions found in \texttt{\small ptlcalls.h}
when trigger mode is enabled (\texttt{\small -trigger} configuration
option). This file should be included by any PTLsim-aware user programs;
these programs must be recompiled to take advantage of these features.
Amongst the functions provided by \texttt{\small ptlcalls.h} are:

\begin{itemize}
\item \texttt{\small ptlcall\_switch\_to\_sim()} is only available while
the program is executing in native mode. It forces PTLsim to regain
control and begin simulating instructions as soon as this call returns.
\item \texttt{\small ptlcall\_switch\_to\_native()} stops simulation and
returns to native execution, effectively removing PTLsim from the
loop.
\item \texttt{\small ptlcall\_marker()} simply places a user-specified marker
number in the PTLsim log file.
\item \texttt{\small ptlcall\_capture\_stats()} adds a new statistics data
store snapshot at the time it is called. You can pass a string to
this function to name your snapshot, but all names must be unique.
\item \texttt{\small ptlcall\_nop()} does nothing but test the call mechanism.
\end{itemize}
In userspace PTLsim, these calls work by forcing execution to code
on a {}``gateway page'' at a specific fixed address (\texttt{\small 0x1000}
currently); PTLsim will write the appropriate call gate code to this
page depending on whether the process is in native or simulated mode.
In native mode, the call gate page typically contains a 64-to-64-bit
or 32-to-64-bit far jump into PTLsim, while in simulated mode it contains
a reserved x86 opcode interpreted by the x86 decoder as a special
kind of system call. If PTLsim is built on a 32-bit only system, no
mode switch is required.

In full system PTLsim/X, the x86 opcodes used to implement these calls
are directly handled by the PTLsim/X hypervisor as if they were actually
part of the native x86 instruction set.

Generally these calls are used to perform {}``intelligent benchmarking'':
the \texttt{\small ptlcall\_switch\_to\_sim()} call is made at the
top of the main loop of a benchmark after initialization, while the
\texttt{\small ptlcall\_switch\_to\_native()} call is inserted after
some number of iterations to stop simulation after a representative
subset of the code has completed. This intelligent approach is far
better than the blind {}``sample for N million cycles after S million
startup cycles'' approach used by most researchers.
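
The placement of the two calls can be sketched as follows. To keep the
example self-contained, the two \texttt{\small ptlcall} functions are replaced
by local stand-ins that merely log the call; their signatures are assumptions,
and a real benchmark would include \texttt{\small ptlcalls.h} instead. The
benchmark itself is hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Stand-ins so this sketch compiles without ptlcalls.h. A real benchmark
// would include that header instead; the signatures here are assumptions.
static std::vector<std::string> ptlcall_log;
inline void ptlcall_switch_to_sim() { ptlcall_log.push_back("sim"); }
inline void ptlcall_switch_to_native() { ptlcall_log.push_back("native"); }

// Hypothetical benchmark: initialization runs natively, then only the first
// simulated_iters iterations of the main loop run under simulation.
inline long run_benchmark(int total_iters, int simulated_iters) {
  std::vector<long> data(1000);
  for (size_t i = 0; i < data.size(); i++) data[i] = (long)i;  // initialization

  long sum = 0;
  ptlcall_switch_to_sim();  // top of the main loop, after initialization
  for (int iter = 0; iter < total_iters; iter++) {
    // Representative subset finished: return to native execution.
    if (iter == simulated_iters) ptlcall_switch_to_native();
    for (long v : data) sum += v;
  }
  return sum;
}
```

The remaining iterations still run to completion, so the benchmark produces
its normal result; only the simulated portion is limited.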

Fortran programs will have to actually link in the \texttt{\small ptlcalls.o}
object file, since they cannot include C header files. The function
names that should be used in the Fortran code remain the same as those
from the \texttt{\small ptlcalls.h} header file.


\section{\label{sec:IPCNotes}Notes on Benchmarking Methodology and {}``IPC''}

The x86 instruction set requires some different benchmarking techniques
than classical RISC ISAs. In particular, \textbf{uIPC (Micro-Instructions
per Cycle) is NOT a good measure of performance for an x86 processor.}
Because one x86 instruction may be broken up into numerous uops, it
is never appropriate to compare IPC figures for committed x86 instructions
per clock with IPC values from a RISC machine. Furthermore, different
x86 implementations use varying numbers of uops per x86 instruction
as a matter of encoding, so even comparing the uop based IPC between
x86 implementations or RISC-like machines is inaccurate.

Users are strongly advised to use relative performance measures instead.
Comparing the total simulated cycle count required to complete a given
benchmark between different simulator configurations is much more
appropriate than IPC with the x86 instruction set. An example would
be {}``the baseline took 100M cycles, while our improved system took
50M cycles, for a 2x improvement''.


\section{\label{sec:SimulationWarmupPeriods}Simulation Warmup Periods}

In some simulators, it is possible to quickly skip through a specific
number of instructions before starting to gather statistics, to avoid
including initialization code in the statistics. In PTLsim, this is
neither necessary nor desirable. Because PTLsim directly executes
your program on the host CPU until it switches to cycle accurate simulation
mode, there is no way to count instructions in this manner. 

Many researchers have gotten in the habit of blindly skipping a large
number of instructions in benchmarks to avoid profiling initialization
code. However, this is not a very intelligent policy: different benchmarks
have different startup times until the top of the main loop is reached,
and it is generally evident from the benchmark source code where that
point should be. Therefore, PTLsim supports \textbf{trigger points:}
by inserting a special function call (\texttt{\footnotesize ptlcall\_switch\_to\_sim})
within the benchmark source code and recompiling, the \texttt{\footnotesize -trigger}
PTLsim option can be used to run the code on the host CPU until the
trigger point is reached. If the source code is unavailable, the \texttt{\footnotesize -startrip}{\footnotesize{}
}\texttt{\emph{\footnotesize 0xADDRESS}} option will start full simulation
only at a specified address (e.g. function entry point). 

If you want to warm up the cache and branch predictors prior to starting
statistics collection, combine the \texttt{\footnotesize -trigger}
option with the \texttt{\footnotesize -snapshot-cycles}{\footnotesize{}
}\texttt{\emph{\footnotesize N}} option, to start full simulation
at the top of the benchmark's main loop (where the trigger call is),
but only start gathering statistics \emph{N} cycles later, after the
processor is warmed up. Remember, since the trigger point is placed
\emph{after} all initialization code in the benchmark, in general
it is only necessary to use 10-20 million cycles of warmup time before
taking the first statistics snapshot. In this time, the caches and
branch predictor will almost always be completely overwritten many
times. This approach significantly speeds up the simulation without
any loss of accuracy compared to the {}``fast simulation''
mode provided by other simulators. 

In PTLstats, use the \texttt{\footnotesize -subtract} option to make
sure the final statistics don't include the warmup period before the
first snapshot. To subtract snapshot 0 (the first snapshot after the
warmup period) from the final snapshot, use a command similar to
the following:

\begin{lyxcode}
{\footnotesize ptlstats~-subtract~0~ptlsim.stats}{\footnotesize \par}
\end{lyxcode}
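
Conceptually, removing the warmup period amounts to a counter-by-counter
difference between two snapshots. The following is only an illustrative
sketch of that idea: the \texttt{\footnotesize Snapshot} type and the
\texttt{\footnotesize subtract\_snapshots} function are hypothetical names
and do not reflect the actual PTLstats data structures.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a statistics snapshot: a flat list of counters.
using Snapshot = std::vector<std::uint64_t>;

// Return (final - base) counter by counter, removing everything that was
// accumulated before the base snapshot (i.e. during warmup).
Snapshot subtract_snapshots(const Snapshot& final_snap, const Snapshot& base) {
  assert(final_snap.size() == base.size());
  Snapshot delta(final_snap.size());
  for (std::size_t i = 0; i < final_snap.size(); i++) {
    delta[i] = final_snap[i] - base[i];
  }
  return delta;
}
```

Only the events counted after the base snapshot survive in the result.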

\section{\label{sec:SequentialMode}Sequential Mode}

PTLsim supports \emph{sequential mode}, in which instructions are
run on a simple, in-order processor model (in \texttt{\footnotesize seqcore.cpp})
without accounting for cache misses, branch mispredicts and so forth.
This is much faster than the out of order model, but is obviously
slower than native execution. The purpose of sequential mode is mainly
to aid in testing the x86 to uop decoder, microcode functions and
RTL-level uop implementation code. It may also be useful for gathering
certain statistics on the instruction mix and count without running
a full simulation.

\emph{NOTE:} Sequential mode is \emph{not} intended as a {}``warmup
mode'' for branch predictors and caches. If you want this behavior,
use statistical snapshot deltas as described in Section \ref{sec:SimulationWarmupPeriods}. 

Sequential mode is enabled by specifying the {}``\texttt{\footnotesize -core
seq}'' option. It has no other core-specific options.


\part{\label{sec:PTLsimClassic}PTLsim Classic: Userspace Linux Simulation}


\chapter{Getting Started with PTLsim}

\emph{NOTE:} This part of the manual is relevant only if you are using
the classic userspace-only version of PTLsim. If you are looking for
the full system SMP/SMT version, PTLsim/X, please skip this entire
part and read Part \ref{sec:PTLsimFullSystem} instead.


\section{Building PTLsim}

Prerequisites:

\begin{itemize}
\item PTLsim can be built on \textbf{both 64-bit x86-64 machines} (AMD Athlon
64 / Opteron / Turion, Intel Pentium 4 with EM64T and Intel Core 2)
\textbf{as well as ordinary 32-bit x86 systems}. In either case, your
system must support SSE2 instructions; CPUs made in the
last few years (such as Pentium 4 and Athlon 64) support this, but
PTLsim specifically does \emph{not} run on older CPUs (Pentium III
and earlier).
\item If built for x86-64, PTLsim will run both 64-bit and 32-bit programs
automatically. If built on a 32-bit Linux distribution and compiler,
PTLsim only supports ordinary x86 programs and will typically be slower
than the 64-bit build, even on 32-bit user programs.
\item PTLsim runs on any recent Linux 2.6 based distribution.
\item We have successfully built PTLsim with gcc 3.3, 3.4.x and 4.1.x+ (gcc
4.0.x has documented bugs affecting some of our code).
\end{itemize}
Quick Start Steps:

\begin{itemize}
\item Download PTLsim from our web site (\texttt{\footnotesize http://www.ptlsim.org/download.php}).
We recommend starting with the {}``stable'' version, since this
contains all the files you need and can be updated later if desired.
\item Unpack \texttt{\footnotesize ptlsim-2006xxxx-rXXX.tar.gz} to create
the \texttt{\footnotesize ptlsim} directory.
\item Run \texttt{\footnotesize make}.

\begin{itemize}
\item The Makefile will detect your platform and automatically compile the
correct version of PTLsim (32-bit or 64-bit).
\end{itemize}
\end{itemize}

\section{\label{sec:RunningPTLsim}Running PTLsim}

PTLsim invocation is very simple: after compiling the simulator and
making sure the \texttt{\small ptlsim} executable is in your path,
simply run:

\begin{quote}
\texttt{\footnotesize ptlsim}\texttt{\small ~} \emph{full-path-to-executable}
\emph{arguments...}
\end{quote}
PTLsim reads configuration options for running various user programs
by looking for a configuration file named \texttt{\footnotesize /home/}\texttt{\emph{\footnotesize username}}\texttt{\footnotesize /.ptlsim/}\texttt{\emph{\footnotesize path/to/program/executablename}}\texttt{\footnotesize .conf}.
To set options for each program, you'll need to create a directory
of the form \texttt{\footnotesize /home/}\texttt{\emph{\footnotesize username}}\texttt{\footnotesize /.ptlsim}
and make sub-directories under it corresponding to the full path to
the program. For example, to configure \texttt{\footnotesize /bin/ls}
you'll need to run \char`\"{}\texttt{\footnotesize mkdir -p /home/}\texttt{\emph{\footnotesize username}}\texttt{\footnotesize /.ptlsim/bin}\char`\"{}
and then edit \char`\"{}\texttt{\footnotesize /home/}\texttt{\emph{\footnotesize username}}\texttt{\footnotesize /.ptlsim/bin/ls.conf}\char`\"{}
with the appropriate options. To try this out, put the following
in \texttt{\footnotesize ls.conf}:

\begin{quote}
\texttt{\footnotesize -logfile ptlsim.log -loglevel 9 -stats ls.stats
-stopinsns 10000}{\footnotesize \par}
\end{quote}
Then run:

\begin{quote}
\texttt{\footnotesize ptlsim /bin/ls -la}{\footnotesize \par}
\end{quote}
PTLsim should display its system information banner, then the output
of simulating the directory listing. With the options above, PTLsim
will simulate \texttt{\footnotesize /bin/ls} starting at the first
x86 instruction in the dynamic linker's entry point, run until 10000
x86 instructions have been committed, and will then switch back to
native mode (i.e. the user code will run directly on the real processor)
until the program exits. While simulating, it will compile an extensive
log of the state of every micro-operation executed by the processor
and will save it to {}``\texttt{\footnotesize ptlsim.log}'' in the
current directory. It will also create {}``\texttt{\footnotesize ls.stats}'',
a binary file containing snapshots of PTLsim's internal performance
counters. The \texttt{\footnotesize ptlstats} program (Chapter \ref{sec:StatisticsInfrastructure})
can be used to print and analyze these statistics by running {}``\texttt{\footnotesize ptlstats
ls.stats}''.
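
The mapping from a program's path to its configuration file can be
sketched as a simple string concatenation. This is a minimal illustration
of the naming rule described in this section; the function name
\texttt{\footnotesize config\_path} is hypothetical, and PTLsim's own lookup
code may differ in detail.

```cpp
#include <cassert>
#include <string>

// Build the per-program configuration file name described in the text:
//   <home>/.ptlsim/<full path to executable>.conf
// 'home' is e.g. "/home/username"; 'exe' must be an absolute path.
std::string config_path(const std::string& home, const std::string& exe) {
  return home + "/.ptlsim" + exe + ".conf";
}
```

For \texttt{\footnotesize /bin/ls} and a home directory of
\texttt{\footnotesize /home/username}, this yields
\texttt{\footnotesize /home/username/.ptlsim/bin/ls.conf}.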


\section{\label{sec:ConfigurationOptions}Configuration Options}

PTLsim supports a variety of options in the configuration file of
each program; you can run {}``\texttt{\footnotesize ptlsim}'' without
arguments to get a full list of these options. The following sections
only list the most useful options, rather than every possible option.

The configuration file can also contain comments (starting with {}``\texttt{\footnotesize \#}''
at any point on a line) and blank lines; the first non-comment line
is used as the active configuration.
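
The comment and blank line rule can be sketched as follows. This is an
illustration of the rule as stated above, not PTLsim's actual parser; the
function name \texttt{\footnotesize active\_config\_line} is hypothetical.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Return the first line that still has content after stripping a '#' comment
// and surrounding whitespace; return "" if there is no such line.
std::string active_config_line(const std::string& text) {
  std::istringstream in(text);
  std::string line;
  while (std::getline(in, line)) {
    std::string::size_type hash = line.find('#');
    if (hash != std::string::npos) line.erase(hash);  // drop the comment
    std::string::size_type first = line.find_first_not_of(" \t\r");
    if (first == std::string::npos) continue;  // blank or comment-only line
    std::string::size_type last = line.find_last_not_of(" \t\r");
    return line.substr(first, last - first + 1);
  }
  return "";
}
```

Any further non-comment lines after the first one are ignored by this
sketch, matching the rule that only the first such line is active.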

PTLsim supports multiple models of various microprocessor cores; the
{}``\texttt{\footnotesize -core} \emph{corename}'' option can be
used to choose a specific core. The default core is {}``\texttt{\footnotesize ooo}'',
the dynamically scheduled out of order superscalar core described
in great detail in Part \ref{part:OutOfOrderModel}. PTLsim also comes
with a simple sequential in-order core, {}``\texttt{\footnotesize seq}''.
It is most useful for debugging decoding and microcode issues rather
than actual performance profiling.


\section{Logging Options}

PTLsim can log all simulation events to a log file, or can be instructed
to log only a subset of these events, starting and stopping at various
points:

\begin{itemize}
\item \texttt{\footnotesize -logfile} \emph{filename}


Specifies the file to which log messages will be written.

\item \texttt{\footnotesize -loglevel} \emph{level}


Selects a subset of the events that will be logged:

\begin{itemize}
\item 0 disables logging
\item 1 displays only critical events (such as system calls and state changes)
\item 2-3 displays less critical simulator-wide events
\item 4 displays major events within the core itself (like pipeline flushes,
basic block decodes, etc)
\item 6 displays \emph{all} events that occur within each pipeline stage
of the core every cycle
\item 99 displays every possible event. This will create massive log files!
\end{itemize}
\item \texttt{\footnotesize -startlog} \emph{cycle}


Starts logging only after \emph{cycle} cycles have elapsed from the
start of the simulation.

\item \texttt{\footnotesize -startlogrip} \emph{rip}


Starts logging only after the first time the instruction at \emph{rip}
is decoded or executed. This is mutually exclusive with \texttt{\footnotesize -startlog}.

\end{itemize}

\section{\label{sec:EventLogRingBuffer}Event Log Ring Buffer}

PTLsim also maintains an event log ring buffer. Every time the core
takes some action (for instance, dispatching an instruction, executing
a store, committing a result or annulling each uop after an exception),
it writes that event to a circular buffer that contains (by default)
the last 32768 events in chronological order (oldest to newest). This
is extremely useful for debugging in cases where you want to {}``look
backwards in time'' from the point where a specific but unknown {}``bad''
event occurred, but cannot leave logging at e.g. {}``\texttt{\footnotesize -loglevel
99}'' enabled all the time (because it is far too slow and space
consuming).

The event log ring buffer must be enabled via the \texttt{\footnotesize -ringbuf}
option. This is disabled by default since it exacts a 25-40\% performance
overhead (but this is much better than the 10000\%+ overhead of full
logging).
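
The behavior of a circular event buffer can be sketched in a few lines.
This is a generic illustration of the data structure, assuming events are
plain strings; the class name \texttt{\footnotesize EventRing} is
hypothetical and is not the event log class used inside PTLsim.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical fixed-capacity event ring: once full, each new event
// overwrites the oldest one.
class EventRing {
public:
  explicit EventRing(std::size_t capacity) : slots(capacity), head(0), count(0) {
    assert(capacity > 0);
  }

  void add(const std::string& event) {
    slots[head] = event;
    head = (head + 1) % slots.size();
    if (count < slots.size()) count++;
  }

  // Return the retained events in chronological order (oldest to newest).
  std::vector<std::string> dump() const {
    std::vector<std::string> out;
    std::size_t start = (head + slots.size() - count) % slots.size();
    for (std::size_t i = 0; i < count; i++) {
      out.push_back(slots[(start + i) % slots.size()]);
    }
    return out;
  }

private:
  std::vector<std::string> slots;
  std::size_t head;   // index of the next slot to write
  std::size_t count;  // number of valid events currently stored
};
```

With a capacity of 32768, a dump taken at the moment of failure would
contain the most recent 32768 events leading up to it.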

PTLsim will always print the ring buffer to the log file whenever:

\begin{itemize}
\item Any \texttt{\footnotesize assert} statement fails within the out of
order simulator core;
\item Any fatal exception occurs;
\item At user-specified points, by inserting {}``\texttt{\footnotesize core.eventlog.print(logfile);}''
anywhere within the code;
\item Whenever the {}``\texttt{\footnotesize -ringbuf-trigger-rip} \emph{rip}''
option is used to specify a trigger RIP. When the last uop
at this RIP is committed, the ring buffer is printed, exposing all
events that happened over the past few thousand cycles (going backwards
in time from the cycle in which the trigger instruction committed).
\end{itemize}
The event log ring buffer is also automatically enabled whenever \texttt{\footnotesize -loglevel}
is 6 or higher; in this case all events are logged to the logfile
after every cycle.

\section{Simulation Start Points}

Normally PTLsim starts in simulation mode at the first instruction
in the target program (or the Linux dynamic linker, assuming the program
is dynamically linked). It may be desirable to skip time-consuming
initialization parts of the program, using one of two methods.

The \texttt{\footnotesize -startrip} \emph{rip} option places a breakpoint
at \emph{rip}, then immediately switches to native mode until that
breakpoint is hit, at which point PTLsim begins simulation.

Alternatively, if the source code to the program is available, it
may be recompiled with call(s) to a special function, \texttt{\footnotesize ptlcall\_switch\_to\_sim()},
provided in \texttt{\footnotesize ptlcalls.h}. PTLsim is then started
with the \texttt{\footnotesize -trigger} option, which switches it
to native mode until the first call to the \texttt{\footnotesize ptlcall\_switch\_to\_sim()}
function, at which point simulation begins. This function, and other
special code that can be used within the target program, is described
in Section \ref{sec:TriggerMode}.
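
A benchmark instrumented with a trigger point might look roughly like
the sketch below. The benchmark function \texttt{\footnotesize run\_benchmark}
and the \texttt{\footnotesize HAVE\_PTLCALLS} macro are invented for this
example; when the real \texttt{\footnotesize ptlcalls.h} is not on the include
path, a no-op stand-in is used so that the example still compiles on its own.

```cpp
#include <cassert>

// With the real PTLsim header available, build with -DHAVE_PTLCALLS
// (a macro name invented for this sketch). Otherwise a no-op stand-in
// keeps the example compilable by itself.
#ifdef HAVE_PTLCALLS
#include "ptlcalls.h"
#else
static inline void ptlcall_switch_to_sim() {}
#endif

// Hypothetical benchmark: setup runs natively, the main loop is simulated.
long run_benchmark(int n) {
  long sum = 0;
  // ... time-consuming initialization would go here (runs natively) ...
  ptlcall_switch_to_sim();  // trigger point: simulation starts here
  for (int i = 1; i <= n; i++) sum += i;  // the region of interest
  return sum;
}
```

Run under PTLsim with \texttt{\footnotesize -trigger}, everything before
the call would execute natively and only the loop would be simulated.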


\section{Simulation Stop Points}

By default, PTLsim continues executing in simulation mode until the
target program exits on its own. However, typically programs are profiled
for a fixed number of committed x86 instructions, or until a specific
point is reached, so as to ensure an identical span of instructions
is executed on every trial, without waiting for the entire program
to finish. The following options support this behavior:

\begin{itemize}
\item \texttt{\footnotesize -stopinsns} \emph{insns} will stop the simulation
after \emph{insns} x86 instructions have committed.
\item \texttt{\footnotesize -stop} \emph{cycles} stops after \emph{cycles}
cycles have been simulated.
\item \texttt{\footnotesize -stoprip} \emph{rip} stops after the instruction
at \emph{rip} is decoded and executed the first time.
\end{itemize}
PTLsim will normally switch back to native mode after finishing simulation.
If the program should be terminated instead, the \texttt{\footnotesize -exitend}
option will do so.



\section{Statistics Collection}

PTLsim supports the collection of a wide variety of statistics and
counters as it simulates your code, and can make regular or triggered
snapshots of the counters. Chapter \ref{sec:StatisticsInfrastructure}
describes this support, while Section \ref{sec:StatisticsOptions}
documents the configuration options associated with statistics collection,
including \texttt{\footnotesize -stats}, \texttt{\footnotesize -snapshot-cycles}
and \texttt{\footnotesize -snapshot-now}.


\chapter{\label{sec:PTLsimInternals}PTLsim Classic Internals}


\section{\label{sec:Injection}Low Level Startup and Injection}

\emph{Note:} This section deals with the internal operation of the
PTLsim low level code, independent of the out of order simulation
engine. If you are only interested in modifying the simulator itself,
you can skip this section.

\emph{Note:} This section does not apply to the full system PTLsim/X;
please see the corresponding sections in Part \ref{sec:PTLsimFullSystem}
instead.


\subsection{\label{sub:Injection-On-x86-64}Startup on x86-64}

PTLsim is a very unusual Linux program. It does its own internal memory
management and threading without help from the standard libraries,
injects itself into other processes to take control of them, and switches
between 32-bit and 64-bit mode within a single process image. For
these reasons, it is very closely tied to the Linux kernel and uses
a number of undocumented system calls and features only available
in late 2.6 series kernels. 

PTLsim always starts and runs as a 64-bit process even when running
32-bit threads; it context switches between modes as needed. The statically
linked \texttt{\small ptlsim} executable begins executing at \texttt{\small ptlsim\_preinit\_entry}
in \texttt{\small lowlevel-64bit.S}. This code calls \texttt{\small ptlsim\_preinit()}
in \texttt{\small kernel.cpp} to set up our custom memory manager
and threading environment before any standard C/C++ functions are
used. After doing so, the normal \texttt{\small main()} function is
invoked.

The \texttt{\small ptlsim} binary can run in two modes. If executed
from the command line as a normal program, it starts up in \emph{inject}
mode. Specifically, \texttt{\small main()} in \texttt{\small ptlsim.cpp}
checks if the \texttt{\small inside\_ptlsim} variable has been set
by \texttt{\small ptlsim\_preinit\_entry}, and if not, PTLsim enters
inject mode. In this mode, \texttt{\small ptlsim\_inject()} in \texttt{\small kernel.cpp}
is called to effectively inject the \texttt{\small ptlsim} binary
into another process and pass control to it before even the dynamic
linker gets to load the program. In \texttt{\small ptlsim\_inject()},
the PTLsim process is forked and the child is placed under the parent's
control using \texttt{\small ptrace()}. The child process then uses
\texttt{\small exec()} to start the user program to simulate (this
can be either a 32-bit or 64-bit program). 

However, the user program starts in the stopped state, allowing \texttt{\small ptlsim\_inject()}
to use \texttt{\small ptrace()} and related functions to inject either
32-bit or 64-bit boot loader code directly into the user program address
space, overwriting the entry point of the dynamic linker. This code,
derived from \texttt{\small injectcode.cpp} (specifically compiled
as \texttt{\small injectcode-32bit.o} and \texttt{\small injectcode-64bit.o})
is completely position independent. Its sole function is to map the
rest of \texttt{\small ptlsim} into the user process address space
at virtual address \texttt{\small 0x70000000} and set up a special
\texttt{\small LoaderInfo} structure to allow the master PTLsim process
and the user process to communicate. The boot code also restores the
old code at the dynamic linker entry point after relocating itself.
Finally, \texttt{\small ptlsim\_inject()} adjusts the user process
registers to start executing the boot code instead of the normal program
entry point, and resumes the user process.

At this point, the PTLsim image injected into the user process exists
in a bizarre environment: if the user program is 32-bit, the boot
code will need to switch to 64-bit mode before calling the 64-bit
PTLsim entrypoint. Fortunately x86-64 and the Linux kernel make this
process easy, even though normal programs never use it: a regular
far jump switches the current code segment descriptor to \texttt{\small 0x33},
effectively switching the instruction set to x86-64. For the most
part, the kernel cannot tell the difference between a 32-bit and 64-bit
process: as long as the code uses 64-bit system calls (i.e. \texttt{\small syscall}
instruction instead of \texttt{\small int 0x80} as with 32-bit system
calls), Linux assumes the process is 64-bit. There are some subtle
issues related to signal handling and memory allocation when performing
this trick, but PTLsim implements workarounds to these issues.

After entering 64-bit mode if needed, the boot code passes control
to PTLsim at \texttt{\small ptlsim\_preinit\_entry}. The \texttt{\small ptlsim\_preinit()}
function checks for the special \texttt{\small LoaderInfo} structure
on the stack and in the ELF header of PTLsim as modified by the boot
code; if these structures are found, PTLsim knows it is running inside
the user program address space. After setting up memory management
and threading, it captures any state the user process was initialized
with. This state is used to fill in fields in the global \texttt{\small ctx}
structure of class \texttt{\small CoreContext}: various floating point
related fields and the user program entry point and original stack
pointer are saved away at this point. If PTLsim is running inside
a 32-bit process, the 32-bit arguments, environment and kernel auxiliary
vector array (auxv) need to be converted to their 64-bit format for
PTLsim to be able to parse them from normal C/C++ code. Finally, control
is returned to \texttt{\small main()} to allow the simulator to start
up normally.


\subsection{Startup on 32-bit x86}

The PTLsim startup process on a 32-bit x86 system is essentially a
streamlined version of the process above (Section \ref{sub:Injection-On-x86-64}),
since there is no need for the same PTLsim binary to support both
32-bit and 64-bit user programs. The injection process is very similar
to the case where the user program is always a 32-bit program.


\section{Simulator Startup}

In \texttt{\footnotesize kernel.cpp}, the \texttt{\footnotesize main()}
function calls \texttt{\footnotesize init\_config()} to read in the
user program specific configuration as described in Sections \ref{sec:RunningPTLsim}
and \ref{sec:ConfigurationOptions}, then starts up the various other
simulator subsystems. If either the \texttt{\footnotesize -excludeld}
or \texttt{\footnotesize -startrip} option was given, a breakpoint
is inserted at the RIP address where the user process should switch
from native mode to simulation mode (this may be at the dynamic linker
entry point by default).

Finally, \texttt{\footnotesize switch\_to\_native\_restore\_context()}
is called to restore the state that existed before PTLsim was injected
into the process and return to the dynamic linker entry point. This
may involve switching from 64-bit back to 32-bit mode to start executing
the user process natively as discussed in Section \ref{sec:Injection}.

After native execution reaches the inserted breakpoint thunk code,
the code performs a 32-to-64-bit long jump back into PTLsim, which
promptly restores the code underneath the inserted breakpoint thunk.
At this point, the \texttt{\footnotesize switch\_to\_sim()} function
in \texttt{\footnotesize kernel.cpp} is invoked to actually begin
the simulation. This is done by calling \texttt{\footnotesize simulate()}
in \texttt{\footnotesize ptlsim.cpp}.

At some point during simulation, the user program or the configuration
file may request a switch back to native mode for the remainder of
the program. In this case, the \texttt{\footnotesize switch\_to\_native\_restore\_context()}
function gets called to save the statistics data store, map the PTLsim
internal state back to the x86 compatible external state and return
to the 32-bit or 64-bit user code, effectively removing PTLsim from
the loop.

While the real PTLsim user process is running, the original PTLsim
injector process simply waits in the background for the real user
program with PTLsim inside it to terminate, then returns its exit
code.
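
The role of the waiting injector process can be sketched with the
standard fork and wait calls. This is a simplified illustration only: it
omits \texttt{\footnotesize ptrace()} and \texttt{\footnotesize exec()}
entirely, and \texttt{\footnotesize run\_and\_wait} is a hypothetical name,
not a PTLsim function.

```cpp
#include <cassert>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Run 'child_main' in a forked child, wait for it in the parent, and return
// the child's exit code (or -1 on failure or abnormal termination).
int run_and_wait(int (*child_main)()) {
  pid_t pid = fork();
  if (pid < 0) return -1;
  if (pid == 0) _exit(child_main());  // child: stands in for the user program
  int status = 0;
  if (waitpid(pid, &status, 0) < 0) return -1;
  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The parent does nothing while the child runs and simply propagates the
child's exit code, which is the behavior described above.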


\section{\label{sec:AddressSpaceSimulation}Address Space Simulation}

PTLsim maintains an instance of the \texttt{\footnotesize AddressSpace} class as
the global variable \texttt{\footnotesize asp} (see \texttt{\footnotesize kernel.cpp})
to track the attributes of each page within the virtual address space.
When compiled for x86-64 systems, PTLsim uses Shadow Page Access Tables
(SPATs), which are essentially large two-level bitmaps. Since pages
are 4096 bytes in size, each 64 kilobyte chunk of the bitmap can track
2 GB of virtual address space. In each SPAT, each top level array
entry points to a chunk mapping 2 GB, such that with 131072 top level
pointers, the full 48 bit virtual address space can typically be mapped
with under a megabyte of SPAT chunks, assuming the address space is
sparse.

When compiled for 32-bit x86 systems, each SPAT is just a 128 KByte
bitmap, with one bit for each of the 1048576 4 KB pages in the 4 GB
address space.
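
The two-level layout and the sizes quoted above can be checked with a
small sketch. The class \texttt{\footnotesize SparseBitmap} and the constant
names are hypothetical; this shows the general technique of a lazily
allocated two-level bitmap under the stated page and chunk sizes, not
PTLsim's actual SPAT implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

constexpr std::uint64_t kPageShift = 12;                  // 4096-byte pages
constexpr std::uint64_t kChunkBytes = 64 * 1024;          // one bitmap chunk
constexpr std::uint64_t kPagesPerChunk = kChunkBytes * 8; // one bit per page
constexpr std::uint64_t kTopEntries = 131072;             // top level pointers

// The figures quoted in the text follow from these constants.
static_assert((kPagesPerChunk << kPageShift) == (2ULL << 30), "chunk covers 2 GB");
static_assert(kTopEntries * (kPagesPerChunk << kPageShift) == (1ULL << 48), "48-bit space");
static_assert(((1ULL << 32) >> kPageShift) / 8 == 128 * 1024, "32-bit SPAT is 128 KB");

// Hypothetical two-level bitmap: chunks are allocated only when first touched.
class SparseBitmap {
  typedef std::vector<std::uint8_t> Chunk;

public:
  SparseBitmap() : top(kTopEntries) {}

  void set(std::uint64_t addr) {
    assert(addr < (1ULL << 48));
    std::uint64_t page = addr >> kPageShift;
    std::unique_ptr<Chunk>& chunk = top[page / kPagesPerChunk];
    if (!chunk) chunk.reset(new Chunk(kChunkBytes));  // zero-filled chunk
    std::uint64_t bit = page % kPagesPerChunk;
    (*chunk)[bit / 8] |= static_cast<std::uint8_t>(1u << (bit % 8));
  }

  bool test(std::uint64_t addr) const {
    assert(addr < (1ULL << 48));
    std::uint64_t page = addr >> kPageShift;
    const std::unique_ptr<Chunk>& chunk = top[page / kPagesPerChunk];
    if (!chunk) return false;  // untouched 2 GB region
    std::uint64_t bit = page % kPagesPerChunk;
    return (((*chunk)[bit / 8] >> (bit % 8)) & 1) != 0;
  }

private:
  std::vector<std::unique_ptr<Chunk>> top;
};
```

A sparse address space touches only a handful of 2 GB regions, so only a
handful of 64 KB chunks are ever allocated.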

In the AddressSpace structure, there are separate SPAT tables for
readable pages (\texttt{\footnotesize readmap} field), writable pages
(\texttt{\footnotesize writemap} field) and executable pages (\texttt{\footnotesize execmap}
field). Two additional SPATs, \texttt{\footnotesize dtlbmap} and \texttt{\footnotesize itlbmap},
are used to track which pages are currently mapped by the simulated
translation lookaside buffers (TLBs); this is discussed further in
Section \ref{sec:TranslationLookasideBuffers}.

When running in native mode, PTLsim cannot track changes to the process
memory map made by native calls to \texttt{\footnotesize mmap()},
\texttt{\footnotesize munmap()}, etc. Therefore, at every switch from
native to simulation mode, the \texttt{\footnotesize resync\_with\_process\_maps()}
function is called. This function parses the \texttt{\footnotesize /proc/self/maps}
metafile maintained by the kernel to build a list of all regions mapped
by the current process. Using this list, the SPATs are rebuilt to
reflect the current memory map. This is absolutely critical for correct
operation, since during simulation, speculative loads and stores will
only read and write memory if the appropriate SPAT indicates the address
is accessible to user code. If the SPATs become out of sync with the
real memory map, PTLsim itself may crash rather than simply marking
the offending load or store as invalid. The \texttt{\footnotesize resync\_with\_process\_maps()}
function (or more specifically, the \texttt{\footnotesize mqueryall()}
helper function) is fairly kernel version specific since the format
of \texttt{\footnotesize /proc/self/maps} has changed between Linux
2.6.x kernels. New kernels may require updating this function.
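
Extracting the address range and permissions from one line of the maps
file can be sketched as below. The sketch assumes the commonly seen
layout (hexadecimal start and end addresses separated by a dash, followed
by a four character permission field); the
\texttt{\footnotesize MapRegion} structure and
\texttt{\footnotesize parse\_maps\_line} function are hypothetical and are
not the code used by \texttt{\footnotesize mqueryall()}.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical record for one mapped region.
struct MapRegion {
  unsigned long long start;
  unsigned long long end;
  bool readable, writable, executable;
};

// Parse the address range and permission field of one maps line, e.g.
//   "00400000-0040b000 r-xp 00000000 08:01 1234 /bin/ls"
// Returns false if the line does not match this layout.
bool parse_maps_line(const std::string& line, MapRegion& out) {
  char perms[5] = {0};
  if (std::sscanf(line.c_str(), "%llx-%llx %4s", &out.start, &out.end, perms) != 3) {
    return false;
  }
  out.readable = (perms[0] == 'r');
  out.writable = (perms[1] == 'w');
  out.executable = (perms[2] == 'x');
  return true;
}
```

Each parsed region would then be used to set the corresponding page bits
in the readable, writable and executable maps.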


\section{\label{sec:DebuggingHints}Debugging Hints}

When adding to or modifying PTLsim, bugs will invariably crop up. Fortunately,
PTLsim provides a trivial way to find the location of bugs which silently
corrupt program execution. Since PTLsim can transparently switch between
simulation and native mode, isolating the divergence point between
the simulated behavior and what a real reference machine would do
can be done through binary search. The \texttt{\footnotesize -stopinsns}
configuration option can be set to stop simulation before the problem
occurs, then incremented until the first x86 instruction to break
the program is determined.
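
The search for the first misbehaving instruction can be sketched as a
standard binary search. The callback \texttt{\footnotesize run\_is\_correct}
is hypothetical: in practice it would stand for running the program under
PTLsim with \texttt{\footnotesize -stopinsns} set to the given count and
checking whether the program still behaves correctly. The sketch assumes
that once the program breaks at some count, it stays broken for all larger
counts.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Find the smallest instruction count n in (lo, hi] for which the program
// misbehaves, given that it is correct at 'lo' and broken at 'hi'.
std::uint64_t first_bad_insn_count(
    std::uint64_t lo, std::uint64_t hi,
    const std::function<bool(std::uint64_t)>& run_is_correct) {
  while (hi - lo > 1) {
    std::uint64_t mid = lo + (hi - lo) / 2;
    if (run_is_correct(mid)) {
      lo = mid;  // still fine after 'mid' simulated instructions
    } else {
      hi = mid;  // already broken: the culprit is at or before 'mid'
    }
  }
  return hi;
}
```

Each probe halves the remaining range, so even a span of a billion
instructions needs only about thirty runs.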

The out of order simulator (\texttt{\footnotesize ooocore.cpp}) includes
extensive debugging and integrity checking assertions. These may be
turned off by default for improved performance, but they can be easily
re-enabled by defining the \texttt{\footnotesize ENABLE\_CHECKS} symbol
at the top of \texttt{\footnotesize ooocore.cpp}, \texttt{\footnotesize ooopipe.cpp}
and \texttt{\footnotesize oooexec.cpp}. Additional check functions
are in the code but commented out; these may be used as well.

You can also debug PTLsim with \texttt{\small gdb}, although the process
is non-standard due to PTLsim's co-simulation architecture:

\begin{itemize}
\item Start PTLsim on the target program as usual. Note the \texttt{\footnotesize Thread}{\footnotesize{}
}\texttt{\emph{\footnotesize N}}{\footnotesize{} }\texttt{\footnotesize is
running in XX-bit mode} message printed at startup: \emph{N} is the PID
you will be debugging, not the {}``\texttt{\small ptlsim}'' process
that may also be running.
\item Start GDB and type {}``\texttt{\footnotesize attach 12345}'' if
\emph{12345} was the PID listed above.
\item Type {}``\texttt{\footnotesize symbol-file ptlsim}'' to load the
PTLsim internal symbols (otherwise gdb only knows about the benchmark
code itself). You should specify the full path to the PTLsim executable
here.
\item You're now debugging PTLsim. If you run the {}``\texttt{\small bt}''
command to get a backtrace, it should show the PTLsim functions starting
at address 0x70000000.
\end{itemize}
If the backtrace does not display enough information, go to the \texttt{\footnotesize Makefile}
and enable the \char`\"{}no optimization\char`\"{} options (the \char`\"{}-O0\char`\"{}
line instead of \char`\"{}-O99\char`\"{}) since that will make more
debugging information available to you.

The {}``\texttt{\footnotesize -pause-at-startup} \emph{seconds}''
configuration option may be useful here, to give you time to attach
with a debugger before starting the simulation.


\section{\label{sec:Timing}Timing Issues}

PTLsim uses the \texttt{\footnotesize CycleTimer} class extensively
to gather data about its own performance using the CPU's timestamp
counter. At startup in \texttt{\small superstl.cpp}, the CPU's maximum
frequency is queried from the appropriate Linux kernel sysfs node
(if available) or from \texttt{\small /proc/cpuinfo} if not. Processors
which dynamically scale their frequency and voltage in response to
load (like all Athlon 64 and K8 based AMD processors) require special
handling. It is assumed that the processor will be running at its
maximum frequency (as reported by sysfs) or a fixed frequency (as
reported by \texttt{\small /proc/cpuinfo}) throughout the majority
of the simulation time; otherwise the timing results will be bogus.
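
The dependence of timing results on a fixed frequency can be seen in a
small sketch. It assumes a \texttt{\footnotesize /proc/cpuinfo} line of the
form \texttt{\footnotesize cpu MHz : 2400.000}; both function names are
hypothetical and are not part of the \texttt{\footnotesize CycleTimer}
class.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Extract the frequency in Hz from a /proc/cpuinfo line of the form
// "cpu MHz : 2400.000". Returns 0 if the line is not a "cpu MHz" line.
double parse_cpu_hz(const std::string& line) {
  if (line.compare(0, 7, "cpu MHz") != 0) return 0.0;
  std::string::size_type colon = line.find(':');
  if (colon == std::string::npos) return 0.0;
  return std::strtod(line.c_str() + colon + 1, nullptr) * 1e6;
}

// Convert a timestamp counter delta to seconds, assuming the counter
// advanced at a constant 'hz' for the whole interval.
double cycles_to_seconds(unsigned long long cycles, double hz) {
  return static_cast<double>(cycles) / hz;
}
```

If the processor actually ran at a lower frequency for part of the
interval, the computed time would be wrong by the same proportion.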


\section{External Signals and PTLsim}

PTLsim can be forced to switch between native mode and simulation
mode by sending it standard Linux-style signals from the command line.
If your program is called {}``myprogram'', start it under PTLsim
and run this command from another terminal:

\begin{lyxcode}
{\footnotesize killall~-XCPU~}\emph{\footnotesize myprogram}{\footnotesize \par}
\end{lyxcode}
This will force PTLsim to switch between native mode and simulation
mode, depending on its current mode. It will print a message to the
console and the logfile when you do this. The initial mode (native
or simulation) is determined by the presence of the \texttt{\footnotesize -trigger}
option: with \texttt{\footnotesize -trigger}, the program starts in
native mode until the trigger point (if any) is reached.
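
The toggling behavior can be sketched with an ordinary signal handler.
This is a generic illustration of reacting to
\texttt{\footnotesize SIGXCPU}; the flag and function names are
hypothetical, and PTLsim's real handler does far more work when changing
modes.

```cpp
#include <cassert>
#include <signal.h>

// Hypothetical mode flag: 0 = native, 1 = simulation.
static volatile sig_atomic_t current_mode = 0;

// Flip the mode each time the signal arrives.
extern "C" void toggle_mode_handler(int) {
  current_mode = !current_mode;
}

// Install the handler for SIGXCPU; returns true on success.
bool install_mode_toggle() {
  struct sigaction sa = {};
  sa.sa_handler = toggle_mode_handler;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = 0;
  return sigaction(SIGXCPU, &sa, nullptr) == 0;
}
```

Each delivery of the signal flips the flag, so sending it twice returns
the process to its original mode.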


\part{\label{sec:PTLsimFullSystem}PTLsim/X: Full System SMP/SMT Simulation}


\chapter{Background}


\section{Virtual Machines and Full System Simulation}

Full system simulation and virtualization have been around since the
dawn of computers. Typically \emph{virtual machine} software is used
to run \emph{guest} operating systems on a physical \emph{host} system,
such that the guest believes it is running directly on the bare hardware.
Modern full system simulators in the x86 world can be roughly divided
into two groups (this paper does not consider systems for other instruction
sets).

\emph{Hypervisors} execute most unprivileged instructions on the native
CPU at full speed, but trap privileged instructions used by the operating
system kernel, where they are emulated by hypervisor software so as
to maintain isolation between virtual machines and make the virtual
machine nearly indistinguishable from the real CPU. In some cases
(particularly on x86), additional software techniques are needed to
fully hide the hypervisor from the guest OS.

\begin{itemize}
\item \emph{Xen} \cite{Xen2Overview,Xen3,XenCambridge,XenIntroWiki,XenPerformance,XenSource}
represents the current state of the art in this field; it will be
described in great detail later on.
\item \emph{VMware} \cite{VMware} is a very well known commercial product
that allows unmodified x86 operating systems to run inside a virtual
machine. Because the x86 instruction set is not fully virtualizable,
VMware must employ x86-to-x86 binary translation techniques on kernel
code (but not user mode code) to make the virtual CPU indistinguishable
from the real CPU for compatibility reasons. These translations are
typically cached in a hidden portion of the guest address space to
improve performance compared to simply interpreting sensitive x86
instructions. While this approach is sophisticated and effective,
it exacts a heavy performance penalty on I/O intensive workloads \cite{XenPerformance}.
Interestingly, the latest microprocessors from Intel and AMD include
hardware features (Intel VT \cite{Intel-VT}, AMD SVM \cite{AMD-SVM})
to eliminate the binary translation and patching overhead. Xen fully
supports these technologies to allow running Windows and other OS's
at full speed, while VMware has yet to include full support.


VMware comes in two flavors. ESX is a true hypervisor that boots on
the bare hardware underneath the first guest OS. GSX and Workstation
use a userspace frontend process containing all virtual device drivers
and the binary translator, while the \emph{vmmon} kernel module (open
source in the Linux version) handles memory virtualization and context
switching tasks similar to Xen.

\item Several other products, including Virtual PC and Parallels, provide
features similar to VMware using similar technology.
\item \emph{KVM} (Kernel Virtual Machine) is a new hypervisor infrastructure
built into all Linux kernels after 2.6.19. It depends on the hardware
virtualization extensions (Intel VT and AMD SVM) built into modern
x86 chips, whereas Xen and VMware also support running on older processors
without special hardware support. KVM is an attractive foundation
for future virtual machine development since it's built into Linux
(so it requires far less setup work than Xen or VMware) and provides
excellent performance.
\end{itemize}
Unlike hypervisors, \emph{simulators} perform cycle accurate execution
of x86 instructions using interpreter software, without running any
guest instructions on the native CPU.

\begin{itemize}
\item \emph{Bochs} \cite{Bochs} is the most well known open source x86
simulator; it is considered to be a nearly RTL (register transfer
language) level description of every x86 behavior from legacy 16-bit
features up through modern x86-64 instructions. \emph{Bochs} is very
useful for the functional validation of real x86 microprocessors,
but it is very slow (around 5-10 MHz equivalent) and is not useful
for implementing cycle accurate models of modern uop-based out of
order x86 processors (for instance, it does not model caches, memory
latency, functional units and so on).
\item \emph{QEMU} \cite{QEMU} is similar in purpose to VMware, but unlike
VMware, it supports multiple CPU host and guest architectures (PowerPC,
SPARC, ARM, etc). QEMU uses binary translation technology similar
to VMware to hide the hypervisor's presence from the guest kernel.
However, due to its cross platform design, both kernel and user code
are passed through x86-to-x86 binary translation (even on x86 platforms)
and stored in a translation cache. Interestingly, Xen uses a substantial
amount of QEMU code to model common hardware devices when running
unmodified operating systems like Windows, but Xen still uses its
own hardware-assisted technology to actually achieve virtualization.
QEMU also supports a proprietary hypervisor module that adds the ability,
as in VMware and Xen, to run user mode code natively on the CPU, reducing
the performance penalty; hence it also belongs in the hypervisor category.
\item \emph{Simics} \cite{Simics} is a commercial simulation suite for
modeling both the functional aspects of various x86 processors (including
vendor specific extensions) and user-designed plug-in models
of real hardware devices. It is used extensively in industry for modeling
new hardware and drivers, as well as for firmware level debugging. Like
QEMU, Simics uses x86-to-x86 binary translation to instrument code
at a very low level while achieving good performance (though noticeably
slower than a hypervisor). Unlike QEMU, Simics is fully extensible
and supports a huge range of real hardware models, but it is not possible
to add cycle accurate simulation features below the x86 instruction
level, making it less useful to microarchitects (both for technical
reasons and because it is a closed source product).
\item \emph{SimNow} \cite{SimNow} is an AMD simulation tool used during
the design and validation of AMD's x86-64 hardware. Like Simics, it
is a functional simulator only, but it models a variety of AMD-built
hardware devices. SimNow uses x86-to-x86 binary translation technology
similar to Simics and QEMU to achieve good performance. Because SimNow
does not provide cycle accurate timing data, AMD uses its own TSIM
trace-based simulator, derived from the K8 RTL, to do actual validation
and timing studies. SimNow is available for free to the public, albeit
as closed source.
\end{itemize}
All of these tools share one common disadvantage: they are unable
to model execution at a level below the granularity of x86 instructions,
making them unsuitable for microarchitects. PTLsim/X seeks to fill
this void by allowing extremely detailed uop-level cycle accurate
simulation of x86 and x86-64 microprocessor cores, while simultaneously
delivering the performance benefits of true native-mode hypervisors
like Xen and of selective binary translation based hypervisors like VMware
and QEMU, together with the detailed hardware modeling capabilities of Bochs
and Simics.


\section{Xen Overview}

Xen \cite{Xen3,Xen2Overview,XenCambridge,XenIntroWiki,XenPerformance,XenSource}
is an open source x86 virtual machine monitor, also known as a \emph{hypervisor}.
Each virtual machine is called a {}``domain'', where domain 0 is
privileged and accesses all hardware 