%% LyX 1.5.1 created this file.  For more info, see http://www.lyx.org/.
%% Do not edit unless you really know what you are doing.
\documentclass[12pt,english]{report}
\usepackage{mathptmx}
\usepackage{helvet}
\renewcommand{\ttdefault}{cmtt}
\usepackage[T1]{fontenc}
\usepackage[latin9]{inputenc}
\usepackage{geometry}
\geometry{verbose,letterpaper,tmargin=1in,bmargin=1in,lmargin=1in,rmargin=1in,headheight=0in,headsep=0in,footskip=0.25in}
\setcounter{secnumdepth}{3}
\setcounter{tocdepth}{3}
\setlength{\parskip}{\medskipamount}
\setlength{\parindent}{0pt}
\usepackage{array}
\usepackage{fancybox}
\usepackage{calc}
\usepackage{setspace}

\makeatletter

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% LyX specific LaTeX commands.
\newcommand{\lyxline}[1][1pt]{%
  \par\noindent%
  \rule[.5ex]{\linewidth}{#1}\par}
%% Bold symbol macro for standard LaTeX users
\providecommand{\boldsymbol}[1]{\mbox{\boldmath $#1$}}

%% Because html converters don't know tabularnewline
\providecommand{\tabularnewline}{\\}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Textclass specific LaTeX commands.
\newenvironment{lyxcode}
{\begin{list}{}{
\setlength{\rightmargin}{\leftmargin}
\setlength{\listparindent}{0pt}% needed for AMS classes
\raggedright
\setlength{\itemsep}{0pt}
\setlength{\parsep}{0pt}
\normalfont\ttfamily}%
 \item[]}
{\end{list}}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% User specified LaTeX commands.
\usepackage[pdftitle={PTLsim User's Guide and Reference},colorlinks=true,linkcolor=blue,anchorcolor=blue,citecolor=blue,urlcolor=blue]{hyperref}

\usepackage{babel}
\makeatother

\begin{document}
\noindent \begin{center}
\textbf{\huge ~}
\par\end{center}{\huge \par}

\vfill{}


\noindent \begin{center}
\textsf{\textbf{\Huge PTLsim User's Guide and Reference}}
\par\end{center}{\Huge \par}

\noindent \begin{center}
\emph{\huge The Anatomy of an x86-64 Out of Order}\\
\emph{\huge Superscalar Microprocessor}
\par\end{center}{\huge \par}

\bigskip{}


\noindent \begin{center}
{\LARGE Matt T. Yourst}\\
\texttt{\large <yourst@yourst.com>}
\par\end{center}{\large \par}

\noindent \begin{center}
Revision 20070923\\
\textbf{\emph{Second Edition}}
\par\end{center}

\vfill{}


\noindent \begin{center}
The latest version of PTLsim and this document are always available
at:\textsf{\textbf{\LARGE }}\\
\textsf{\textbf{\LARGE }}\\
\textsf{\textbf{\LARGE www.ptlsim.org}}
\par\end{center}{\LARGE \par}

\bigskip{}


\noindent \begin{center}
$\copyright$ 2007 Matt T. Yourst \texttt{\small <yourst@yourst.com>}.
\par\end{center}

\noindent \begin{center}
The PTLsim software and manual are free software;\\
they are licensed under the GNU General Public License version 2.
\par\end{center}

\tableofcontents{}


\part{\label{part:Introduction}PTLsim User's Guide}


\chapter{Introducing PTLsim}


\section{Introducing PTLsim}

\textbf{PTLsim} is a state of the art cycle accurate microprocessor
simulator and virtual machine for the x86 and x86-64 instruction sets.
PTLsim models a modern superscalar out of order x86-64 compatible
processor core at a configurable level of detail ranging from full-speed
native execution on the host CPU all the way down to RTL level models
of all key pipeline structures. In addition, the complete cache hierarchy,
memory subsystem and supporting hardware devices are modeled with
true cycle accuracy. PTLsim supports the full x86-64 instruction set
of the Pentium 4+, Athlon 64 and similar machines with all extensions
(x86-64, SSE/SSE2/SSE3, MMX, x87). It is currently the only tool available
to the public to support true cycle accurate modeling of real x86
microarchitectures.

PTLsim is very different from most cycle accurate simulators. Because
it runs directly on the same platform it is simulating (an x86 or
x86-64 machine, typically running Linux), it is able to switch in
and out of full out of order simulation mode and native x86 or x86-64
mode at any time completely transparent to the running user code.
This lets users quickly profile a small section of the user code without
the overhead of emulating the uninteresting parts, and enables automatic
debugging by finding the divergence point between a real reference
machine and the simulation.

PTLsim comes in two flavors. The classic version runs any 32-bit or
64-bit single threaded userspace Linux application. We have successfully
run a wide array of programs under PTLsim, from typical benchmarks
to graphical applications and network servers.

PTLsim/X runs on the bare hardware and integrates with Xen hypervisor,
allowing it to provide full system x86-64 simulation, multi-processor
and multi-threading support (SMT and multi-core models), checkpoints,
cycle accurate virtual device timing models, deterministic time dilation,
and much more, all without sacrificing the speed and accuracy inherent
in PTLsim's design. PTLsim/X makes it possible to run any Xen-compatible
operating system under simulation; we have successfully booted arbitrary
Linux distributions and industry standard applications and benchmarks
under PTLsim/X.

Compared to competing simulators, PTLsim provides extremely high performance
even when running in full cycle accurate out of order simulation mode.
Through extensive tuning, cache profiling and the use of x86 specific
accelerated vector operations and instructions, PTLsim significantly
cuts simulation time compared to traditional research simulators.
Even with its optimized core, PTLsim still allows a significant amount
of flexibility for easy experimentation through the use of optimized
C++ template classes and libraries suited to synchronous logic design.


\section{History}

PTLsim was designed and developed by Matt T. Yourst \texttt{\footnotesize <yourst@yourst.com>}
with its beginnings dating back to 2001. The main PTLsim code base,
including the out of order processor model, has been in active development
since 2003 and has been used extensively by our processor design research
group at the State University of New York at Binghamton in addition
to hundreds of major universities, industry research labs and several
well known microprocessor vendors.

PTLsim is not related to other legacy simulators. It is our hope that
PTLsim will help microprocessor researchers move to a contemporary
and widely used instruction set (x86 and x86-64) with readily available
hardware implementations. This will provide a new option for researchers
stuck with simulation tools supporting only the Alpha or MIPS based
instruction sets, both of which have since been discontinued on real
commercially available hardware (making co-simulation impossible)
with an uncertain future in up to date compiler toolchains.

The PTLsim software and this manual are free software, licensed under
the GNU General Public License version 2.


\chapter{Getting Started}


\section{Documentation Map}

This manual has been divided into several parts:

\begin{itemize}
\item Part \ref{part:Introduction} introduces PTLsim, reviews the x86 architecture,
and describes PTLsim's implementation of x86 in terms of uops, microcode
and internal structures.
\item Part \ref{sec:PTLsimClassic} describes the use and implementation
of userspace PTLsim.

\begin{itemize}
\item If you simply want to \emph{use} PTLsim, this part starts with an
easy to follow \textbf{tutorial}
\end{itemize}
\item Part \ref{sec:PTLsimFullSystem} describes the use and implementation
of full system PTLsim/X.

\begin{itemize}
\item If you simply want to \emph{use} full system PTLsim/X, this part starts
with an easy to follow \textbf{tutorial}
\end{itemize}
\item Part \ref{part:OutOfOrderModel} details the design and implementation
of the PTLsim out of order superscalar core model

\begin{itemize}
\item Read this part if you want to understand and modify PTLsim's out of
order core.
\end{itemize}
\item Part \ref{part:Appendices} is a reference manual for the PTLsim internal
uop instruction set, the performance monitoring events the simulator
supports and a variety of other technical information.
\end{itemize}

\section{Additional Resources}

The latest version of PTLsim and this document are always available
at the PTLsim web site:

\begin{quote}
\textsf{\textbf{\large http://www.ptlsim.org}}{\large \par}
\end{quote}

\chapter{PTLsim Architecture}


\chapter{\label{sec:PTLsimCodeBase}PTLsim Code Base}


\section{Code Base Overview}

PTLsim is written in C++ with extensive use of x86 and x86-64 inline
assembly code. It must be compiled with gcc on a Linux 2.6 based x86
or x86-64 machine. The C++ variant used by PTLsim is known as Embedded
C++. Essentially, we only use the features found in C, but add templates,
classes and operator overloading. Other C++ features such as hidden
side effects in constructors, exception handling, RTTI, multiple inheritance,
virtual methods (in most cases), thread local storage and so on are
forbidden since they cannot be adequately controlled in the embedded
{}``bare hardware'' environment in which PTLsim runs, and can result
in poor performance. We have our own standard template library, SuperSTL,
that must be used in place of the C++ STL.

Even though the PTLsim code base is very large, it is well organized
and structured for extensibility. The following section is an overview
of the source files and subsystems in PTLsim:

\begin{itemize}
\item \textbf{PTLsim Core Subsystems:}

\begin{itemize}
\item \texttt{\textbf{\small ptlsim.cpp}} and \texttt{\textbf{\small ptlsim.h}}
are responsible for general top-level PTLsim tasks and starting the
appropriate simulation core code.
\item \texttt{\textbf{\small uopimpl.cpp}} contains implementations of all
uops and their variations. PTLsim implements most ALU and floating
point uops in assembly language so as to leverage the exact semantics
and flags generated by real x86 instructions, since most PTLsim uops
are so similar to the equivalent x86 instructions. When compiled on
a 32-bit system, some of the 64-bit uops must be emulated using slower
C++ code.
\item \texttt{\textbf{\small ptlhwdef.cpp}} and \texttt{\textbf{\small ptlhwdef.h}}
define the basic uop encodings, flags and registers. The tables of
uops might be interesting to see how a modern x86 processor is designed
at the microcode level. The basic format is discussed in Section \ref{sec:UopIntro};
all uops are documented in Section \ref{sec:UopReference}.
\item \texttt{\textbf{\small seqcore.cpp}} implements the sequential in-order
core. This is a strictly functional core, without data caches, branch
prediction and so forth. Its purpose is to provide fast execution
of the raw uop stream and debugging of issues with the decoder, microcode
or virtual hardware rather than a specific core model.
\end{itemize}
\item \textbf{Decoder, Microcode and Basic Block Cache:}

\begin{itemize}
\item \texttt{\textbf{\small decode-core.cpp}}{\small{} }coordinates the
translation from x86 and x86-64 into uops, maintains the basic block
cache and handles self modifying code, invalidation and other x86
specific complexities.
\item \texttt{\textbf{\small decode-fast.cpp}} decodes the subset of the
x86 instruction set used by 95\% of all instructions with four or
fewer uops. It should be considered the {}``fast path'' decoder
in a hardware microprocessor.
\item \texttt{\textbf{\small decode-complex.cpp}} decodes complex instructions
into microcode, and provides most of the assists (microcode subroutines)
required by x86 machines.
\item \texttt{\textbf{\small decode-sse.cpp}} decodes all SSE, SSE2, SSE3
and MMX instructions
\item \texttt{\textbf{\small decode-x87.cpp}} decodes x87 floating point
instructions and provides the associated microcode
\item \texttt{\textbf{\small decode.h}} contains definitions of the above
functions and classes.
\end{itemize}
\item \textbf{Out Of Order Core:}

\begin{itemize}
\item \texttt{\textbf{\small ooocore.cpp}} is the out of order simulator
control logic. The microarchitectural model implemented by this simulator
is the subject of Part \ref{part:OutOfOrderModel}.
\item \texttt{\textbf{\small ooopipe.cpp}} implements the discrete pipeline
stages (frontend and backend) of the out of order model.
\item \texttt{\textbf{\small oooexec.cpp}} implements all functional units,
load/store units and issue queue and replay logic
\item \texttt{\textbf{\small ooocore.h}} defines most of the configurable
parameters for the out of order core not intrinsic to the PTLsim uop
instruction set itself.
\item \texttt{\textbf{\small dcache.cpp}} and \texttt{\textbf{\small dcache.h}}
contain the data cache model. At present the full L1/L2/L3/mem hierarchy
is modeled, along with miss buffers, load fill request queues, ITLB/DTLB
and bus interfaces. The cache hierarchy is very flexible configuration
wise; it is described further in Section \ref{sec:CacheHierarchy}.
\item \texttt{\textbf{\small branchpred.cpp}} and \texttt{\textbf{\small branchpred.h}}
is the branch predictor. By default, this is set up as a hybrid bimodal
and history based predictor with various customizable parameters.
\end{itemize}
\item \textbf{Linux Hosted Kernel Interface:}

\begin{itemize}
\item \texttt{\textbf{\small kernel.cpp}} and \texttt{\textbf{\small kernel.h}}
is where all the virtual machine \char`\"{}black magic\char`\"{} takes
place to let PTLsim transparently switch between simulation and native
mode and 32-bit/64-bit mode (or only 32-bit mode on a 32-bit x86 machine).
In general you should not need to touch this since it is very Linux
kernel specific and works at a level below the standard C/C++ libraries.
\item \texttt{\textbf{\small lowlevel-64bit.S}} contains 64-bit startup
and context switching code. PTLsim execution starts here if run on
an x86-64 system.
\item \texttt{\textbf{\small lowlevel-32bit.S}} contains 32-bit startup
and context switching code. PTLsim execution starts here if run on
a 32-bit x86 system.
\item \texttt{\textbf{\small injectcode.cpp}} is compiled into the 32-bit
and 64-bit code injected into the target process to map the \texttt{\small ptlsim}
binary and pass control to it.
\item \texttt{\textbf{\small loader.h}} is used to pass information to the
injected boot code.
\end{itemize}
\item \textbf{PTLsim/X Bare Hardware and Xen Interface:}

\begin{itemize}
\item \texttt{\textbf{\small ptlxen.cpp}} brings up PTLsim on the bare hardware,
dispatches traps and interrupts, virtualizes Xen hypercalls, communicates
via DMA with the PTLsim monitor process running in the host domain
0 and otherwise serves as the kernel of PTLsim's own mini operating
system.
\item \texttt{\textbf{\small ptlxen-memory.cpp}} is responsible for all
page based memory operations within PTLsim. It manages PTLsim's own
internal page tables and its physical memory map, and services page
table walks, parts of the x86 microcode and memory-related Xen hypercalls.
\item \texttt{\textbf{\small ptlxen-events.cpp}} provides all interrupt
(VIRQ) and event handling, manages PTLsim's time dilation technology,
and provides all time and event related hypercalls.
\item \texttt{\textbf{\small ptlxen-common.cpp}} provides common functions
used by both PTLsim itself and PTLmon.
\item \texttt{\textbf{\small ptlxen.h}} provides inline functions and defines
related to full system PTLsim/X.
\item \texttt{\textbf{\small ptlmon.cpp}} provides the PTLsim monitor process,
which runs in domain 0 and interfaces with the PTLsim hypervisor code
inside the target domain to allow it to communicate with the outside
world. It uses a client/server architecture to forward control commands
to PTLsim using DMA and Xen hypercalls.
\item \texttt{\textbf{\small xen-types.h}} contains Xen-specific type definitions
\item \texttt{\textbf{\small ptlsim-xen-hypervisor.diff}} and \texttt{\textbf{\small ptlsim-xen-tools.diff}}
are patches that must be applied to the Xen hypervisor source tree
and the Xen userspace tools, respectively, to allow PTLsim to be injected
into domains.
\item \texttt{\textbf{\small ptlxen.lds}} and \texttt{\textbf{\small ptlmon.lds}}
are linker scripts used to lay out the memory image of PTLsim and
PTLmon.
\item \texttt{\textbf{\small lowlevel-64bit-xen.S}} contains the PTLsim/X
boot code, interrupt handling and exception handling
\item \texttt{\textbf{\footnotesize ptlctl.cpp}} is a utility used within
a domain under simulation to control PTLsim
\item \texttt{\textbf{\footnotesize ptlcalls.h}} provides a library of functions
used by code within the target domain to control PTLsim.
\end{itemize}
\item \textbf{Support Subsystems:}

\begin{itemize}
\item \texttt{\textbf{\small superstl.h}}, \texttt{\textbf{\small superstl.cpp}}
and \texttt{\textbf{\small globals.h}} implement various standard
library functions and classes as an alternative to C++ STL. These
libraries also contain a number of features very useful for bit manipulation.
\item \texttt{\textbf{\small logic.h}} is a library of C++ templates for
implementing synchronous logic structures like associative arrays,
queues, register files, etc. It has some very clever features like
\texttt{\small FullyAssociativeArray8bit}, which uses x86 SSE vector
instructions to associatively match and process \textasciitilde{}16
byte-sized tags every cycle. These classes are fully parameterized
and useful for all kinds of simulations.
\item \texttt{\textbf{\small mm.cpp}} is the PTLsim custom memory manager.
It provides extremely fast memory allocation functions based on multi-threaded
slab caching (the same technique used inside Linux itself) and extent
allocation, along with a traditional physical page allocator. The
memory manager also provides PTLsim's garbage collection system, used
to discard unused or least recently used objects when allocations
fail.
\item \texttt{\textbf{\small mathlib.cpp}} and \texttt{\textbf{\small mathlib.h}}
provide standard floating point functions suitable for embedded systems
use. These are used heavily as part of the x87 microcode.
\item \texttt{\textbf{\small klibc.cpp}} and \texttt{\textbf{\small klibc.h}}
provide standard libc-like library functions suitable for use on the
bare hardware
\item \texttt{\textbf{\small syscalls.cpp}} and \texttt{\textbf{\small syscalls.h}}
declare all Linux system call stubs. This is also used by PTLsim/X,
which emulates some Linux system calls to make porting easier.
\item \texttt{\textbf{\small config.cpp}} and \texttt{\textbf{\small config.h}}
manage the parsing of configuration options for each user program.
This is a general purpose library used by both PTLsim itself and the
userspace tools (PTLstats, etc)
\item \texttt{\textbf{\small datastore.cpp}} and \texttt{\textbf{\small datastore.h}}
manage the PTLsim statistics data store file structure.
\end{itemize}
\item \textbf{Userspace Tools:}

\begin{itemize}
\item \texttt{\textbf{\small ptlstats.cpp}} is a utility for printing and
analyzing the statistics data store files in various human readable
ways.
\item \texttt{\textbf{\small dstbuild}} is a Perl script used to parse stats.h
and generate the datastore template (Section \ref{sec:StatisticsInfrastructure})
\item \texttt{\textbf{\small makeusage.cpp}} is used to capture the usage
text (help screen) for linking into PTLsim
\item \texttt{\textbf{\small cpuid.cpp}} is a utility program to show various
data returned by the x86 \texttt{\small cpuid} instruction. Run it
under PTLsim for a surprise.
\item \texttt{\textbf{\small glibc.cpp}} contains miscellaneous userspace
functions
\item \texttt{\textbf{\small ptlcalls.c}} and \texttt{\textbf{\small ptlcalls.h}}
are optionally compiled into user programs to let them switch into
and out of simulation mode on their own. The \texttt{\textbf{\small ptlcalls.o}}
file is typically linked with Fortran programs that can't use regular
C header files.
\end{itemize}
\end{itemize}

\section{Common Libraries and Logic Design APIs}

PTLsim includes a number of powerful C++ templates, macros and functions
not found anywhere else. This section attempts to provide an overview
of these structures so that users of PTLsim will use them instead
of trying to duplicate work we've already done.


\subsection{General Purpose Macros}

The file \texttt{\small globals.h} contains a wide range of very useful
definitions, functions and macros we have accumulated over the years,
including:

\begin{itemize}
\item Basic data types used throughout PTLsim (e.g. \texttt{\footnotesize W64}
for 64-bit words, \texttt{\footnotesize Waddr} for words the same
size as pointers, and so on)
\item Type safe C++ template based functions, including \texttt{\footnotesize min},
\texttt{\footnotesize max}, \texttt{\footnotesize abs}, \texttt{\footnotesize mux},
etc.
\item Iterator macros (\texttt{\footnotesize foreach}) 
\item Template based metaprogramming functions including \texttt{\footnotesize lengthof}
(finds the length of any static array), \texttt{\footnotesize offsetof}{\footnotesize{}
}(offset of member in structure), \texttt{\footnotesize baseof} (member
to base of structure), and \texttt{\footnotesize log2} (takes the
base-2 log of any constant at compile time)
\item Floor, ceiling and masking functions for integers and powers of two
(\texttt{\footnotesize floor}, \texttt{\footnotesize trunc}, \texttt{\footnotesize ceil},
\texttt{\footnotesize mask}, \texttt{\footnotesize floorptr}, \texttt{\footnotesize ceilptr},
\texttt{\footnotesize maskptr}, \texttt{\footnotesize signext}, etc)
\item Bit manipulation macros (\texttt{\footnotesize bit}, \texttt{\footnotesize bitmask},
\texttt{\footnotesize bits}, \texttt{\footnotesize lowbits}, \texttt{\footnotesize setbit},
\texttt{\footnotesize clearbit}, \texttt{\footnotesize assignbit}).
Note that the \texttt{\footnotesize bitvec} template (see below) should
be used in place of these macros wherever it is more convenient.
\item Comparison functions (\texttt{\footnotesize aligned}, \texttt{\footnotesize strequal},
\texttt{\footnotesize inrange}, \texttt{\footnotesize clipto})
\item Modulo arithmetic (\texttt{\footnotesize add\_index\_modulo}, \texttt{\footnotesize modulo\_span},
et al)
\item Definitions of basic x86 SSE vector functions (e.g. \texttt{\footnotesize x86\_cpu\_pcmpeqb}{\footnotesize{}
}et al)
\item Definitions of basic x86 assembly language functions (e.g. \texttt{\footnotesize x86\_bsf64}
et al)
\item A full suite of bit scanning functions (\texttt{\footnotesize lsbindex},
\texttt{\footnotesize msbindex}, \texttt{\footnotesize popcount} et
al)
\item Miscellaneous functions (\texttt{\footnotesize arraycopy}, \texttt{\footnotesize setzero},
etc)
\end{itemize}

\subsection{Super Standard Template Library (SuperSTL)}

The Super Standard Template Library (SuperSTL) is an internal C++
library we use internally in lieu of the normal C++ STL for various
technical and preferential reasons. While the full documentation is
in the comments of \texttt{\small superstl.h} and \texttt{\small superstl.cpp},
the following is a brief list of its features:

\begin{itemize}
\item I/O stream classes familiar from Standard C++, including \texttt{\footnotesize istream}
and \texttt{\footnotesize ostream}. Unique to SuperSTL is how the
comma operator ({}``,'') can be used to separate a list of objects
to send to or from a stream, in addition to the usual C++ insertion
operator ({}``<\,{}<'').
\item To read and write binary data, the \texttt{\small idstream} and \texttt{\small odstream}
classes should be used instead.
\item String buffer (\texttt{\footnotesize stringbuf}) class for composing
strings in memory the same way they would be written to or read from
an \texttt{\footnotesize ostream} or \texttt{\footnotesize istream}.
\item String formatting classes (\texttt{\footnotesize intstring}, \texttt{\footnotesize hexstring},
\texttt{\footnotesize padstring}, \texttt{\footnotesize bitstring},
\texttt{\footnotesize bytemaskstring}, \texttt{\footnotesize floatstring})
provide a wrapper around objects to exercise greater control of how
they are printed.
\item Array (\texttt{\footnotesize array}) template class represents a fixed
size array of objects. It is essentially a simple but very fast wrapper
for a C-style array.
\item Bit vector (\texttt{\footnotesize bitvec}) is a heavily optimized
and rewritten version of the Standard C++ \texttt{\footnotesize bitset}
class. It supports many additional operations well suited to logic
design purposes and emphasizes extremely fast branch free code.
\item Dynamic Array (\texttt{\footnotesize dynarray}) template class provides
for dynamically sized arrays, stacks and other such structures, similar
to the Standard C++ \texttt{\small valarray} class.
\item Linked list node (\texttt{\footnotesize listlink}) template class
forms the basis of double linked list structures in which a single
pointer refers to the head of the list.
\item Queue list node (\texttt{\small queuelink}) template class supports
more operations than \texttt{\small listlink} and can serve as both
a node in a list and a list head/tail header.
\item Index reference (\texttt{\small indexref}) is a smart pointer which
compresses a full pointer into an index into a specific structure
(made unique by the template parameters). This class behaves exactly
like a pointer when referenced, but takes up much less space and may
be faster. The \texttt{\small indexrefnull} class adds support for
storing null pointers, which \texttt{\small indexref} lacks.
\item \texttt{\footnotesize Hashtable} class is a general purpose chaining
based hash table with user configurable key hashing and management
via add-on template classes.
\item \texttt{\footnotesize SelfHashtable} class is an optimized hashtable
for cases where objects contain their own keys. Its use is highly
recommended instead of \texttt{\footnotesize Hashtable}.
\item \texttt{\footnotesize ChunkList} class maintains a linked list of
small data items, but packs many of these items into a chunk, then
chains the chunks together. This is the most cache-friendly way of
maintaining variable length lists.
\item \texttt{\footnotesize CRC32} calculation class is useful for hashing
\item \texttt{\footnotesize CycleTimer} is useful for timing intervals with
sub-nanosecond precision using the CPU cycle counter (discussed in
Section \ref{sec:Timing}).
\end{itemize}

\subsection{Logic Standard Template Library (LogicSTL)}

The Logic Standard Template Library (LogicSTL) is an internally developed
add-on to SuperSTL which supports a variety of structures useful for
modeling sequential logic. Some of its primitives may look familiar
to Verilog or VHDL programmers. While the full documentation is in
the comments of \texttt{\small logic.h}, the following is a brief
list of its features:

\begin{itemize}
\item \texttt{\footnotesize latch} template class works like any other assignable
variable, but the new value only becomes visible after the \texttt{\footnotesize clock()}
method is called (potentially from a global clock chain).
\item \texttt{\footnotesize Queue} template class implements a general purpose
fixed size queue. The queue supports various operations from both
the head and the tail, and is ideal for modeling queues in microprocessors.
\item Iterators for \texttt{\footnotesize Queue} objects such as \texttt{\footnotesize foreach\_forward},
\texttt{\footnotesize foreach\_forward\_from}, \texttt{\footnotesize foreach\_forward\_after},
\texttt{\footnotesize foreach\_backward}, \texttt{\footnotesize foreach\_backward\_from},
\texttt{\footnotesize foreach\_backward\_before}.
\item \texttt{\footnotesize HistoryBuffer} maintains a shift register of
values, which when combined with a hash function is useful for implementing
predictor histories and the like.
\item \texttt{\footnotesize FullyAssociativeTags} template class is a general
purpose array of associative tags in which each tag must be unique.
This class uses highly efficient matching logic and supports pseudo-LRU
eviction, associative invalidation and direct indexing. It forms the
basis for most associative structures in PTLsim.
\item \texttt{\footnotesize FullyAssociativeArray} pairs a \texttt{\footnotesize FullyAssociativeTags}
object with actual data values to form the basis of a cache.
\item \texttt{\footnotesize AssociativeArray}{\footnotesize{} }divides a
\texttt{\footnotesize FullyAssociativeArray} into sets. In effect,
this class can provide a complete cache implementation for a processor.
\item \texttt{\footnotesize LockableFullyAssociativeTags}, \texttt{\footnotesize LockableFullyAssociativeArray}
and \texttt{\footnotesize LockableAssociativeArray} provide the same
services as the classes above, but support locking lines into the
cache.
\item \texttt{\footnotesize CommitRollbackCache} leverages the \texttt{\footnotesize LockableFullyAssociativeArray}
class to provide a cache structure with the ability to roll back all
changes made to memory (not just within this object, but everywhere)
after a checkpoint is made.
\item \texttt{\footnotesize FullyAssociativeTags8bit} and \texttt{\footnotesize FullyAssociativeTags16bit}
work just like \texttt{\footnotesize FullyAssociativeTags}, except
that these classes are dramatically faster when using small 8-bit
and 16-bit tags. This is possible through the clever use of x86 SSE
vector instructions to associatively match and process 16 8-bit tags
or 8 16-bit tags every cycle. In addition, these classes support features
like removing an entry from the middle of the array while compacting
entries around it in constant time. These classes should be used in
place of \texttt{\small FullyAssociativeTags} whenever the tags are
small enough (i.e. almost all tags except for memory addresses).
\item \texttt{\footnotesize FullyAssociativeTagsNbitOneHot} is similar to
\texttt{\footnotesize FullyAssociativeTagsNbit}, but the user must
guarantee that all tags are unique. This property is used to perform
extremely fast matching even with long tags (32+ bits). The tag data
is striped across multiple SSE vectors and matched in parallel, then
a clever adaptation of the sum-of-absolute-differences SSE instruction
is used to extract the single matching element (if any) in O(1) time.
\end{itemize}

\subsection{Miscellaneous Code}

The out of order simulator, ooocore.h, contains several reusable classes,
including:

\begin{itemize}
\item \texttt{\footnotesize IssueQueue} template class can be used to implement
all kinds of broadcast based issue queues
\item \texttt{\footnotesize StateList} and \texttt{\footnotesize ListOfStateLists}
is useful for collecting various lists that objects can be on into
one structure.
\end{itemize}

\chapter{\label{part:x86andUops}x86 Instructions and Micro-Ops (uops)}


\section{\label{sec:UopIntro}Micro-Ops (uops) and TransOps}

PTLsim presents to the target code a full implementation of the x86
and x86-64 instruction set (both 32-bit and 64-bit modes), including
most user and kernel level instructions supported by the Intel Pentium
4 and AMD K8 microprocessors (i.e. all standard instructions, SSE/SSE2,
x86-64 and most of x87 FP). At the present stage of development, the
vast majority of all userspace and 32-bit/64-bit privileged instructions
are supported.

The x86 instruction set is based on the two-operand CISC concept of
load-and-compute and load-compute-store. However, all modern x86 processors
(including PTLsim) do not directly execute complex x86 instructions.
Instead, these processors translate each x86 instruction into a series
of micro-operations (\emph{uops}) very similar to classical load-store
RISC instructions. Uops can be executed very efficiently on an out
of order core, unlike x86 instructions. In PTLsim, uops have three
source registers and one destination register. They may generate a
64-bit result and various x86 status flags, or may be loads, stores
or branches.

The x86 instruction decoding process initially generates translated
uops (\emph{transops}), which have a slightly different structure
than the true uops used in the processor core. Specifically, sources
and destinations are represented as un-renamed architectural registers
(or special temporary register numbers), and a variety of additional
information is attached to each uop only needed during the renaming
and retirement process. TransOps (represented by the \texttt{\small TransOp}
structure) consist of the following:

\begin{itemize}
\item \texttt{\footnotesize som}: Start of Macro-Op. Since x86 instructions
may consist of multiple transops, the first transop in the sequence
has its \texttt{\small som} bit set to indicate this.
\item \texttt{\footnotesize eom}: End of Macro-Op. This bit is set for the
last transop in a given x86 instruction (which may also be the first
uop for single-uop instructions)
\item \texttt{\footnotesize bytes}: Number of bytes in the corresponding
x86 instruction (1-15). The same \texttt{\footnotesize bytes} field
value is present in all uops comprising an x86 instruction.
\item \texttt{\footnotesize opcode}: the uop (not x86) opcode
\item \texttt{\footnotesize size}: the effective operation size (0-3, for
1/2/4/8 bytes)
\item \texttt{\footnotesize cond}\texttt{\small :} the x86 condition code
for branches, selects, sets, etc. For loads and stores, this field
is reused to specify unaligned access information as described later.
\item \texttt{\footnotesize setflags}: subset of the x86 flags set by this
uop (see Section \ref{sub:FlagsManagement})
\item \texttt{\footnotesize internal}: set for certain microcode operations.
For instance, loads and stores marked internal access on-chip registers
or buffers invisible to x86 code (e.g. machine state registers, segmentation
caches, floating point constant tables, etc).
\item \texttt{\footnotesize rd}, \texttt{\footnotesize ra}, \texttt{\footnotesize rb},
\texttt{\footnotesize rc}: the architectural source and destination
registers (see Section \ref{sub:RegisterRenaming})
\item \texttt{\footnotesize extshift}: shift amount (0-3 bits) used for
shifted adds (x86 memory addressing and LEA). The \texttt{\footnotesize rc}
operand is shifted left by this amount.
\item \texttt{\footnotesize cachelevel}: used for prefetching and non-temporal
loads and stores
\item \texttt{\footnotesize rbimm} and \texttt{\footnotesize rcimm}: signed
64-bit immediates for the rb and rc operands. These are selected by
specifying the special constant \texttt{\footnotesize REG\_imm} in
the \texttt{\footnotesize rb} and \texttt{\footnotesize rc} fields,
respectively.
\item \texttt{\footnotesize riptaken}: for branches only, the 64-bit target
RIP of the branch if it were taken.
\item \texttt{\footnotesize ripseq}: for branches only, the 64-bit sequential
RIP of the branch if it were not taken.
\end{itemize}
Appendix \ref{sec:UopReference} describes the semantics and encoding
of all uops supported by the PTLsim processor model. The following
is an overview of the common features of these uops and how they are
used to synthesize specific x86 instructions.


\section{Load-Execute-Store Operations}

Simple integer and floating point operations are fairly straightforward
to decode into loads, stores and ALU operations; a typical load-op-store
ALU operation will consist of a load to fetch one operand, the ALU
operation itself, and a store to write the result. The instruction
set also implements a number of important but complex instructions
with bizarre semantics; typically the translator will synthesize and
inject into the uop stream up to 8 uops for more complex instructions.


\section{\label{sec:OperationSizes}Operation Sizes}

Most x86-64 instructions can operate on 8, 16, 32 or 64 bits of a
given register. For 8-bit and 16-bit operations, only the low 8 or
16 bits of the destination register are actually updated; 32-bit and
64-bit operations are zero extended as with RISC architectures. As
a result, a dependency on the old destination register may be introduced
so merging can be performed. Fortunately, since x86 features destructive
overwrites of the destination register (i.e. the \texttt{\footnotesize rd}
and \texttt{\footnotesize ra} operands are the same), the \texttt{\small ra}
operand is generally already a dependency. Thus, the PTLsim uop encoding
reserves 2 bits to specify the operation size; the low bits of the
new result are automatically merged with the old destination value
(in \texttt{\footnotesize ra}) as part of the ALU logic. This applies
to the \texttt{\small mov} uop as well, allowing operations like {}``\texttt{\footnotesize mov
al,bl}'' in one uop. Loads do not support this mode, so loads into
8-bit and 16-bit registers must be followed by a separate \texttt{\footnotesize mov}
uop to truncate and merge the loaded value into the old destination
properly. Fortunately this is not necessary when the load-execute
form is used with 8-bit and 16-bit operations.

The x86 ISA defines some bizarre byte operations as a carryover from
the ancient 8086 architecture; for instance, it is possible to address
the second byte of many integer registers as a separate register (i.e.
as \texttt{\footnotesize ah}, \texttt{\footnotesize bh}, \texttt{\footnotesize ch},
\texttt{\footnotesize dh}). The \texttt{\footnotesize mask} uop is
used for handling this rare but important set of operations.


\section{\label{sub:FlagsManagement}Flags Management and Register Renaming}

Many x86 arithmetic instructions modify some or all of the processor's
numerous status and condition flag bits, but only 5 are relevant to
normal execution: Zero, Parity, Sign, Overflow, Carry. In accordance
with the well-known {}``ZAPS rule'', any instruction that updates
any of the Z/P/S flags updates all three flags, so in reality only
three flag entities need to be tracked: ZPS, O, F ({}``ZAPS'' also
includes an Auxiliary flag not accessible by most modern user instructions;
it is irrelevant to the discussion below).

The x86 flag update semantics can hamper out of order execution, so
we use a simple and well known solution. The 5 flag bits are attached
to each result and physical register (along with \emph{invalid} and
\emph{waiting} bits used by some cores); these bits are then consumed
along with the actual result value by any consumers that also need
to access the flags. It should be noted that not all uops generate
all the flags as well as a 64-bit result, and some uops only generate
flags and no result data. 

The register renaming mechanism is aware of these semantics, and tracks
the latest x86 instruction in program order to update each set of
flags (ZAPS, C, O); this allows branches and other flag consumers
to directly access the result with the most recent program-ordered
flag updates yet still allows full out of order scheduling. To do
this, x86 processors maintain three separate rename table entries
for the ZAPS, CF, OF flags in addition to the register rename table
entry, any or all of which may be updated when uops are renamed. The
\texttt{\small TransOp} structure for each uop has a 3-bit \texttt{\small setflags}
field filled out during decoding in accordance with x86 semantics;
the \texttt{\small SETFLAG\_ZF}, \texttt{\small SETFLAG\_CF}, \texttt{\small SETFLAG\_OF}
bits in this field are used to determine which of the ZPS, O, F flag
subsets to rename.

As mentioned above, any consumer of the flags needs to consult at
most three distinct sources: the last ZAPS producer, the Carry producer
and the Overflow producer. This conveniently fits into PTLsim's three-operand
uop semantics. Various special uops access the flags associated with
an operand rather than the 64-bit operand data itself. Branches always
take two flag sources, since in x86 this is enough to evaluate any
possible condition code combination (the \texttt{\footnotesize cond\_code\_to\_flag\_regs}
array provides this mapping).

Various ALU instructions consume only the flags part of a source physical
register; these include \texttt{\footnotesize addc} (add with carry),
\texttt{\footnotesize rcl}\texttt{\small /}\texttt{\footnotesize rcr}{\footnotesize{}
}(rotate carry), \texttt{\footnotesize sel.}\texttt{\emph{\footnotesize cc}}
(select for conditional moves) and so on. Finally, the \texttt{\footnotesize collcc}
uop takes three operands (the latest producer of the ZAPS, CF and
OF flags) and merges the flag components of each operand into a single
flag set as its result.

PTLsim also provides compound compare-and-branch uops (\texttt{\footnotesize br.sub.cc}
and \texttt{\footnotesize br.and.cc}); these are currently used mostly
in microcode, but a core could dynamically merge \texttt{\footnotesize CMP}
or \texttt{\footnotesize TEST} and \texttt{\footnotesize Jcc} instructions
into these uops; this is exactly what the Intel Core 2 and a few research
processors already do.


\section{x86-64}

The 64-bit x86-64 instruction set is a fairly straightforward extension
of the 32-bit IA-32 (x86) instruction set. The x86-64 ISA was introduced
by AMD in 2000 with its K8 microarchitecture; the same instructions
were subsequently plagiarized by Intel under a different name ({}``EM64T'')
several years later. In addition to extending all integer registers
and ALU datapaths to 64 bits, x86-64 also provides a total of 16 integer
general purpose registers and 16 SSE (vector floating and fixed point)
registers. It also introduced several 64-bit address space simplifications,
including RIP-relative addressing and corresponding new addressing
modes, and eliminated a number of legacy features from 64-bit mode,
including segmentation, BCD arithmetic, some byte register manipulation,
etc. Limited forms of segmentation are still present to allow thread
local storage and mark code segments as 64-bit. In general, the encoding
of x86-64 and x86 are very similar, with 64-bit mode adding a one
byte REX prefix to specify additional bits for source and destination
register indexes and effective address size. As a result, both variants
can be decoded by similar decoding logic into a common set of uops.


\section{\label{sub:UnalignedLoadsAndStores}Unaligned Loads and Stores}

Compared to RISC architectures, the x86 architecture is infamous for
its relatively widespread use of unaligned memory operations; any
implementation must efficiently handle this scenario. Fortunately,
analysis shows that unaligned accesses are rarely in the performance
intensive parts of a modern program (with the exception of certain
media processing algorithms). Once a given load or store is known
to frequently have an unaligned address, it can be preemptively split
into two aligned loads or stores at decode time. PTLsim does this
by initially causing all unaligned loads and stores to raise an \texttt{\footnotesize UnalignedAccess}
internal exception, forcing a pipeline flush. At this point, the special
\texttt{\footnotesize unaligned} bit is set for the problem load or
store uop in its translated basic block representation. The next time
the offending uop is encountered, it will be split into two parts
very early in the pipeline.

PTLsim includes special uops to handle loads and stores split into
two in this manner. The \texttt{\footnotesize ld.lo} uop rounds down
its effective address $\left\lfloor A\right\rfloor $ to the nearest
64-bit boundary and performs the load. The \texttt{\footnotesize ld.hi}
uop rounds up to $\left\lceil A+8\right\rceil $, performs another
load, then takes as its third rc operand the first (\texttt{\footnotesize ld.lo})
load's result. The two loads are concatenated into a 128-bit word
and the final unaligned data is extracted. Stores are handled in a
similar manner, with \texttt{\footnotesize st.lo} and \texttt{\footnotesize st.hi}
rounding down and up to store parts of the unaligned value in adjacent
64-bit blocks. Depending on the core model, these unaligned load or
store pairs access separate store buffers for each half as if they
were independent.


\section{Repeated String Operations}

The x86 architecture allows for repeated string operations, including
block moves, stores, compares and scans. The iteration count of these
repeated operations depends on a combination of the \texttt{\footnotesize rcx}
register and the flags set by the repeated operation (e.g. compare).
To translate these instructions, PTLsim treats the \texttt{\footnotesize rep
xxx} instruction as a single basic block; any basic block in progress
before the repeat instruction is terminated and the repeat is decoded
as a separate basic block. To handle the unusual case where the repeat
count is zero, a check uop (see below) is inserted at the top of the
loop to protect against this case; PTLsim simply bypasses the offending
block if the check fails.


\section{Checks and SkipBlocks}

PTLsim includes special uops (\texttt{\footnotesize chk.and.cc}, \texttt{\footnotesize chk.sub.cc})
that compare two values or condition codes and cause a special internal
exception if the result is true. The \texttt{\footnotesize SkipBlock}
internal exception generated by these uops tells the core to literally
annul all uops in this instruction, dynamically turning it into a
nop. As described above, this is useful for string operations where
a zero count causes all of the instruction's side effects to be annulled.
Similarly, the \texttt{\footnotesize AssistCheck} internal exception
dynamically turns the instruction into an assist, for those cases
where certain rare conditions may require microcode intervention more
complex than can be inlined into the decoded instruction stream.


\section{\label{sec:ShiftRotateProblems}Shifts and Rotates}

The shift and rotate instructions have some of the most bizarre semantics
in the entire x86 instruction set: they may or may not modify a subset
of the flags depending on the rotation count operand, which we may
not even know until the instruction issues. For fixed shifts and rotates,
these semantics can be preserved by the uops generated, however variable
rotations are more complex. The \texttt{\footnotesize collcc} uop
is put to use here to collect all flags; the collected result is then
fed into the shift or rotate uop as its \texttt{\footnotesize rc}
operand; the uop then replicates the precise x86 behavior (including
rotates using the carry flag) according to its input operands.


\section{SSE Support}

PTLsim provides full support for SSE and SSE2 vector floating point
and fixed point, in both scalar and vector mode. As is done in the
AMD K8 and Pentium 4, each SSE operation on a 128-bit vector is split
into two 64-bit halves; each half (possibly consisting of a 64-bit
load and one or more FPU operations) is scheduled independently. Because
SSE instructions do not set flags like x86 integer instructions, architectural
state management can be restricted to the 16 128-bit SSE registers
(represented as 32 paired 64-bit registers). The \texttt{\footnotesize mxcsr}
(media extensions control and status register) is represented as an
internal register that is only read and written by serializing microcode;
since the exception and status bits are {}``sticky'' (i.e. only
set, never cleared by hardware), this has no effect on out of order
execution. The processor's floating point units can operate in either
64-bit IEEE double precision mode or on two parallel 32-bit single
precision values.

PTLsim also includes a variety of vector integer uops used to construct
SSE2/MMX operations, including packed arithmetic and shuffles.


\section{\label{sub:x87-Floating-Point}x87 Floating Point}

The legacy x87 floating point architecture is the bane of all x86
processor vendors' existence, largely because its stack based nature
makes out of order processing so difficult. While there are certainly
ways of translating stack based instruction sets into flat addressing
for scheduling purposes, we do not do this. Fortunately, following
the Pentium III and AMD Athlon's introduction, x87 is rapidly headed
for planned obsolescence; most major applications released within
the last few years now use SSE instructions for their floating point
needs either exclusively or in all performance critical parts. To
this end, even Intel has relegated x86 support on the Pentium 4 and
Core 2 to a separate low performance legacy unit, and AMD has restricted
x87 use in 64-bit mode. For this reason, PTLsim translates legacy
x87 instructions into a serialized, program ordered and emulated form;
the hardware does not contain any x87-style 80-bit floating point
registers (all floating point hardware is 32-bit and 64-bit IEEE compliant).
We have noticed little to no performance problem from this approach
when examining typical binaries, which rarely if ever still use x87
instructions in compute-intensive code.


\section{Floating Point Unavailable Exceptions}

The x86 architecture specifies a mode in which all floating point
operations (SSE and x87) will trigger a Floating Point Unavailable
exception (\texttt{\footnotesize EXCEPTION\_x86\_fpu\_not\_avail},
vector 0x7) if the \texttt{\footnotesize TS} (task switched) bit in
control register \texttt{\footnotesize CR0} is set. This allows the
kernel to defer saving the floating point registers and state of the
previously scheduled thread until that state is actually modified,
thus speeding up context switches. PTLsim supports this feature by
requiring any commits to the floating point state (SSE XMM registers,
x87 registers or any floating point related control or status registers)
to check the \texttt{\footnotesize uop.is\_sse} and \texttt{\footnotesize uop.is\_x87}
bits in the uop. If either of these is set, the pipeline must be flushed
and redirected into the kernel so it can save the FPU state.


\section{Assists}

Some operations are too complex to inline directly into the uop stream.
To perform these instructions, a special uop (\texttt{\footnotesize brp}:
branch private) is executed to branch to an \emph{assist} function
implemented in microcode. In PTLsim, some assist functions are implemented
as regular C/C++ or assembly language code when they interact with
the rest of the virtual machine. Examples of instructions requiring
assists include system calls, interrupts, some forms of integer division,
handling of rare floating point conditions, CPUID, MSR reads/writes,
various x87 operations, any serializing instructions, etc. These are
listed in the \texttt{\footnotesize ASSIST\_xxx} enum found in \texttt{\footnotesize decode.h}.

Prior to entering an assist, uops are generated to load the \texttt{\footnotesize REG\_selfrip}
and \texttt{\footnotesize REG\_nextrip} internal registers with the
RIP of the instruction itself and the RIP after its last byte, respectively.
This lets the assist microcode correctly update RIP before returning,
or signal a fault on the instruction if needed. Several other assist
related registers, including \texttt{\footnotesize REG\_ar1}, \texttt{\footnotesize REG\_ar2},
\texttt{\footnotesize REG\_ar3}, are used to store parameters passed
to the assist. These registers are not architecturally visible, but
must be renamed and separately maintained by the core as if they were
part of the user-visible state.

While the exact behavior depends on the core model (out of order,
SMT, sequential, etc), generally when the processor fetches an assist
(\texttt{\footnotesize brp} uop), the frontend pipeline is stalled
and execution waits until the \texttt{\footnotesize brp} commits,
at which point an assist function within PTLsim is called. This is
necessary because assists are not subject to the out of order execution
mechanism; they directly update the architectural registers on their
own. In a real processor there are slightly more efficient ways of
doing this without flushing the pipeline, however in PTLsim assists
are sufficiently rare that the performance impact is negligible and
this approach significantly reduces complexity. For the out of order
core, the exact mechanism used is described in Section \ref{sec:PipelineFlushesAndBarriers}.


\chapter{\label{sec:BasicBlockCache}Decoder Architecture and Basic Block
Cache}


\section{Basic Block Cache}

As described in Section \ref{sec:UopIntro}, x86 instructions are
decoded into transops prior to actual execution by the core. To achieve
high performance, PTLsim maintains a \emph{basic block cache} (BB
cache) containing the program ordered translated uop (\emph{transop})
sequence for previously decoded basic blocks in the program. Each
basic block (\texttt{\footnotesize BasicBlock} structure) consists
of up to 64 uops and is terminated by either a control flow operation
(conditional, unconditional, indirect branch) or a barrier operation,
i.e. a microcode assist (including system calls and serializing instructions).


\section{\label{sec:RIPVirtPhys}Identifying Basic Blocks}

In a userspace only simulator, the RIP of a basic block's entry point
(plus a few other attributes described below) serves to uniquely identify
that basic block, and can be used as a key in accessing the basic
block cache. In a full system simulator, the BB cache must be indexed
by much more than just the virtual address, because of potential virtual
page aliasing and the need to persistently cache translations across
context switches. The following fields, in the \emph{RIPVirtPhys}
structure, are required to correctly access the BB cache in any full
system simulator or binary translation system (128 bits total):

\begin{itemize}
\item \texttt{\footnotesize rip:} Virtual address of first instruction in
BB (48 bits), since embedded RIP-relative constants and branch encodings
depend on this. Modern OS's map shared libraries and binaries at the
same addresses every time, so translation caching remains effective
across runs.
\item \texttt{\footnotesize mfnlo:} MFN (Machine Frame Number, i.e. physical
page frame number) of first byte in BB (28 bits), since we need to
handle self modifying code invalidations based on physical addresses
(because of possible virtual page aliasing in multiple page tables)
\item \texttt{\footnotesize mfnhi:} MFN of last byte in BB (28 bits), since
a single x86 basic block can span up to two pages. In pathological
cases, it is possible to create two page tables that both map the
same MFN X at virtual address V, but map different MFNs at virtual
address V+4096. If an instruction crosses this page boundary, the
meaning of the instruction bytes on the second page will be different;
hence we must take into account both physical pages to look up the
correct translation.
\item Context info (up to 24 bits), since the uops generated depend on the
current CPU mode and CS descriptor settings

\begin{itemize}
\item \texttt{\footnotesize use64:} 32-bit or 64-bit mode? (encoding differences)
\item \texttt{\footnotesize kernel:} Kernel or user mode?
\item \texttt{\footnotesize df:} EFLAGS status (direction flag, etc)
\item Other info (e.g. segmentation assumptions, etc.)
\end{itemize}
\end{itemize}
The basic block cache is always indexed using an \texttt{\footnotesize RIPVirtPhys}
structure instead of a simple RIP. To do this, the \texttt{\footnotesize RIPVirtPhys.rip}
field is set to the desired RIP, then \texttt{\footnotesize RIPVirtPhys.update(ctx)}
is called to translate the virtual address onto the two physical page
MFNs it could potentially span (assuming the basic block crosses two
pages).

Notice that the other attribute bits (\texttt{\footnotesize use64},
\texttt{\footnotesize kernel}, \texttt{\footnotesize df}) mean that
two distinct basic blocks may be decoded from the exact same RIP on
the same physical page(s), yet the uops in each translated basic block
will be different because the two basic blocks were translated in
a different context (relative to these attribute bits). This is especially
important for x86 string move/compare/store/load/scan instructions
(\texttt{\footnotesize MOVSB}, \texttt{\footnotesize CMPSB}, \texttt{\footnotesize STOSB},
\texttt{\footnotesize LODSB}, \texttt{\footnotesize SCASB}), since
the correct increment constants depend on the state of the direction
flag in the context in which the BB was used. Similarly, if a user
program tries to decode a supervisor-only opcode, code to call the
general protection fault handler will be produced instead of the real
uops produced only in kernel mode.


\section{\label{sec:InvalidTranslations}Invalid Translations}

The \texttt{\footnotesize BasicBlockCache.translate(ctx, rvp)} function
\emph{always} returns a \texttt{\footnotesize BasicBlock} object,
even if the specified RIP was on an invalid page or some of the instruction
bytes were invalid. When decoding cannot continue for some reason,
the decoder simply outputs a microcode branch to one of the following
assists:

\begin{itemize}
\item \texttt{\footnotesize ASSIST\_INVALID\_OPCODE} when the opcode or
instruction operands are invalid relative to the current context.
\item \texttt{\footnotesize ASSIST\_EXEC\_PAGE\_FAULT} when the specified
RIP falls on an invalid page. This means a page is marked as not present
in the current page table at the time of decoding, or the page is
present but has its NX (no execute) bit set in the page table entry.
The \texttt{\footnotesize EXEC\_PAGE\_FAULT} assist is also generated
when the page containing the RIP itself is valid, but part of an instruction
extends beyond that page onto an invalid page. The decoder tries to
decode as many instruction bytes as possible, but will insert an \texttt{\footnotesize EXEC\_PAGE\_FAULT}
assist whenever it determines, based on the bytes already decoded,
that the remainder of the instruction would fall on the invalid page.
\item \texttt{\footnotesize ASSIST\_GP\_FAULT} when attempting to decode
a restricted kernel-only opcode while running in user mode.
\end{itemize}
Before redirecting execution to the kernel's exception handler, the
\texttt{\footnotesize EXEC\_PAGE\_FAULT} microcode verifies that the
page in question is still invalid. This avoids a spurious page fault
in the case where an instruction was originally decoded on an invalid
page, but the page tables were updated after the translation was first
made such that the page is now valid. When this is the case, all bogus
basic blocks on the page (which were decoded into a call to \texttt{\footnotesize EXEC\_PAGE\_FAULT})
must be invalidated, allowing a correct translation to be made now
that the page is valid. The page at the virtual address after the
page in question may also need to be invalidated in the case where
some instruction bytes cross the page boundary.


\section{\label{sec:SelfModifyingCode}Self Modifying Code}

In x86 processors, the translation process is considerably more complex,
because of self modifying code (SMC) and its variants. Specifically,
the instruction bytes of basic blocks that have already been translated
and cached may be overwritten; these old translations must be discarded.
The x86 architecture guarantees that all code modifications will be
visible immediately after the instruction making the modification;
unlike other architectures, no {}``instruction cache flush'' operation
is provided. Several kinds of SMC must be handled correctly:

\begin{itemize}
\item Classical SMC: stores currently in the pipeline overwrite other instructions
that have already been fetched into the pipeline and even speculatively
executed out of order;
\item Indirect SMC: stores write to a page on which previously translated
code used to reside, but that page is now being reused for unrelated
data or new code. This case frequently arises in operating system
kernels when pages are swapped in and out from disk.
\item Cross-modifying SMC: in a multiprocessor system, one processor overwrites
instructions that are currently in the pipeline on some other core.
The x86 standard is ambiguous here; technically no pipeline flush
and invalidate is required; instead, the cache coherence mechanism
and software mutexes are expected to prevent this case.
\item External SMC: an external device uses direct memory access (DMA) to
overwrite the physical DRAM page containing previously translated
code. In theory, this can happen while the affected instructions are
in the pipeline, but in practice no operating system would ever allow
this. However, we still must invalidate any translations on the target
page to prevent them from being looked up far in the future.
\end{itemize}
To deal with all these forms of SMC, PTLsim associates a {}``dirty''
bit with every physical page (this is unrelated to the {}``dirty''
bit in user-visible page table entries). Whenever the first uop in
an x86 instruction (i.e. the {}``SOM'', start-of-macro-op uop) commits,
the current context is used to translate its RIP into the physical
page MFN on which it resides, as described in Section \ref{sec:RIPVirtPhys}.
If the instruction's length in bytes causes it to overlap onto a second
page, that high MFN is also looked up (using the virtual address \emph{rip}
+ 4096). If the dirty bits for either the low or high MFN are set,
this means the instruction bytes may have been modified sometime after
the time they were last translated and added to the basic block cache.
In this case, the pipeline must be flushed, and all basic blocks on
the target MFN (and possibly the overlapping high MFN) must be invalidated
before clearing the dirty bit. Technically the RIP-to-physical translation
would be done in the instruction fetch stage in most core models,
then simply stored as an \texttt{\footnotesize RIPVirtPhys} structure
inside the uop until commit time.

The dirty bit can be set by several events. Obviously any store uops
will set the dirty bit (thus handling the classical, indirect and
cross-modifying cases), but notice that this bit is not checked again
until the first uop in the \emph{next} x86 instruction. This behavior
is required because it is perfectly legal for an x86 store to overwrite
its own instruction bytes, but this does not become visible until
the same instruction executes a second time (otherwise an infinite
loop of invalidations would occur). Microcoded x86 instructions implemented
by PTLsim itself set dirty bits when their constituent internal stores
commit. Finally, DMA transfers and external writes also set the dirty
bit of any pages touched by the DMA operation.

The dirty bit is only cleared when all translated basic blocks are
invalidated on a given page, and it remains clear until the first
write to that page. However, no action is taken when additional basic
blocks are decoded from a page already marked as dirty. This may seem
counterintuitive, but it is necessary to avoid deadlock: if the page
were invalidated and retranslated at fetch time, future stages in
a long pipeline could potentially still have references to unrelated
basic blocks on the page being invalidated. Hence, all invalidations
are checked and processed only at commit time.

Other binary translation based software and hardware \cite{TransmetaPatent.TBit,VMware,QEMU,Simics,SimNow}
have special mechanisms for write protecting physical pages, such
that when a page with translations is first written by stores or DMA,
the system immediately invalidates all translations on that page.
Unfortunately, this scheme has a number of disadvantages. First, patents
cover its implementation \cite{TransmetaPatent.SelfRevalTrans,TransmetaPatent.SubPageTBit,TransmetaPatent.TBit},
which we would like to avoid. In addition, our design eliminates forced
invalidations when the kernel frees up a page containing code that's
immediately overwritten with normal user data (a very common pattern
according to our studies). If that page is never executed again, any
translations from it will be discarded in the background by the LRU
mechanism, rather than interrupting execution to invalidate translations
that will never be used again anyway. Fortunately, true classical
SMC is very rare in modern x86 code, in large part because major microprocessors
have slapped a huge penalty on its use (particularly in the case of
the Pentium 4 and Transmeta processors, both of which store translated
uops in a cache similar to PTLsim's basic block cache).


\section{\label{sec:BasicBlockReclaim}Memory Management of the Basic Block
Cache}

The PTLsim memory manager (in \texttt{\footnotesize mm.cpp}, see Section
\ref{sec:MemoryManager} for details) implements a reclaim mechanism
in which other subsystems register functions that get called when
an allocation fails. The basic block cache registers a callback, \texttt{\footnotesize bbcache\_reclaim()}
and \texttt{\footnotesize BasicBlockCache::reclaim()}, to invalidate
and free basic blocks when PTLsim runs out of memory.

The algorithm used to do this is a pseudo-LRU design. Every basic
block has a \texttt{\footnotesize lastused} field that gets updated
with the current cycle number whenever \texttt{\footnotesize BasicBlock::use(sim\_cycle)}
is called (for instance, in the fetch stage of a core model). The
reclaim algorithm goes through all basic blocks and calculates the
oldest, average and newest \texttt{\footnotesize lastused} cycles.
The second pass then invalidates any basic blocks that fall below
this average cycle; typically around half of all basic blocks fall
in the least recently used category. This strategy has proven very
effective in freeing up a large amount of space without discarding
currently hot basic blocks.

Each basic block also has a reference counter, \texttt{\footnotesize refcount},
to record how many pointers or references to that basic block currently
exist anywhere inside PTLsim (especially in the pipelines of core
models). The \texttt{\footnotesize BasicBlock::acquire()} and \texttt{\footnotesize release()}
methods adjust this counter. Core models should acquire a basic block
once for every uop in the pipeline within that basic block; the basic
block is released as uops commit or are annulled. Since basic blocks
may be speculatively translated in the fetch stage of core models,
this guarantees that live basic blocks currently in flight are never
freed until they actually leave the pipeline.


\chapter{PTLsim Support Subsystems}


\section{\label{sec:UopImplementations}Uop Implementations}

PTLsim provides implementations for all uops in the \texttt{\footnotesize uopimpl.cpp}
file. C++ templates are combined with gcc's smart inline assembler
type selection constraints to translate all possible permutations
(sizes, condition codes, etc) of each uop into highly optimized code.
In many cases, a real x86 instruction is used at the core of each
corresponding uop's implementation; code after the instruction just
captures the generated x86 condition code flags, rather than having
to manually emulate the same condition codes ourselves. The code implementing
each uop is then called from elsewhere in the simulator whenever that
uop must be executed. Note that loads and stores are implemented elsewhere,
since they are too dependent on the specific core model to be expressed
in this generic manner.

An additional optimization, called \emph{synthesis}, is also used
whenever basic blocks are translated. Each uop in the basic block
is mapped to the address of a native PTLsim function in \texttt{\footnotesize uopimpl.cpp}
implementing the semantics of that uop; this function pointer is stored
in the \texttt{\footnotesize synthops{[}]} array of the \texttt{\footnotesize BasicBlock}
structure. This saves us from having to use a large jump table later
on, and can map uops to pre-compiled templates that avoid nearly all
further decoding of the uop during execution.


\section{Configuration Parser}

PTLsim supports a wide array of command line or scriptable configuration
options, described in Section \ref{sec:ConfigurationOptions}. The
configuration parser engine (used by both PTLsim itself and utilities
like PTLstats) is in \texttt{\footnotesize config.cpp} and \texttt{\footnotesize config.h}.
For PTLsim itself, each option is declared in three places:

\begin{itemize}
\item \texttt{\footnotesize ptlsim.h} declares the \texttt{\footnotesize PTLsimConfig}
structure, which is available from anywhere as the \texttt{\footnotesize config}
global variable. The fields in this structure must be of one of the
following types: \texttt{\footnotesize W64} (64-bit integer), \texttt{\footnotesize double}
(floating point), \texttt{\footnotesize bool} (on/off boolean), or
\texttt{\footnotesize stringbuf} (for text parameters).
\item \texttt{\footnotesize ptlsim.cpp} declares the \texttt{\footnotesize PTLsimConfig::reset()}
function, which sets each option to its default value.
\item \texttt{\footnotesize ptlsim.cpp} declares the \texttt{\footnotesize ConfigurationParser<PTLsimConfig>::setup()}
template function, which registers all options with the configuration
parser.
\end{itemize}

\section{\label{sec:MemoryManager}Memory Manager}


\subsection{Memory Pools}

PTLsim uses its own custom memory manager for all allocations, given
its specialized constraints (particularly for PTLsim/X, which runs
on the bare hardware). The PTLsim memory manager (in \texttt{\footnotesize mm.cpp})
uses three key structures.

The \emph{page allocator} allocates spans of one or more virtually
contiguous pages. In userspace-only PTLsim, the page allocator doesn't
really exist: it simply calls \texttt{\footnotesize mmap()} and \texttt{\footnotesize munmap()},
letting the host kernel do the actual allocation. In the full system
PTLsim/X, the page allocator actually works with physical pages and
is based on the extent allocator (see below). The \texttt{\footnotesize ptl\_alloc\_private\_pages()}
and \texttt{\footnotesize ptl\_free\_private\_pages()} functions should
be used to directly allocate page-aligned memory (or individual pages)
from this pool.

The \emph{general allocator} uses the \texttt{\footnotesize ExtentAllocator}
template class to allocate large objects (greater than page sized)
from a pool of free extents. This allocator automatically merges free
extents and can find a matching free block in O(1) time for any allocation
size. The general allocator obtains large chunks of memory (typically
64 KB at once) from the page allocator, then sub-divides these extents
into individual allocations.

The \emph{slab allocator} maintains a pool of page-sized {}``slabs''
from which fixed size objects are allocated. Each page only contains
objects of one size; a separate slab allocator handles each size from
16 bytes up to 1024 bytes, in 16-byte increments. The allocator provides
extremely fast allocation performance for object oriented programs
in which many objects of a given size are allocated. The slab allocator
also allocates one page at a time from the global page allocator.
However, it maintains a pool of empty pages to quickly satisfy requests.
This is the same architecture used by the Linux kernel to satisfy
memory requests.

The \texttt{\footnotesize ptl\_mm\_alloc()} function intelligently
decides from which of the two allocators (general or slab) to allocate
a given sized object, based on the size in bytes, object type and
caller. The standard \texttt{\footnotesize new} operator\texttt{\footnotesize{}
and malloc()} both use this function. Similarly, the \texttt{\footnotesize ptl\_mm\_free()}
function frees memory. PTLsim uses a special bitmap to track which
pages are slab allocator pages; if a pointer falls within a slab,
the slab deallocator is used; otherwise the general allocator is used
to free the extent.


\subsection{Garbage Collection and Reclaim Mechanism}

The memory manager implements a garbage collection mechanism with
which other subsystems register reclaim functions that get called
when an allocation fails. The \texttt{\footnotesize ptl\_mm\_register\_reclaim\_handler()}
function serves this role. Whenever an allocation fails, the reclaim
handlers are called in sequence, followed by an extent cleanup pass,
before retrying the allocation. This process repeats until the allocation
succeeds or an abort threshold is reached.

The reclaim function gets passed two parameters: the size in bytes
of the failed allocation, and an \emph{urgency} parameter. If \emph{urgency}
is 0, the subsystem registering the callback should do everything
in its power to free all memory it owns. Otherwise, the subsystem
should progressively trim more and more unused memory with each call
(and increasing urgency). \emph{Under no circumstances} is a reclaim
handler allowed to allocate \emph{any} additional memory! Doing so
will create an infinite loop; the memory manager will detect this
and shut down PTLsim if it is attempted.


\chapter{\label{sec:StatisticsInfrastructure}Statistics Collection and Analysis}


\section{PTLsim Statistics Data Store}


\subsection{Introduction}

PTLsim maintains a huge number of statistical counters and data points
during the simulation process, and can optionally save this data to
a file by using the {}``\texttt{\footnotesize -stats} \emph{filename}''
configuration option. The data store is a binary file format used
to efficiently capture large quantities of statistical information
for later analysis. This file format supports storing multiple regular
or triggered snapshots of all counters. Snapshots can be subtracted,
averaged and extensively manipulated, as will be described later on.

PTLsim makes it trivial to add new performance counters to the statistics
data tree. All counters are defined in \texttt{\footnotesize stats.h}
as a tree of nested structures; the top-level \texttt{\footnotesize PTLsimStats}
structure is mapped to the global variable \texttt{\footnotesize stats},
so counters can be directly updated from within the code by simple
increments, e.g. \texttt{\footnotesize stats.xxx.yyy.zzz.countername++}.
Every node in the tree can be either a \texttt{\footnotesize struct},
\texttt{\footnotesize W64}{\footnotesize{} }(64-bit integer), \texttt{\footnotesize double}
(floating point) or \texttt{\footnotesize char} (string) type; arrays
of these types are also supported. In addition, various attributes,
described below, can be attached to each node or counter to specify
more complex semantics, including histograms, labeled arrays, summable
nodes and so on.

PTLsim comes with a special script, \texttt{\footnotesize dstbuild}
({}``data store template builder'') that parses \texttt{\footnotesize stats.h}
and constructs a binary representation (a {}``template'') describing
the structure; this template data is then compiled into PTLsim. Every
time PTLsim creates a statistics file, it first writes this template,
followed by the raw \texttt{\footnotesize PTLsimStats} records and
an index of those records by name. In this way, the complete data
store tree can be reconstructed at a later time even if the original
\texttt{\footnotesize stats.h} or PTLsim version that created the
file is unavailable. This scheme is analogous to the separation of
XML schemas (the template) from the actual XML data (the stats records),
but in our case the template and data is stored in binary format for
efficient parsing.

We suggest using the data store mechanism to store \emph{all} statistics
generated by your additions to PTLsim, since this system has built-in
support for snapshots, checkpointing and structured easy to parse
data (unlike simply writing values to a text file). It is further
suggested that only raw values be saved, rather than doing computations
in the simulator itself - leave the analysis to PTLstats after gathering
the raw data. If some limited computations do need to be done before
writing each statistics record, PTLsim will call the PTLsimMachine::update\_stats()
virtual method to allow your model a chance to do so before writing
the counters.


\subsection{\label{sec:StatisticsNodeAttributes}Node Attributes}

After each node or counter is declared, one of several special C++-style
{}``//'' comments can be used to specify \emph{attributes} for that
node:

\begin{itemize}
\item \texttt{\small struct Name \{ // rootnode:}{\small \par}


The node is at the root of the statistics tree (typically this only
applies to the PTLsimStats structure itself)

\item \texttt{\small struct Name \{ // node: summable}{\small \par}


All subnodes and counters under this node are assumed to total 100\%
of whatever quantity is being measured. This attribute tells PTLstats
to print percentages next to the raw values in this subtree for easier
viewing.

\item \texttt{\small W64 name{[}arraysize]; // histo:}{\small{} }\texttt{\emph{\small min,}}{\small{}
}\texttt{\emph{\small max,}}{\small{} }\texttt{\emph{\small stride}}{\small \par}


Specifies that the array of counters forms a \emph{histogram}, i.e.
each slot in the array represents the number of occurrences of one
event out of a mutually exclusive set of events. The \emph{min} parameter
specifies the meaning of the first slot (array element 0), while the
\emph{max} parameter specifies the meaning of the last slot (array
element \emph{arraysize}-1). The \emph{stride} parameter specifies
how many events are counted into every slot (typically this is 1).

For example, let's say you want to measure the frequency distribution
of the number of consumers of each instruction's result, where the
maximum number of possible consumers is 256. You could specify this
as:

\begin{quote}
\texttt{\small W64 consumers{[}64+1]; // histo: 0, 256, 4}{\small \par}
\end{quote}
This histogram has a logical range of 0 to 256, but is divided into
65 slots. Because the \emph{stride} parameter is 4, any consumer counts
from 0 to 3 increment slot 0, counts from 4 to 7 increment slot 1,
and so on. When you update this counter array from inside the model,
you should do so as follows:

\begin{quote}
\texttt{\small stats.xxx.yyy.consumers{[}min(n / 4, 64)]++;}{\small \par}
\end{quote}
\item \texttt{\small W64 name{[}arraysize]; // label: namearray}{\small \par}


Specifies that the array of counters is a histogram of named, mutually
exclusive events, rather than simply raw numbers (as with the \emph{histo}
attribute). The \emph{namearray} must be the name of an array of \emph{arraysize}
strings, with one entry per event.

For example, let's say you want to measure the frequency distribution
of uop types PTLsim is executing. If there are OPCLASS\_COUNT, you
could declare the following:

\begin{quote}
\texttt{\small W64 opclass{[}OPCLASS\_COUNT]; // label: opclass\_names}{\small \par}
\end{quote}
In some header file included by \texttt{\small stats.h}, you need
to declare the actual array of slot labels:

\begin{quote}
\texttt{\small static const char{*} opclass\_names{[}OPCLASS\_COUNT]
= \{''logic'', ''addsub'', ''addsubc'', ...\};}{\small \par}
\end{quote}
\end{itemize}

\subsection{\label{sec:StatisticsOptions}Configuration Options}

PTLsim supports several options related to the statistics data store:

\begin{itemize}
\item \texttt{\small -stats}{\small{} }\texttt{\emph{\small filename}}{\small \par}


Specify the filename to which statistics data is written. In reality,
two files are created: \emph{filename} contains the template and snapshot
index, while \emph{filename.data} contains the raw data.

\item \texttt{\small -snapshot-cycles}{\small{} }\texttt{\emph{\small N}}{\small \par}


Creates a snapshot every N simulation cycles, numbered consecutively
starting from 0. Without this option, only one snapshot, named \texttt{\small final},
is created at the end of the simulation run.

\item \texttt{\small -snapshot-now}{\small{} }\texttt{\emph{\small name}}{\small \par}


Creates a snapshot named \emph{name} at the current point in the simulation.
This can be used to asynchronously take a look at a simulation in
progress. \emph{This option is only available in PTLsim/X.}

\end{itemize}

\section{PTLstats: Statistics Analysis and Graphing Tools}

The \textbf{\emph{PTLstats}} program is used to analyze the statistics
data store files produced by PTLsim. PTLstats will first extract the
template stored in all data store files, and will then parse the statistics
records into a flexible tree format that can be manipulated by the
user. The following is an example of one node in the statistics tree,
as printed by PTLstats:

\begin{lyxcode}
{\small dcache~\{}{\small \par}

{\small{}~~store~\{}{\small \par}

{\small{}~~~~issue~(total~68161716)~\{}{\small \par}

{\small{}~~~~{[}~29.7\%~]~replay~(total~20218780)~\{}{\small \par}

{\small{}~~~~{[}~~0.0\%~]~sfr\_addr\_not\_ready~=~0;}{\small \par}

{\small{}~~~~{[}~16.8\%~]~sfr\_data\_and\_data\_to\_store\_not\_ready~=~3405878;}{\small \par}

{\small{}~~~~{[}~11.8\%~]~sfr\_data\_not\_ready~=~2379338;}{\small \par}

{\small{}~~~~{[}~23.4\%~]~sfr\_addr\_and\_data\_to\_store\_not\_ready~=~4740838;}{\small \par}

{\small{}~~~~{[}~24.5\%~]~sfr\_addr\_and\_data\_not\_ready~=~4951888;}{\small \par}

{\small{}~~~~{[}~23.4\%~]~sfr\_addr\_and\_data\_and\_data\_to\_store\_not\_ready~=~4740838;}{\small \par}

{\small{}~~\}}{\small \par}

{\small{}~~{[}~~0.0\%~]~exception~=~30429;}{\small \par}

{\small{}~~{[}~~7.9\%~]~ordering~=~5404592;}{\small \par}

{\small{}~~{[}~62.4\%~]~complete~=~42507854;}{\small \par}

{\small{}~~{[}~~0.0\%~]~unaligned~=~61;}{\small \par}

{\small \}}{\small \par}
\end{lyxcode}
Notice how PTLstats will automatically sum up all entries in certain
branches of the tree to provide the user with a breakdown by percentages
of the total for that subtree in addition to the raw values. This
is achieved using the {}``\texttt{\small // node: summable}'' attribute
as described in Section \ref{sec:StatisticsNodeAttributes}.

Here is an example of a labeled histogram, produced using the {}``\texttt{\small //
label: xxx}'' attribute described in Section \ref{sec:StatisticsNodeAttributes}:

\begin{lyxcode}
{\small size{[}4]~=~\{}{\small \par}

{\small{}~~ValRange:~3209623~90432573}{\small \par}

{\small{}~~Total:~~~107190122}{\small \par}

{\small{}~~Thresh:~~~~~10720}{\small \par}

{\small{}~~{[}~~6.2\%~]~~~~~~~~0~~6686971~1~(byte)}{\small \par}

{\small{}~~{[}~~6.4\%~]~~~~~~~~1~~6860955~2~(word)}{\small \par}

{\small{}~~{[}~84.4\%~]~~~~~~~~2~90432573~4~(dword)}{\small \par}

{\small{}~~{[}~~3.0\%~]~~~~~~~~3~~3209623~8~(qword)}{\small \par}

{\small \};}{\small \par}
\end{lyxcode}

\section{Snapshot Selection}

The basic syntax of the PTLstats command is {}``\texttt{\small ptlstats
-}\emph{options} \emph{filename}''. If no options are specified,
PTLstats prints out the entire statistics tree from its root, relative
to the \texttt{\small final} snapshot.

To select a specific snapshot, use the following option:

\begin{lyxcode}
{\small ptlstats~}\textbf{\small -snapshot}{\small{}~}\emph{\small name-or-number}{\small{}~...}{\small \par}
\end{lyxcode}
Snapshots may be specified by name or number.

It may be desirable to examine the difference in statistics \emph{between}
two snapshots, for instance to subtract out the counters at the starting
point of a given run or after a warmup period. The \texttt{\small -subtract}
option provides this facility, for example:

\begin{lyxcode}
{\small ptlstats~-snapshot~}\emph{\small final}{\small{}~}\textbf{\small -subtract}{\small{}~}\emph{\small startpoint}{\small{}~...}{\small \par}
\end{lyxcode}

\section{Working with Statistics Trees: Collection, Averaging and Summing}

To select a specific subtree of interest, use the syntax of the following
example:

\begin{lyxcode}
{\small ptlstats~}\textbf{\small -snapshot}{\small{}~final~}\textbf{\small -collect}{\small{}~/ooocore/dcache/load~example1.stats~example2.stats~...}{\small \par}
\end{lyxcode}
This will print out the subtree \texttt{\small /ooocore/dcache/load}
in the snapshot named \texttt{\small final} (the default snapshot)
for each of the named statistics files \texttt{\small example1.stats},
\texttt{\small example2.stats} and so on. Multiple files are generally
used to examine a specific subnode across several benchmarks.

Subtrees or individual statistics can also be summed and averaged
across many files, using the \texttt{\textbf{\small -collectsum}}
or \texttt{\textbf{\small -collectaverage}} commands in place of \texttt{\small -collect}.


\section{Traversal and Printing Options}

The \texttt{\textbf{\small -maxdepth}} option is useful for limiting
the depth (in nodes) PTLstats will descend into the specified subtree.
This is appropriate when you want to summarize certain classes of
statistics printed as percentages of the whole, yet don't want a breakdown
of every sub-statistic.

The \texttt{\textbf{\small -percent-of-toplevel}} option changes the
way percentages are displayed. By default, percentages are calculated
by dividing the total value of each node by the total of its immediate
parent node. When \texttt{\small -percent-of-toplevel} is enabled,
the divisor becomes the total of the entire subtree, possibly going
back several levels (i.e. back to the highest level node marked with
the \emph{summable} attribute), rather than each node's immediate
parent.


\section{Table Generation}

PTLstats provides a facility to easily generate R-row by C-column
data tables from a set of R benchmarks run with C different sets of
parameters. Tables can be output in a variety of formats, including
plain text with tab or space delimiters (suitable for import into
a spreadsheet), \LaTeX{} (for direct insertion into research reports)
or HTML. To generate a table, use the following syntax:

\begin{lyxcode}
{\small ptlstats~}\textbf{\small -table}{\small{}~/final/summary/cycles~-rows~gzip,gcc,perlbmk,mesa~-cols~small,large,huge~-table-pattern~\char`\"{}\%row/ptlsim.stats.\%col\char`\"{}}{\small \par}
\end{lyxcode}
In this example, the benchmarks ({}``gzip'', {}``gcc'', {}``perlbmk'',
{}``mesa'') will form the rows of the table, while three trials
done for each benchmark ({}``small'', {}``large'', {}``huge'')
will be listed in the columns. The row and column names will be combined
using the pattern {}``\texttt{\small \%row/ptlsim.stats.\%col}{}``
to generate statistics data store filenames like {}``\texttt{\small gzip/ptlsim.stats.small}''.
PTLstats will then load the data store for each benchmark and trial
combination to create the table.

Notice that you must create your own scripts, or manually run each
benchmark and trial with the desired PTLsim options, plus {}``\texttt{\small -stats
ptlsim.stats.}\texttt{\emph{\small trialname}}''. PTLstats will only
report these results in table form; it will not actually run any benchmarks.

The \texttt{\textbf{\small -tabletype}} option specifies the data
format of the table: {}``\texttt{\small text}'' (plain text with
space delimiters, suitable for import into a spreadsheet), {}``\texttt{\small latex}''
(\LaTeX{} format, useful for directly inserting into research reports),
or {}``\texttt{\small html}'' (HTML format for web pages).

The {}``\texttt{\textbf{\small -scale-relative-to-col}}{\small{} }\texttt{\emph{\small N}}''
option forces PTLstats to compute the percentage of increase or decrease
for each cell relative to the corresponding row in some other reference
column \emph{N}. This is useful when running a {}``baseline'' case,
to be displayed as a raw value (usually the cycle count, \texttt{\small /final/summary/cycles})
in column 0, while all other experimental cases are displayed as a
percentage increase (fewer cycles, for a positive percentage) or percentage
decrease (negative value) relative to this first column (\emph{N}
= 0).


\subsection{Bargraph Generation}

In addition to creating tables, PTLstats can directly create colorful
graphs (in Scalable Vector Graphics (SVG) format) from a set of benchmarks
(specified by the \texttt{\small -rows} option) and trials of each
benchmark (specified by the \texttt{\small -cols} option). For instance,
to plot the total number of cycles taken over a set of benchmarks,
each run under different PTLsim configurations, use the following
example:

\begin{lyxcode}
{\small ptlstats~}\textbf{\small -bargraph}{\small{}~/final/summary/cycles~-rows~gzip,gcc,perlbmk,mesa~-cols~small,large,huge~-table-pattern~\char`\"{}\%row/ptlsim.stats.\%col\char`\"{}}{\small \par}
\end{lyxcode}
In this case, groups of three bars (for the trials {}``small'',
{}``large'', {}``huge'') appear for each benchmark.

The graph's layout can be extensively customized using the options
\texttt{\small -title}, \texttt{\small -width}, \texttt{\small -height}.

Inkscape (http://www.inkscape.org) is an excellent vector graphics
system for editing and formatting SVG files generated by PTLstats.


\section{Histogram Generation}

Certain array nodes in the statistics tree can be tagged as {}``histogram''
nodes by using the \texttt{\small histo:} or \texttt{\small label:}
attributes, as described in Section \ref{sec:StatisticsNodeAttributes}.
For instance, the \texttt{\small ooocore/frontend/consumer-count}
node in the out-of-order core is a histogram node. PTLstats can directly
create graphs (in Scalable Vector Graphics (SVG) format) for these
special nodes, using the \texttt{\textbf{\small -histogram}} option:

\begin{lyxcode}
{\small ptlstats~}\textbf{\small -histogram}{\small{}~/ooocore/frontend/consumer-count~>~example.svg}{\small \par}
\end{lyxcode}
The histogram's layout can be extensively customized using the options
\texttt{\small -title}, \texttt{\small -width}, \texttt{\small -height}.
In addition, the \texttt{\small -percentile} option is useful for
controlling the displayed data range by excluding data under the Nth
percentile. The \texttt{\small -logscale} and \texttt{\small -logk}
options can be used to apply a log scale (instead of a linear scale)
to the histogram bars. The syntax of these options can be obtained
by running \texttt{\small ptlstats} without arguments.


\chapter{Benchmarking Techniques}


\section{\label{sec:TriggerMode}Trigger Mode and other PTLsim Calls From
User Code}

PTLsim optionally allows user code to control the simulator mode through
the \texttt{\small ptlcall\_xxx()} family of functions found in \texttt{\small ptlcalls.h}
when trigger mode is enabled (\texttt{\small -trigger} configuration
option). This file should be included by any PTLsim-aware user programs;
these programs must be recompiled to take advantage of these features.
Amongst the functions provided by \texttt{\small ptlcalls.h} are:

\begin{itemize}
\item \texttt{\small ptlcall\_switch\_to\_sim()} is only available while
the program is executing in native mode. It forces PTLsim to regain
control and begin simulating instructions as soon as this call returns.
\item \texttt{\small ptlcall\_switch\_to\_native()} stops simulation and
returns to native execution, effectively removing PTLsim from the
loop.
\item \texttt{\small ptlcall\_marker()} simply places a user-specified marker
number in the PTLsim log file
\item \texttt{\small ptlcall\_capture\_stats()} adds a new statistics data
store snapshot at the time it is called. You can pass a string to
this function to name your snapshot, but all names must be unique.
\item \texttt{\small ptlcall\_nop()} does nothing but test the call mechanism.
\end{itemize}
In userspace PTLsim, these calls work by forcing execution to code
on a {}``gateway page'' at a specific fixed address (\texttt{\small 0x1000}
currently); PTLsim will write the appropriate call gate code to this
page depending on whether the process is in native or simulated mode.
In native mode, the call gate page typically contains a 64-to-64-bit
or 32-to-64-bit far jump into PTLsim, while in simulated mode it contains
a reserved x86 opcode interpreted by the x86 decoder as a special
kind of system call. If PTLsim is built on a 32-bit only system, no
mode switch is required.

In full system PTLsim/X, the x86 opcodes used to implement these calls
are directly handled by the PTLsim/X hypervisor as if they were actually
part of the native x86 instruction set.

Generally these calls are used to perform {}``intelligent benchmarking'':
the \texttt{\small ptlcall\_switch\_to\_sim()} call is made at the
top of the main loop of a benchmark after initialization, while the
\texttt{\small ptlcall\_switch\_to\_native()} call is inserted after
some number of iterations to stop simulation after a representative
subset of the code has completed. This intelligent approach is far
better than the blind {}``sample for N million cycles after S million
startup cycles'' approach used by most researchers.

Fortran programs will have to actually link in the \texttt{\small ptlcalls.o}
object file, since they cannot include C header files. The function
names that should be used in the Fortran code remain the same as those
from the \texttt{\small ptlcalls.h} header file.


\section{\label{sec:IPCNotes}Notes on Benchmarking Methodology and {}``IPC''}

The x86 instruction set requires some different benchmarking techniques
than classical RISC ISAs. In particular, \textbf{uIPC (Micro-Instructions
per Cycle) a NOT a good measure of performance for an x86 processor.}
Because one x86 instruction may be broken up into numerous uops, it
is never appropriate to compare IPC figures for committed x86 instructions
per clock with IPC values from a RISC machine. Furthermore, different
x86 implementations use varying numbers of uops per x86 instruction
as a matter of encoding, so even comparing the uop based IPC between
x86 implementations or RISC-like machines is inaccurate.

Users are strongly advised to use relative performance measures instead.
Comparing the total simulated cycle count required to complete a given
benchmark between different simulator configurations is much more
appropriate than IPC with the x86 instruction set. An example would
be {}``the baseline took 100M cycles, while our improved system took
50M cycles, for a 2x improvement''.


\section{\label{sec:SimulationWarmupPeriods}Simulation Warmup Periods}

In some simulators, it is possible to quickly skip through a specific
number of instructions before starting to gather statistics, to avoid
including initialization code in the statistics. In PTLsim, this is
neither necessary nor desirable. Because PTLsim directly executes
your program on the host CPU until it switches to cycle accurate simulation
mode, there is no way to count instructions in this manner. 

Many researchers have gotten in the habit of blindly skipping a large
number of instructions in benchmarks to avoid profiling initialization
code. However, this is not a very intelligent policy: different benchmarks
have different startup times until the top of the main loop is reached,
and it is generally evident from the benchmark source code where that
point should be. Therefore, PTLsim supports \textbf{trigger points:}
by inserting a special function call (\texttt{\footnotesize ptlcall\_switch\_to\_sim})
within the benchmark source code and recompiling, the \texttt{\footnotesize -trigger}
PTLsim option can be used to run the code on the host CPU until the
trigger point is reached. If the source code is unavailable, the \texttt{\footnotesize -startrip}{\footnotesize{}
}\texttt{\emph{\footnotesize 0xADDRESS}} option will start full simulation
only at a specified address (e.g. function entry point). 

If you want to warm up the cache and branch predictors prior to starting
statistics collection, combine the \texttt{\footnotesize -trigger}
option with the \texttt{\footnotesize -snapshot-cycles}{\footnotesize{}
}\texttt{\emph{\footnotesize N}} option, to start full simulation
at the top of the benchmark's main loop (where the trigger call is),
but only start gathering statistics \emph{N} cycles later, after the
processor is warmed up. Remember, since the trigger point is placed
\emph{after} all initialization code in the benchmark, in general
it is only necessary to use 10-20 million cycles of warmup time before
taking the first statistics snapshot. In this time, the caches and
branch predictor will almost always be completely overwritten many
times. This approach significantly speeds up the simulation without
any loss of accuracy compared to the \char`\"{}fast simulation\char`\"{}
mode provided by other simulators. 

In PTLstats, use the \texttt{\footnotesize -subtract} option to make
sure the final statistics don't include the warmup period before the
first snapshot. To subtract the final snapshot from snapshot 0 (the
first snapshot after the warmup period), use a command similar to
the following:

\begin{lyxcode}
{\footnotesize ptlstats~-subtract~0~ptlsim.stats}{\footnotesize \par}
\end{lyxcode}

\section{\label{sec:SequentialMode}Sequential Mode}

PTLsim supports \emph{sequential mode}, in which instructions are
run on a simple, in-order processor model (in \texttt{\footnotesize seqcore.cpp})
without accounting for cache misses, branch mispredicts and so forth.
This is much faster than the out of order model, but is obviously
slower than native execution. The purpose of sequential mode is mainly
to aid in testing the x86 to uop decoder, microcode functions and
RTL-level uop implementation code. It may also be useful for gathering
certain statistics on the instruction mix and count without running
a full simulation.

\emph{NOTE:} Sequential mode is \emph{not} intended as a {}``warmup
mode'' for branch predictors and caches. If you want this behavior,
use statistical snapshot deltas as described in Section \ref{sec:SimulationWarmupPeriods}. 

Sequential mode is enabled by specifying the {}``\texttt{\footnotesize -core
seq}'' option. It has no other core-specific options.


\part{\label{sec:PTLsimClassic}PTLsim Classic: Userspace Linux Simulation}


\chapter{Getting Started with PTLsim}

\emph{NOTE:} This part of the manual is relevant only if you are using
the classic userspace-only version of PTLsim. If you are looking for
the full system SMP/SMT version, PTLsim/X, please skip this entire
part and read Part \ref{sec:PTLsimFullSystem} instead.


\section{Building PTLsim}

Prerequisites:

\begin{itemize}
\item PTLsim can be built on \textbf{both 64-bit x86-64 machines} (AMD Athlon
64 / Opteron / Turion, Intel Pentium 4 with EM64T and Intel Core 2)
\textbf{as well as ordinary 32-bit x86 systems}. In either case, your
system must support SSE2 instructions; all modern CPUs made in the
last few years (such as Pentium 4 and Athlon 64) support this, but
older CPUs (Pentium III and earlier) specifically do \emph{not} support
PTLsim.
\item If built for x86-64, PTLsim will run both 64-bit and 32-bit programs
automatically. If built on a 32-bit Linux distribution and compiler,
PTLsim only supports ordinary x86 programs and will typically be slower
than the 64-bit build\textbf{,} even on 32-bit user programs.
\item PTLsim runs on any recent Linux 2.6 based distribution.
\item We have successfully built PTLsim with gcc 3.3, 3.4.x and 4.1.x+ (gcc
4.0.x has documented bugs affecting some of our code).
\end{itemize}
Quick Start Steps:

\begin{itemize}
\item Download PTLsim from our web site (\texttt{\footnotesize http://www.ptlsim.org/download.php}).
We recommend starting with the {}``stable'' version, since this
contains all the files you need and can be updated later if desired.
\item Unpack \texttt{\footnotesize ptlsim-2006xxxx-rXXX.tar.gz} to create
the \texttt{\footnotesize ptlsim} directory.
\item Run \texttt{\footnotesize make}.

\begin{itemize}
\item The Makefile will detect your platform and automatically compile the
correct version of PTLsim (32-bit or 64-bit).
\end{itemize}
\end{itemize}

\section{\label{sec:RunningPTLsim}Running PTLsim}

PTLsim invocation is very simple: after compiling the simulator and
making sure the \texttt{\small ptlsim} executable is in your path,
simply run:

\begin{quote}
\texttt{\footnotesize ptlsim}\texttt{\small ~} \emph{full-path-to-executable}
\emph{arguments...}
\end{quote}
PTLsim reads configuration options for running various user programs
by looking for a configuration file named \texttt{\footnotesize /home/}\texttt{\emph{\footnotesize username}}\texttt{\footnotesize /.ptlsim/}\texttt{\emph{\footnotesize path/to/program/executablename}}\texttt{\footnotesize .conf}.
To set options for each program, you'll need to create a directory
of the form \texttt{\footnotesize /home/}\texttt{\emph{\footnotesize username}}\texttt{\footnotesize /.ptlsim}
and make sub-directories under it corresponding to the full path to
the program. For example, to configure \texttt{\footnotesize /bin/ls}
you'll need to run \char`\"{}\texttt{\footnotesize mkdir /home/}\texttt{\emph{\footnotesize username}}\texttt{\footnotesize /.ptlsim/bin}''
and then edit \char`\"{}\texttt{\footnotesize /home/}\texttt{\emph{\footnotesize username}}\texttt{\footnotesize /.ptlsim/bin/ls.conf}\char`\"{}
with the appropriate options. For example, try putting the following
in \texttt{\footnotesize ls.conf} as described:

\begin{quote}
\texttt{\footnotesize -logfile ptlsim.log -loglevel 9 -stats ls.stats
-stopinsns 10000}{\footnotesize \par}
\end{quote}
Then run:

\begin{quote}
\texttt{\footnotesize ptlsim /bin/ls -la}{\footnotesize \par}
\end{quote}
PTLsim should display its system information banner, then the output
of simulating the directory listing. With the options above, PTLsim
will simulate \texttt{\footnotesize /bin/ls} starting at the first
x86 instruction in the dynamic linker's entry point, run until 10000
x86 instructions have been committed, and will then switch back to
native mode (i.e. the user code will run directly on the real processor)
until the program exits. During this time, it will compile an extensive
log of the state of every micro-operation executed by the processor
and will save it to {}``\texttt{\footnotesize ptlsim.log}'' in the
current directory. It will also create {}``\texttt{\footnotesize ls.stats}'',
a binary file containing snapshots of PTLsim's internal performance
counters. The \texttt{\footnotesize ptlstats} program (Chapter \ref{sec:StatisticsInfrastructure})
can be used to print and analyze these statistics by running {}``\texttt{\footnotesize ptlstats
ls.stats}''.


\section{\label{sec:ConfigurationOptions}Configuration Options}

PTLsim supports a variety of options in the configuration file of
each program; you can run {}``\texttt{\footnotesize ptlsim}'' without
arguments to get a full list of these options. The following sections
only list the most useful options, rather than every possible option.

The configuration file can also contain comments (starting with {}``\texttt{\footnotesize \#}''
at any point on a line) and blank lines; the first non-comment line
is used as the active configuration.

PTLsim supports multiple models of various microprocessor cores; the
{}``\texttt{\footnotesize -core} \emph{corename}'' option can be
used to choose a specific core. The default core is {}``\texttt{\footnotesize ooo}'',
the dynamically scheduled out of order superscalar core described
in great detail in Part \ref{part:OutOfOrderModel}. PTLsim also comes
with a simple sequential in-order core, {}``\texttt{\footnotesize seq}''.
It is most useful for debugging decoding and microcode issues rather
than actual performance profiling.


\section{Logging Options}

PTLsim can log all simulation events to a log file, or can be instructed
to log only a subset of these events, starting and stopping at various
points:

\begin{itemize}
\item \texttt{\footnotesize -logfile} \emph{filename}


Specifies the file to which log messages will be written.

\item \texttt{\footnotesize -loglevel} \emph{level}


Selects a subset of the events that will be logged:

\begin{itemize}
\item 0 disables logging
\item 1 displays only critical events (such as system calls and state changes)
\item 2-3 displays less critical simulator-wide events
\item 4 displays major events within the core itself (like pipeline flushes,
basic block decodes, etc)
\item 6 displays \emph{all} events that occur within each pipeline stage
of the core every cycle
\item 99 displays every possible event. This will create massive log files!
\end{itemize}
\item \texttt{\footnotesize -startlog} \emph{cycle}


Starts logging only after \emph{cycle} cycles have elapsed from the
start of the simulation.

\item \texttt{\footnotesize -startlogrip} \emph{rip}


Starts logging only after the first time the instruction at \emph{rip}
is decoded or executed. This is mutually exclusive with \texttt{\footnotesize -startlog}.

\end{itemize}

\section{\label{sec:EventLogRingBuffer}Event Log Ring Buffer}

PTLsim also maintains an event log ring buffer. Every time the core
takes some action (for instance, dispatching an instruction, executing
a store, committing a result or annulling each uop after an exception),
it writes that event to a circular buffer that contains (by default)
the last 32768 events in chronological order (oldest to newest). This
is extremely useful for debugging in cases where you want to {}``look
backwards in time'' from the point where a specific but unknown {}``bad''
event occurred, but cannot leave logging at e.g. {}``\texttt{\footnotesize -loglevel
99}'' enabled all the time (because it is far too slow and space
consuming).

The event log ring buffer must be enabled via the \texttt{\footnotesize -ringbuf}
option. This is disabled by default since it exacts a 25-40\% performance
overhead (but this is much better than the 10000\%+ overhead of full
logging).

PTLsim will always print the ring buffer to the log file whenever:

\begin{itemize}
\item Any \texttt{\footnotesize assert} statement fails within the out of
order simulator core;
\item Any fatal exception occurs;
\item At user-specified points, by inserting {}``\texttt{\footnotesize core.eventlog.print(logfile);}''
anywhere within the code;
\item Whenever the {}``\texttt{\footnotesize -ringbuf-trigger-rip} \emph{rip}''
option is used to specify a specific trigger RIP. When the last uop
at this RIP is committed, the ring buffer is printed, exposing all
events that happened over the past few thousand cycles (going backwards
in time from the cycle in which the trigger instruction committed)
\item The event log ring buffer is automatically enabled whenever \texttt{\footnotesize -loglevel}
is 6 or higher; in this case all events are logged to the logfile
after every cycle.
\end{itemize}

\section{Simulation Start Points}

Normally PTLsim starts in simulation mode at the first instruction
in the target program (or the Linux dynamic linker, assuming the program
is dynamically linked). It may be desirable to skip time-consuming
initialization parts of the program, using one of two methods.

The \texttt{\footnotesize -startrip} \emph{rip} option places a breakpoint
at \emph{rip}, then immediately switches to native mode until that
breakpoint is hit, at which point PTLsim begins simulation.

Alternatively, if the source code to the program is available, it
may be recompiled with call(s) to a special function, \texttt{\footnotesize ptlcall\_switch\_to\_sim()},
provided in \texttt{\footnotesize ptlcalls.h}. PTLsim is then started
with the \texttt{\footnotesize -trigger} option, which switches it
to native mode until the first call to the \texttt{\footnotesize ptlcall\_switch\_to\_sim()}
function, at which point simulation begins. This function, and other
special code that can be used within the target program, is described
in Section \ref{sec:TriggerMode}.


\section{Simulation Stop Points}

By default, PTLsim continues executing in simulation mode until the
target program exits on its own. However, typically programs are profiled
for a fixed number of committed x86 instructions, or until a specific
point is reached, so as to ensure an identical span of instructions
is executed on every trial, without waiting for the entire program
to finish. The following options support this behavior:

\begin{itemize}
\item \texttt{\footnotesize -stopinsns} \emph{insns} will stop the simulation
after \emph{insns} x86 instructions have committed.
\item \texttt{\footnotesize -stop} \emph{cycles} stops after \emph{cycles}
cycles have been simulated.
\item \texttt{\footnotesize -stoprip} \emph{rip} stops after the instruction
at rip is decoded and executed the first time.
\end{itemize}
PTLsim will normally switch back to native mode after finishing simulation.
If the program should be terminated instead, the \texttt{\footnotesize -exitend}
option will do so.

The node is at the root of the statistics tree (typically this only
applies to the PTLsimStats structure itself)


\section{Statistics Collection}

PTLsim supports the collection of a wide variety of statistics and
counters as it simulates your code, and can make regular or triggered
snapshots of the counters. Chapter \ref{sec:StatisticsInfrastructure}
describes this support, while Section \ref{sec:StatisticsOptions}
documents the configuration options associated with statistics collection,
including \texttt{\footnotesize -stats}, \texttt{\footnotesize -snapshot-cycles},
\texttt{\footnotesize -snapshot-now}.

\begin{lyxcode}



\end{lyxcode}

\chapter{\label{sec:PTLsimInternals}PTLsim Classic Internals}


\section{\label{sec:Injection}Low Level Startup and Injection}

\emph{Note:} This section deals with the internal operation of the
PTLsim low level code, independent of the out of order simulation
engine. If you are only interested in modifying the simulator itself,
you can skip this section.

\emph{Note:} This section does not apply to the full system PTLsim/X;
please see the corresponding sections in Part \ref{sec:PTLsimFullSystem}
instead.


\subsection{\label{sub:Injection-On-x86-64}Startup on x86-64}

PTLsim is a very unusual Linux program. It does its own internal memory
management and threading without help from the standard libraries,
injects itself into other processes to take control of them, and switches
between 32-bit and 64-bit mode within a single process image. For
these reasons, it is very closely tied to the Linux kernel and uses
a number of undocumented system calls and features only available
in late 2.6 series kernels. 

PTLsim always starts and runs as a 64-bit process even when running
32-bit threads; it context switches between modes as needed. The statically
linked \texttt{\small ptlsim} executable begins executing at \texttt{\small ptlsim\_preinit\_entry}
in \texttt{\small lowlevel-64bit.S}. This code calls \texttt{\small ptlsim\_preinit()}
in \texttt{\small kernel.cpp} to set up our custom memory manager
and threading environment before any standard C/C++ functions are
used. After doing so, the normal \texttt{\small main()} function is
invoked.

The \texttt{\small ptlsim} binary can run in two modes. If executed
from the command line as a normal program, it starts up in \emph{inject}
mode. Specifically, \texttt{\small main()} in \texttt{\small ptlsim.cpp}
checks if the \texttt{\small inside\_ptlsim} variable has been set
by \texttt{\small ptlsim\_preinit\_entry}, and if not, PTLsim enters
inject mode. In this mode, \texttt{\small ptlsim\_inject()} in \texttt{\small kernel.cpp}
is called to effectively inject the \texttt{\small ptlsim} binary
into another process and pass control to it before even the dynamic
linker gets to load the program. In \texttt{\small ptlsim\_inject()},
the PTLsim process is forked and the child is placed under the parent's
control using \texttt{\small ptrace()}. The child process then uses
\texttt{\small exec()} to start the user program to simulate (this
can be either a 32-bit or 64-bit program). 

However, the user program starts in the stopped state, allowing \texttt{\small ptlsim\_inject()}
to use \texttt{\small ptrace()} and related functions to inject either
32-bit or 64-bit boot loader code directly into the user program address
space, overwriting the entry point of the dynamic linker. This code,
derived from \texttt{\small injectcode.cpp} (specifically compiled
as \texttt{\small injectcode-32bit.o} and \texttt{\small injectcode-64bit.o})
is completely position independent. Its sole function is to map the
rest of \texttt{\small ptlsim} into the user process address space
at virtual address \texttt{\small 0x70000000} and set up a special
\texttt{\small LoaderInfo} structure to allow the master PTLsim process
and the user process to communicate. The boot code also restores the
old code at the dynamic linker entry point after relocating itself.
Finally, \texttt{\small ptlsim\_inject()} adjusts the user process
registers to start executing the boot code instead of the normal program
entry point, and resumes the user process.

At this point, the PTLsim image injected into the user process exists
in a bizarre environment: if the user program is 32 bit, the boot
code will need to switch to 64-bit mode before calling the 64-bit
PTLsim entrypoint. Fortunately x86-64 and the Linux kernel make this
process easy, despite never being used by normal programs: a regular
far jump switches the current code segment descriptor to \texttt{\small 0x33},
effectively switching the instruction set to x86-64. For the most
part, the kernel cannot tell the difference between a 32-bit and 64-bit
process: as long as the code uses 64-bit system calls (i.e. \texttt{\small syscall}
instruction instead of \texttt{\small int 0x80} as with 32-bit system
calls), Linux assumes the process is 64-bit. There are some subtle
issues related to signal handling and memory allocation when performing
this trick, but PTLsim implements workarounds to these issues.

After entering 64-bit mode if needed, the boot code passes control
to PTLsim at \texttt{\small ptlsim\_preinit\_entry}. The \texttt{\small ptlsim\_preinit()}
function checks for the special \texttt{\small LoaderInfo} structure
on the stack and in the ELF header of PTLsim as modified by the boot
code; if these structures are found, PTLsim knows it is running inside
the user program address space. After setting up memory management
and threading, it captures any state the user process was initialized
with. This state is used to fill in fields in the global \texttt{\small ctx}
structure of class \texttt{\small CoreContext}: various floating point
related fields and the user program entry point and original stack
pointer are saved away at this point. If PTLsim is running inside
a 32-bit process, the 32-bit arguments, environment and kernel auxiliary
vector array (auxv) need to be converted to their 64-bit format for
PTLsim to be able to parse them from normal C/C++ code. Finally, control
is returned to \texttt{\small main()} to allow the simulator to start
up normally.


\subsection{Startup on 32-bit x86}

The PTLsim startup process on a 32-bit x86 system is essentially a
streamlined version of the process above (Section \ref{sub:Injection-On-x86-64}),
since there is no need for the same PTLsim binary to support both
32-bit and 64-bit user programs. The injection process is very similar
to the case where the user program is always a 32-bit program.


\section{Simulator Startup}

In \texttt{\footnotesize kernel.cpp}, the \texttt{\footnotesize main()}
function calls \texttt{\footnotesize init\_config()} to read in the
user program specific configuration as described in Sections \ref{sec:RunningPTLsim}
and \ref{sec:ConfigurationOptions}, then starts up the various other
simulator subsystems. If one of the \texttt{\footnotesize -excludeld}
or \texttt{\footnotesize -startrip} options were given, a breakpoint
is inserted at the RIP address where the user process should switch
from native mode to simulation mode (this may be at the dynamic linker
entry point by default).

Finally, \texttt{\footnotesize switch\_to\_native\_restore\_context()}
is called to restore the state that existed before PTLsim was injected
into the process and return to the dynamic linker entry point. This
may involve switching from 64-bit back to 32-bit mode to start executing
the user process natively as discussed in Section \ref{sec:Injection}.

After native execution reaches the inserted breakpoint thunk code,
the code performs a 32-to-64-bit long jump back into PTLsim, which
promptly restores the code underneath the inserted breakpoint thunk.
At this point, the \texttt{\footnotesize switch\_to\_sim()} function
in \texttt{\footnotesize kernel.cpp} is invoked to actually begin
the simulation. This is done by calling \texttt{\footnotesize simulate()}
in \texttt{\footnotesize ptlsim.cpp}.

At some point during simulation, the user program or the configuration
file may request a switch back to native mode for the remainder of
the program. In this case, the \texttt{\footnotesize switch\_to\_native\_restore\_context()}
function gets called to save the statistics data store, map the PTLsim
internal state back to the x86 compatible external state and return
to the 32-bit or 64-bit user code, effectively removing PTLsim from
the loop.

While the real PTLsim user process is running, the original PTLsim
injector process simply waits in the background for the real user
program with PTLsim inside it to terminate, then returns its exit
code.


\section{\label{sec:AddressSpaceSimulation}Address Space Simulation}

PTLsim maintains the \texttt{\footnotesize AddressSpace} class as
global variable \texttt{\footnotesize asp} (see \texttt{\footnotesize kernel.cpp})
to track the attributes of each page within the virtual address space.
When compiled for x86-64 systems, PTLsim uses Shadow Page Access Tables
(SPATs), which are essentially large two-level bitmaps. Since pages
are 4096 bytes in size, each 64 kilobyte chunk of the bitmap can track
2 GB of virtual address space. In each SPAT, each top level array
entry points to a chunk mapping 2 GB, such that with 131072 top level
pointers, the full 48 bit virtual address space can typically be mapped
with under a megabyte of SPAT chunks, assuming the address space is
sparse.

When compiled for 32-bit x86 systems, each SPAT is just a 128 KByte
bitmap, with one bit for each of the 1048576 4 KB pages in the 4 GB
address space.

In the AddressSpace structure, there are separate SPAT tables for
readable pages (\texttt{\footnotesize readmap} field), writable pages
(\texttt{\footnotesize writemap} field) and executable pages (\texttt{\footnotesize execmap}
field). Two additional SPATs, \texttt{\footnotesize dtlbmap} and \texttt{\footnotesize itlbmap},
are used to track which pages are currently mapped by the simulated
translation lookaside buffers (TLBs); this is discussed further in
Section \ref{sec:TranslationLookasideBuffers}.

When running in native mode, PTLsim cannot track changes to the process
memory map made by native calls to \texttt{\footnotesize mmap()},
\texttt{\footnotesize munmap()}, etc. Therefore, at every switch from
native to simulation mode, the \texttt{\footnotesize resync\_with\_process\_maps()}
function is called. This function parses the \texttt{\footnotesize /proc/self/maps}
metafile maintained by the kernel to build a list of all regions mapped
by the current process. Using this list, the SPATs are rebuilt to
reflect the current memory map. This is absolutely critical for correct
operation, since during simulation, speculative loads and stores will
only read and write memory if the appropriate SPAT indicates the address
is accessible to user code. If the SPATs become out of sync with the
real memory map, PTLsim itself may crash rather than simply marking
the offending load or store as invalid. The \texttt{\footnotesize resync\_with\_process\_maps()}
function (or more specifically, the \texttt{\footnotesize mqueryall()}
helper function) is fairly kernel version specific since the format
of \texttt{\footnotesize /proc/self/maps} has changed between Linux
2.6.x kernels. New kernels may require updating this function.


\section{\label{sec:DebuggingHints}Debugging Hints}

When adding or modifying PTLsim, bugs will invariably crop up. Fortunately,
PTLsim provides a trivial way to find the location of bugs which silently
corrupt program execution. Since PTLsim can transparently switch between
simulation and native mode, isolating the divergence point between
the simulated behavior and what a real reference machine would do
can be done through binary search. The \texttt{\footnotesize -stopinsns}
configuration option can be set to stop simulation before the problem
occurs, then incremented until the first x86 instruction to break
the program is determined.

The out of order simulator (\texttt{\footnotesize ooocore.cpp}) includes
extensive debugging and integrity checking assertions. These may be
turned off by default for improved performance, but they can be easily
re-enabled by defining the \texttt{\footnotesize ENABLE\_CHECKS} symbol
at the top of \texttt{\footnotesize ooocore.cpp}, \texttt{\footnotesize ooopipe.cpp}
and \texttt{\footnotesize oooexec.cpp}. Additional check functions
are in the code but commented out; these may be used as well.

You can also debug PTLsim with \texttt{\small gdb}, although the process
is non-standard due to PTLsim's co-simulation architecture:

\begin{itemize}
\item Start PTLsim on the target program like normal. Notice the \texttt{\footnotesize Thread}{\footnotesize{}
}\texttt{\emph{\footnotesize N}}{\footnotesize{} }\texttt{\footnotesize is
running in XX-bit mode} message printed at startup: this is the PID
you will be debugging, not the {}``\texttt{\small ptlsim}'' process
that may also be running.
\item Start GDB and type {}``\texttt{\footnotesize attach 12345}'' if
\emph{12345} was the PID listed above
\item Type {}``\texttt{\footnotesize symbol-file ptlsim}'' to load the
PTLsim internal symbols (otherwise gdb only knows about the benchmark
code itself). You should specify the full path to the PTLsim executable
here.
\item You're now debugging PTLsim. If you run the {}``\texttt{\small bt}''
command to get a backtrace, it should show the PTLsim functions starting
at address 0x70000000.
\end{itemize}
If the backtrace does not display enough information, go to the \texttt{\footnotesize Makefile}
and enable the \char`\"{}no optimization\char`\"{} options (the \char`\"{}-O0\char`\"{}
line instead of \char`\"{}-O99\char`\"{}) since that will make more
debugging information available to you.

The {}``\texttt{\footnotesize -pause-at-startup} \emph{seconds}''
configuration option may be useful here, to give you time to attach
with a debugger before starting the simulation.


\section{\label{sec:Timing}Timing Issues}

PTLsim uses the \texttt{\footnotesize CycleTimer} class extensively
to gather data about its own performance using the CPU's timestamp
counter. At startup in \texttt{\small superstl.cpp}, the CPU's maximum
frequency is queried from the appropriate Linux kernel sysfs node
(if available) or from \texttt{\small /proc/cpuinfo} if not. Processors
which dynamically scale their frequency and voltage in response to
load (like all Athlon 64 and K8 based AMD processors) require special
handling. It is assumed that the processor will be running at its
maximum frequency (as reported by sysfs) or a fixed frequency (as
reported by \texttt{\small /proc/cpuinfo}) throughout the majority
of the simulation time; otherwise the timing results will be bogus.


\section{External Signals and PTLsim}

PTLsim can be forced to switch between native mode and sequential
mode by sending it standard Linux-style signals from the command line.
If your program is called {}``myprogram'', start it under PTLsim
and run this command from another terminal:

\begin{lyxcode}
{\footnotesize killall~-XCPU~}\emph{\footnotesize myprogram}{\footnotesize \par}
\end{lyxcode}
This will force PTLsim to switch between native mode and simulation
mode, depending on its current mode. It will print a message to the
console and the logfile when you do this. The initial mode (native
or simulation) is determined by the presence of the \texttt{\footnotesize -trigger}
option: with \texttt{\footnotesize -trigger}, the program starts in
native mode until the trigger point (if any) is reached.


\part{\label{sec:PTLsimFullSystem}PTLsim/X: Full System SMP/SMT Simulation}


\chapter{Background}


\section{Virtual Machines and Full System Simulation}

Full system simulation and virtualization has been around since the
dawn of computers. Typically \emph{virtual machine} software is used
to run \emph{guest} operating systems on a physical \emph{host} system,
such that the guest believes it is running directly on the bare hardware.
Modern full system simulators in the x86 world can be roughly divided
into two groups (this paper does not consider systems for other instruction
sets).

\emph{Hypervisors} execute most unprivileged instructions on the native
CPU at full speed, but trap privileged instructions used by the operating
system kernel, where they are emulated by hypervisor software so as
to maintain isolation between virtual machines and make the virtual
machine nearly indistinguishable from the real CPU. In some cases
(particularly on x86), additional software techniques are needed to
fully hide the hypervisor from the guest OS.

\begin{itemize}
\item \emph{Xen} \cite{Xen2Overview,Xen3,XenCambridge,XenIntroWiki,XenPerformance,XenSource}
represents the current state of the art in this field; it will be
described in great detail later on.
\item \emph{VMware} \cite{VMware} is a very well known commercial product
that allows unmodified x86 operating systems to run inside a virtual
machine. Because the x86 instruction set is not fully virtualizable,
VMware must employ x86-to-x86 binary translation techniques on kernel
code (but not user mode code) to make the virtual CPU indistinguishable
from the real CPU for compatibility reasons. These translations are
typically cached in a hidden portion of the guest address space to
improve performance compared to simply interpreting sensitive x86
instructions. While this approach is sophisticated and effective,
it exacts a heavy performance penalty on I/O intensive workloads \cite{XenPerformance}.
Interestingly, the latest microprocessors from Intel and AMD include
hardware features (Intel VT \cite{Intel-VT}, AMD SVM \cite{AMD-SVM})
to eliminate the binary translation and patching overhead. Xen fully
supports these technologies to allow running Windows and other OS's
at full speed, while VMware has yet to include full support.


VMware comes in two flavors. ESX is a true hypervisor that boots on
the bare hardware underneath the first guest OS. GSX and Workstation
use a userspace frontend process containing all virtual device drivers
and the binary translator, while the \emph{vmmon} kernel module (open
source in the Linux version) handles memory virtualization and context
switching tasks similar to Xen.

\item Several other products, including Virtual PC and Parallels, provide
features similar to VMware using similar technology.
\item \emph{KVM} (Kernel Virtual Machine) is a new hypervisor infrastructure
built into all Linux kernels after 2.6.19. It depends on the hardware
virtualization extensions (Intel VT and AMD SVM) built into modern
x86 chips, whereas Xen and VMware also support running on older processors
without special hardware support. KVM is an attractive foundation
for future virtual machine development since it's built into Linux
(so it requires far less setup work than Xen or VMware) and provides
excellent performance.
\end{itemize}
Unlike hypervisors, \emph{simulators} perform cycle accurate execution
of x86 instructions using interpreter software, without running any
guest instructions on the native CPU.

\begin{itemize}
\item \emph{Bochs} \cite{Bochs} is the most well known open source x86
simulator; it is considered to be a nearly RTL (register transfer
language) level description of every x86 behavior from legacy 16-bit
features up through modern x86-64 instructions. \emph{Bochs} is very
useful for the functional validation of real x86 microprocessors,
but it is very slow (around 5-10 MHz equivalent) and is not useful
for implementing cycle accurate models of modern uop-based out of
order x86 processors (for instance, it does not model caches, memory
latency, functional units and so on).
\item \emph{QEMU} \cite{QEMU} is similar in purpose to VMware, but unlike
VMware, it supports multiple CPU host and guest architectures (PowerPC,
SPARC, ARM, etc). QEMU uses binary translation technology similar
to VMware to hide the hypervisor's presence from the guest kernel.
However, due to its cross platform design, both kernel and user code
is passed through x86-to-x86 binary translation (even on x86 platforms)
and stored in a translation cache. Interestingly, Xen uses a substantial
amount of QEMU code to model common hardware devices when running
unmodified operating systems like Windows, but Xen still uses its
own hardware-assisted technology to actually achieve virtualization.
QEMU supports a proprietary hypervisor module to add VMware's and
Xen's ability to run user mode code natively on the CPU to reduce
the performance penalty; hence it is also in the hypervisor category.
\item \emph{Simics} \cite{Simics} is a commercial simulation suite for
modeling both the functional aspects of various x86 processors (including
vendor specific extensions) as well as user-designed plug-in models
of real hardware devices. It is used extensively in industry for modeling
new hardware and drivers, as well as firmware level debugging. Like
QEMU, Simics uses x86-to-x86 binary translation to instrument code
at a very low level while achieving good performance (though noticeably
slower than a hypervisor provides). Unlike QEMU, Simics is fully extensible
and supports a huge range of real hardware models, but it is not possible
to add cycle accurate simulation features below the x86 instruction
level, making it less useful to microarchitects (both because of technical
considerations as well as its status as a closed source product).
\item \emph{SimNow} \cite{SimNow} is an AMD simulation tool used during
the design and validation of AMD's x86-64 hardware. Like Simics, it
is a functional simulator only, but it models a variety of AMD-built
hardware devices. SimNow uses x86-to-x86 binary translation technology
similar to Simics and QEMU to achieve good performance. Because SimNow
does not provide cycle accurate timing data, AMD uses its own TSIM
trace-based simulator, derived from the K8 RTL, to do actual validation
and timing studies. SimNow is available for free to the public, albeit
as closed source.
\end{itemize}
All of these tools share one common disadvantage: they are unable
to model execution at a level below the granularity of x86 instructions,
making them unsuitable to microarchitects. PTLsim/X seeks to fill
this void by allowing extremely detailed uop-level cycle accurate
simulation of x86 and x86-64 microprocessor cores, while simultaneously
delivering all the performance benefits of true native-mode hypervisors
like Xen, selective binary translation based hypervisors like VMware
and QEMU, and the detailed hardware modeling capabilities of Bochs
and Simics.


\section{Xen Overview}

Xen \cite{Xen3,Xen2Overview,XenCambridge,XenIntroWiki,XenPerformance,XenSource}
is an open source x86 virtual machine monitor, also known as a \emph{hypervisor}.
Each virtual machine is called a {}``domain'', where domain 0 is
privileged and accesses all hardware devices using the standard drivers;
it can also create and directly manipulate other domains. Guest domains
typically do not have hardware access do not have this access; instead,
they relay requests back to domain 0 using Xen-specific virtual device
drivers. Each guest can have up to 32 VCPUs (virtual CPUs). Xen itself
is loaded into a reserved region of physical memory before loading
a Linux kernel as domain 0; other operating systems can run in guest
domains. Xen is famous for having essentially zero overhead due to
its unique and well planned design; it's possible to run a normal
workstation or server under Xen with full native performance.

Under Xen's {}``paravirtualized'' mode, the guest OS runs on an
architecture nearly identical to x86 or x86-64, but a few small changes
(critical to preserving native performance levels) must be made to
low-level kernel code, similar in scope to adding support for a new
type of system chipset or CPU manufacturer (e.g. instead of an AMD
x86-64 on an nVidia chipset, the kernel would need to support a Xen-extended
x86-64 CPU on a Xen virtual {}``chipset''). These changes mostly
concern page tables and the interrupt controller:

\begin{itemize}
\item Paging is always enabled, and any physical pages (called {}``machine
frame numbers'', or MFNs) used to form a page table must be marked
read-only (a.k.a. {}``pinned'') everywhere. Since the processor
can only access a physical page if it's referenced by some page table,
Xen can guarantee memory isolation between domains by forcing the
guest kernel to replace any writes to page table pages with special
\emph{mmu\_update()} hypercalls (a.k.a. system calls into Xen itself).
Xen makes sure each update points to a page owned by the domain before
updating the page table. This approach has essentially zero performance
loss since the guest kernel can read its own page tables without any
further indirections (i.e. the page tables point to the actual physical
addresses), and hypercalls are only needed for batched updates (e.g.
validating a new page table after a \emph{fork()} requires only a
single hypercall).

\begin{itemize}
\item Xen also supports \emph{pseudo-physical} pages, which are consecutively
numbered from 0 to some maximum (i.e. 65536 for a 256 MB domain).
This is required because most kernels (including Linux and Windows)
do not support {}``sparse'' (discontiguous) physical memory ranges
very well (remember that every domain can still address every physical
page, including those of other domains - it just can't access all
of them). Xen provides pseudo-to-machine (P2M) and machine-to-pseudo
(M2P) tables to do this mapping. However, the physical page tables
still continue to reference physical addresses and are fully visible
to the guest kernel; this is just a convenience feature.
\item Xen can save an entire domain to disk, then restore it later starting
at that checkpoint. Since Xen tracks every read-only page that's part
of some page table, it can restore domains even if the original physical
pages are now used by something else: it automatically remaps all
MFNs in every page table page it knows about (but the guest kernel
must never store machine page numbers outside of page table pages
- it's the same concept as in garbage collection, where pointers must
only reside in the obvious places).
\item Xen can migrate running domains between machines by tracking which
physical pages become dirty as the domain executes. Xen uses \emph{shadow
page tables} for this: it makes copy-on-write duplicates of the domain's
page tables, and presents these internal tables to the CPU, while
the guest kernel still thinks it's using the original page tables.
Once the migration is complete, the shadow page tables are merged
back into the real page tables (as with a save and restore) and the
domain continues as usual.
\item The memory allocation of each domain is elastic: the domain can give
any free pages back to Xen via the {}``balloon'' mechanism; these
pages can then be re-assigned to other domains that need more memory
(up to a per-domain limit).
\item Domains can share some of their pages with other domains using the
\emph{grant mechanism.} This is used for zero-copy network and disk
I/O between domain 0 and guest domains.
\end{itemize}
\item Interrupts are delivered using an \emph{event channel} mechanism,
which is functionally identical to the IO-APIC hardware on the bare
CPU (essentially it's a {}``Xen APIC'' instead of the Intel and
AMD models already supported by the guest kernel). Xen sets up a \emph{shared
info} page containing bit vectors for masked and pending interrupts
(just like an APIC's memory mapped registers), and lets the guest
kernel register an event handler function. Xen then does an upcall
to this function whenever a virtual interrupt arrives; the guest kernel
manipulates the mask and pending bits to ensure race-free notifications.
Xen automatically maps physical IRQs on the APIC to event channels
in domain 0, plus it adds its own virtual interrupts (for the usual
timer and a Xen-specific notification port; use \emph{cat /proc/interrupts}
on a Linux system under Xen to see this). When the guest domain has
multiple VCPUs, interprocessor interrupts (IPIs) are done through
the Xen event controller in a manner identical to hardware IPIs.

\begin{itemize}
\item Xen is unique in that PCI devices can be assigned to any domain, so
for instance each guest domain could have its own dedicated PCI network
card and disk controller - there's no need to relay requests back
to domain 0 in this configuration, although it only works with hardware
that supports IOMMU virtualization (otherwise it's a security risk,
since DMA can be used to bypass Xen's page table protections).
\end{itemize}
\item Xen provides the guest with additional timers, so it can be aware
of both {}``wall clock'' time as well as execution time (since there
may be gaps in the latter as other domains use the CPU); this lets
it provide a smooth interactive experience in a way systems like VMware
cannot. The timers are delivered as virtual interrupt events.
\item All other features of the paravirtualized architecture perfectly match
x86. The guest kernel can still use most x86 privileged instructions,
such as \emph{rdmsr}, \emph{wrmsr}, and control register updates (which
Xen transparently intercepts and validates), and in domain 0, it can
access I/O ports, memory mapped I/O, the normal x86 segmentation (GDT
and LDT) and interrupt mechanisms (IDT), etc. This makes it possible
to run a normal Linux distribution, with totally unmodified drivers
and software, at full native speed (we do just this on all our development
workstations and servers). Benchmarks \cite{XenPerformance} have
shown Xen to have \textasciitilde{}2-3\% performance decrease relative
to a traditional Linux kernel, where as VMware and similar solutions
yield a 20-70\% decrease under heavy I/O.
\end{itemize}
Xen also supports {}``HVM'' (hardware virtual machine) mode, which
is equivalent to what VMware \cite{VMware}, QEMU \cite{QEMU}, Bochs
\cite{Bochs} and similar systems provide: nearly perfect emulation
of the x86 architecture and some standard peripherals. The advantage
is that an uncooperative guest OS never knows it's running in a virtual
machine: Windows XP and Mac OS X have been successfully run inside
Xen in this mode. Unfortunately, this mode has a well known performance
cost, even when Xen leverages the specialized hardware support for
full virtualization in newer Intel \cite{Intel-VT} and AMD \cite{AMD-SVM}
chips. The overhead comes from the requirement that the hypervisor
still trap and emulate all sensitive instructions, whereas paravirtualized
guests can intelligently batch together requests in one hypercall
and can avoid virtual device driver overhead.


\chapter{Getting Started with PTLsim/X}

\emph{NOTE:} This part of the manual is relevant only if you are using
the full-system PTLsim/X. If you are looking for the userspace-only
version, please skip this entire part and read Part \ref{sec:PTLsimClassic}
instead.

%
\shadowbox{\begin{minipage}[t][1\totalheight]{1\columnwidth}%
\textbf{\emph{WARNING:}} PTLsim/X assumes fairly high level of familiarity
with Xen and the Linux kernel. If you have never compiled your own
Linux kernel or if you are not yet running Xen or are unsure how to
create and use domains, \textbf{STOP NOW} and become familiar with
Xen itself before attempting to use PTLsim/X. The following sections
all assume you are familiar with Xen, at least from a system administration
perspective. We cannot provide support for Xen-related issues unless
they are caused by PTLsim.%
\end{minipage}}%



\section{Building PTLsim/X}

\textbf{Prerequisites:}

\begin{itemize}
\item PTLsim/X requires a \textbf{modern 64-bit x86-64 machine}. This means
an AMD Athlon 64 / Opteron / Turion or an Intel Pentium 4 (specifically
with EM64T) or Intel Core 2\textbf{.} We do \emph{not} plan to offer
a 32-bit version of PTLsim/X due to the technical deficiencies in
32-bit x86 that make it difficult to properly implement a full system
simulator with all of PTLsim's features. Besides, 64-bit hardware
is now the standard (in some cases the only option) from all the major
x86 processor vendors and is very affordable.
\item The 64-bit requirement \emph{only} applies to the host system running
PTLsim/X. Inside the virtual machine, you are still free to use standard
32-bit Linux distributions, applications and so forth under PTLsim/X
\item PTLsim/X assumes you have root access to your machine. The PTLsim/X
hypervisor runs below Linux itself, so you must use a Xen compatible
kernel in domain 0 (more on this later).
\item We \emph{highly recommend} you use a Linux distribution already designed
to work with Xen 3.x. We use SuSE 10.2 and highly recommend it; most
other distributions now support Xen. This requirement only applies
to domain 0 - the virtual machines you'll be running can use any distribution
and do not even need to know about Xen at all (other than the kernel,
which must support Xen hypercalls and block/network drivers).
\item We have successfully built PTLsim/X with gcc 4.1.x+ (gcc 4.0.x has
documented bugs affecting some of our code).
\end{itemize}
\textbf{Quick Start Steps:}

All files listed below can be downloaded from \texttt{\footnotesize http://www.ptlsim.org/download.php}.

%
\shadowbox{\begin{minipage}[t][1\totalheight]{1\columnwidth}%
\textbf{\emph{IMPORTANT:}} The instructions below refer to specific
versions of various files (i.e., Xen hypervisor, Linux kernel, etc.).
We regularly update the versions of these files, and newer PTLsim/X
versions may not work correctly with older kernel and/or hypervisor
versions (i.e. the versions should be matching). The following instructions
are therefore for informational purposes only; always check the PTLsim
web site's download page for the latest versions of these files. The
following versions are correct as of September 20th, 2007%
\end{minipage}}%


\begin{enumerate}
\item \textbf{Set up Xen with PTLsim/X extensions:}

\begin{itemize}
\item \textbf{Download} our modified Xen source tree (\texttt{\footnotesize xen-3.1-ptlsim.tar.bz2})
from \texttt{\footnotesize http://www.ptlsim.org/download.php}. This
is the easiest way to make sure you have the correct PTLsim-compatible
version of Xen with all patches pre-applied.

\begin{itemize}
\item We also provide \texttt{\footnotesize ptlsim-xen-hypervisor.diff}
in case you want to manually apply the patches to a development version
of Xen; the patches are fairly simple and can be adapted as needed.
\end{itemize}
\item \textbf{Build and install} both the Xen hypervisor and the userspace
Xen tools:

\begin{itemize}
\item In \texttt{\footnotesize xen-3.1-ptlsim/xen}, run \texttt{\footnotesize make. }You
can optionally copy the compiled Xen hypervisor (in \texttt{\footnotesize xen/xen})
somewhere else (such as wherever your kernel and initrd files are
stored).
\item In \texttt{\footnotesize xen-3.1-ptlsim/tools}, run \texttt{\footnotesize make},
then run \texttt{\footnotesize make install}.
\end{itemize}
\item \textbf{Download} our sample kernel and modules (\texttt{\footnotesize linux-2.6.22.6-mtyrel-64bit-xen.tar.bz2})
and extract in the root directory (via \texttt{\footnotesize tar jxvf
linux-2.6.22.6-mtyrel-64bit-xen.tar.bz2}) to create \texttt{\footnotesize /lib/modules/2.6.22.6-mtyrel-64bit-xen/....}{\footnotesize \par}

\begin{itemize}
\item This is a SMP kernel based on 2.6.22.6 with the Xen patches maintained
by SuSE Linux. The complete source is in \texttt{\footnotesize linux-2.6.22-mtyrel-source.tar.gz},
if you want to recompile it.
\item This is just a sample kernel we use - PTLsim/X should work even if
you use the Xen-compatible kernel shipped with your Linux distribution
of choice. However, we recommend you run this same kernel in domain
0 as well as in the target domain under simulation, simply because
we know it works correctly and has all the latest Xen patches.
\item In addition, our kernels feature the ability to create Xen checkpoints
and initiate PTLsim actions from within the domain by writing to \texttt{\footnotesize /proc/xen/checkpoint}
and \texttt{\footnotesize /proc/xen/ptlsim}, respectively. The major
changes are in \texttt{\footnotesize linux-2.6.22-mtyrel/patches.mty/linux-2.6.22-xen-self-checkpointing.diff}
if you want to apply them to a different kernel or learn how they
work.
\end{itemize}
\item \textbf{Activate} the new Xen hypervisor and kernel:

\begin{itemize}
\item Install the new kernel and Xen hypervisor in a manner specific to
your distribution. While we cannot provide instructions for every
distribution, on SuSE, you need to run mkinitrd to collect the required
boot drivers like this:

\begin{lyxcode}
{\footnotesize mkinitrd~-k~/lib/modules/2.6.22.6-mtyrel-64bit-xen/linux}~\\
{\footnotesize{}~~~~~~~~~-i~/lib/modules/2.6.22.6-mtyrel-64bit-xen/initrd}~\\
{\footnotesize{}~~~~~~~~~-M~/lib/modules/2.6.22.6-mtyrel-64bit-xen/System.map}{\footnotesize \par}
\end{lyxcode}
\emph{IMPORTANT:} All parts of this command should be on a single
line (this manual makes long lines difficult to show)

\item Edit the GRUB bootloader configuration (usually in \texttt{\footnotesize /boot/grub/menu.lst}
on most distributions) to specify the new Xen-enabled kernel and hypervisor.
The first entry should be similar to:

\begin{lyxcode}
{\footnotesize title~Linux~2.6.22.6-mtyrel-64bit-xen}~\\
{\footnotesize kernel~(hd0,0)/project/xen-3.1-ptlsim/xen/xen~console=vga}~\\
{\footnotesize module~(hd0,0)/lib/modules/2.6.22.6-mtyrel-64bit-xen/linux~root=/dev/...}~\\
{\footnotesize module~(hd0,0)/lib/modules/2.6.22.6-mtyrel-64bit-xen/initrd}{\footnotesize \par}
\end{lyxcode}
Obviously you may need to adjust the file locations, if you're booting
from a different hard drive or compiled Xen in a location other than
\texttt{\footnotesize /project/xen-3.1-ptlsim}.

\end{itemize}
\item \textbf{Reboot}, and make sure the PTLsim/X extensions to Xen are
actually running: {}``\texttt{\footnotesize cat /sys/hypervisor/properties/capabilities}''
should list {}``\texttt{\footnotesize ptlsim}''. If this file doesn't
exist, you're not running under Xen at all.
\end{itemize}
\item \textbf{Set up sample virtual machine and disk images:}

\begin{itemize}
\item \textbf{Download} our pre-configured example disk image (\texttt{\footnotesize ptlsim-disk-image-example.tar.bz2})
and uncompress with \texttt{\footnotesize tar jxvf ptlsim-disk-image-example.tar.bz2}.
The sample scripts inside this archive assume that the files were
extracted into \texttt{\footnotesize /project/ptlsim-disk-image-example}.

\begin{itemize}
\item We recommend placing this disk image on a local hard disk rather than
NFS. However, if you're running Cluster NFS and/or are using the \texttt{\footnotesize no\_root\_squash}
NFS option, it's perfectly fine if you put the disk image on an NFS
volume.
\end{itemize}
\item You already downloaded our Xen-compatible kernel above.
\item The disk image archive contains a sample Xen configuration file (sample-xen-domain)
and some helpful scripts (e.g. \texttt{\footnotesize run-domain},
\texttt{\footnotesize restore-domain}, etc.)
\item Make sure you can create this domain {}``\texttt{\footnotesize xm
create sample-xen-domain -c}''. You should get a console with the
text {}``Welcome to the PTLsim Demo Machine''.
\end{itemize}
\item \textbf{Setup PTLsim itself:}

\begin{itemize}
\item Download the stable version of PTLsim from our web site (in \texttt{\footnotesize ptlsim-2007xxxx-rXXX.tar.gz})
and unpack this file to create the \texttt{\footnotesize ptlsim} directory.
\item Edit the PTLsim \texttt{\footnotesize Makefile} and uncomment the
{}``\texttt{\footnotesize PTLSIM\_HYPERVISOR=1}'' line to enable
full system PTLsim/X support.
\item Run \texttt{\footnotesize make}.

\begin{itemize}
\item If the build process complains about missing header files, make sure
\texttt{\footnotesize /usr/include/xen} is a symlink to \texttt{\footnotesize /project/xen-3.1-ptlsim/tools/libxc/xen}
(or wherever you put the PTLsim-modified \texttt{\footnotesize xen-3.1-ptlsim}
tree you downloaded). Delete \texttt{\footnotesize /usr/include/xen}
beforehand if needed.
\end{itemize}
\end{itemize}
\end{enumerate}

\section{\label{sec:RunningPTLsim}Running PTLsim}

PTLsim is run in domain 0 as root, for instance by using the {}``\texttt{\footnotesize sudo
ptlsim ...}'' command. The \texttt{\footnotesize -domain}{\footnotesize{}
}\texttt{\emph{\footnotesize N}} option is used to specify the domain
to access. The following scenarios show by example how this is done.


\section{Booting Linux under PTLsim}

In the following examples, we will assume the target domain is called
\texttt{\footnotesize ptlvm}.

Start your domain as follows:

\begin{lyxcode}
{\footnotesize sudo~xm~create~domainname~-{}-paused}{\footnotesize \par}

{\footnotesize sudo~xm~list}{\footnotesize \par}

{\footnotesize sudo~xm~console~domainname}{\footnotesize \par}
\end{lyxcode}
The \texttt{\footnotesize -{}-paused} option tells Xen to pause the
domain as soon as it's created, so we can run the entire boot process
under PTLsim.

The \texttt{\footnotesize xm list} command will print the domain ID
assigned to \texttt{\footnotesize ptlvm}. On our test machine, the
output looks like:

\begin{lyxcode}
{\footnotesize yourst~{[}typhoon~/project/ptlsim]~sudo~xm~create~ptlvm~-{}-paused;~sudo~xm~list;~sudo~xm~console~ptlvm;}{\footnotesize \par}

{\footnotesize Using~config~file~\char`\"{}ptlvm\char`\"{}.}{\footnotesize \par}

{\footnotesize Started~domain~ptlvm}{\footnotesize \par}

{\footnotesize Name~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ID~Mem(MiB)~VCPUs~State~~~Time(s)}{\footnotesize \par}

{\footnotesize Domain-0~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~0~~~~~7877~~~~~4~r-{}-{}-{}-{}-~~~~137.9}{\footnotesize \par}

{\footnotesize ptlvm~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~21~~~~~~128~~~~~1~-{}-p-{}-{}-~~~~~~0.0}{\footnotesize \par}
\end{lyxcode}
You may also want to give the PTLsim domain a low priority; otherwise
it may cause the system to respond slowly. This can be done by adding:

\begin{lyxcode}
{\footnotesize sudo~xm~sched-credit~-d~}\emph{\footnotesize ptlvm}{\footnotesize{}~-w~16}{\footnotesize \par}
\end{lyxcode}
Open another console and start PTLsim on this domain (using the domain
ID {}``21'' given in the example above):

\begin{lyxcode}
{\footnotesize sudo~./ptlsim~-domain~ptlvm~-logfile~ptlsim.log~-native}{\footnotesize \par}
\end{lyxcode}
The resulting output:

\begin{lyxcode}
{\footnotesize //}{\footnotesize \par}

{\footnotesize //~~PTLsim:~Cycle~Accurate~x86-64~Full~System~Simulator}{\footnotesize \par}

{\footnotesize //~~Copyright~1999-2007~Matt~T.~Yourst~<yourst@yourst.com>}{\footnotesize \par}

{\footnotesize //}{\footnotesize \par}

{\footnotesize //~~Revision~225~(2007-09-21)}{\footnotesize \par}

{\footnotesize //~~Built~Sep~~21~2007~16:21:36~on~tidalwave.lab.ptlsim.org~using~gcc-4.2}{\footnotesize \par}

{\footnotesize //~~Running~on~typhoon.lab.ptlsim.org}{\footnotesize \par}

{\footnotesize //}{\footnotesize \par}



{\footnotesize Processing~-domain~21~-logfile~ptlsim.log~-native}{\footnotesize \par}

{\footnotesize System~Information:}{\footnotesize \par}

{\footnotesize{}~~Running~on~hypervisor~version~xen-3.0-x86\_64-ptlsim}{\footnotesize \par}

{\footnotesize{}~~Xen~is~mapped~at~virtual~address~0xffff800000000000}{\footnotesize \par}

{\footnotesize{}~~PTLsim~is~running~across~1~VCPUs:}{\footnotesize \par}

{\footnotesize{}~~~~VCPU~0:~2202~MHz}{\footnotesize \par}

{\footnotesize Memory~Layout:}{\footnotesize \par}

{\footnotesize{}~~System:~~~~~~~~~~~~~~~~~524208~pages,~~~~2096832~KB}{\footnotesize \par}

{\footnotesize{}~~Domain:~~~~~~~~~~~~~~~~~~32768~pages,~~~~~131072~KB}{\footnotesize \par}

{\footnotesize{}~~PTLsim~reserved:~~~~~~~~~~8192~pages,~~~~~~32768~KB}{\footnotesize \par}

{\footnotesize{}~~Page~Tables:~~~~~~~~~~~~~~~275~pages,~~~~~~~1100~KB}{\footnotesize \par}

{\footnotesize{}~~PTLsim~image:~~~~~~~~~~~~~~407~pages,~~~~~~~1628~KB}{\footnotesize \par}

{\footnotesize{}~~Heap:~~~~~~~~~~~~~~~~~~~~~7510~pages,~~~~~~30040~KB}{\footnotesize \par}

{\footnotesize{}~~Stack:~~~~~~~~~~~~~~~~~~~~~256~pages,~~~~~~~1024~KB}{\footnotesize \par}

{\footnotesize Interfaces:}{\footnotesize \par}

{\footnotesize{}~~PTLsim~page~table:~~~~~~282898}{\footnotesize \par}

{\footnotesize{}~~Shared~info~mfn:~~~~~~~~~~4056}{\footnotesize \par}

{\footnotesize{}~~Shadow~shinfo~mfn:~~~~~~295164}{\footnotesize \par}

{\footnotesize{}~~PTLsim~hostcall:~~~~~~~~~~~~~~~~event~channel~~~~3}{\footnotesize \par}

{\footnotesize{}~~PTLsim~upcall:~~~~~~~~~~~~~~~~~~event~channel~~~~4}{\footnotesize \par}



{\footnotesize{}~~Switched~to~native~mode}{\footnotesize \par}
\end{lyxcode}
Back in the Xen console for the domain, you'll see the familiar Linux
boot messages:

\begin{lyxcode}
{\footnotesize Bootdata~ok~(command~line~is~~nousb~noide~root=/dev/hda1~xencons=ttyS~console=ttyS0)}{\footnotesize \par}

{\footnotesize Linux~version~2.6.18-mtyrel-k8-64bit-xen~(yourst@tidalwave)~(gcc~version~4.1.0~(SUSE~Linux))~\#2~Sun~Oct~8~02:29:10~EDT~2006}{\footnotesize \par}

{\footnotesize BIOS-provided~physical~RAM~map:}{\footnotesize \par}

{\footnotesize{}~Xen:~0000000000000000~-~0000000008800000~(usable)}{\footnotesize \par}

{\footnotesize No~mptable~found.}{\footnotesize \par}

{\footnotesize Built~1~zonelists.~~Total~pages:~34816}{\footnotesize \par}

{\footnotesize Kernel~command~line:~~nousb~noide~root=/dev/hda1~xencons=ttyS~console=ttyS0}{\footnotesize \par}

{\footnotesize Initializing~CPU\#0}{\footnotesize \par}

{\footnotesize PID~hash~table~entries:~1024~(order:~10,~8192~bytes)}{\footnotesize \par}

{\footnotesize Xen~reported:~2202.808~MHz~processor.}{\footnotesize \par}

{\footnotesize Console:~colour~dummy~device~80x25}{\footnotesize \par}

{\footnotesize Dentry~cache~hash~table~entries:~32768~(order:~6,~262144~bytes)}{\footnotesize \par}

{\footnotesize Inode-cache~hash~table~entries:~16384~(order:~5,~131072~bytes)}{\footnotesize \par}

{\footnotesize Software~IO~TLB~disabled}{\footnotesize \par}

{\footnotesize Memory:~123180k/139264k~available~(2783k~kernel~code,~7728k~reserved,~959k~data,~184k~init)}{\footnotesize \par}

{\footnotesize Calibrating~delay~using~timer~specific~routine..~4407.14~BogoMIPS~(lpj=2203570)}{\footnotesize \par}

{\footnotesize ...}{\footnotesize \par}

{\footnotesize NET:~Registered~protocol~family~1}{\footnotesize \par}

{\footnotesize NET:~Registered~protocol~family~17}{\footnotesize \par}

{\footnotesize VFS:~Mounted~root~(ext2~filesystem)~readonly.}{\footnotesize \par}



{\footnotesize Welcome~to~the~PTLsim~demo~machine!}{\footnotesize \par}



{\footnotesize root~{[}ptlsim~/]~cat~/proc/cpuinfo}{\footnotesize \par}
\end{lyxcode}
You'll notice how we specified the {}``\texttt{\footnotesize -native}''
option to speed up the boot process by running all code on the real
CPU rather than PTLsim's synthetic CPU model. Booting Linux within
PTLsim is slow since the kernel often executes several billion instructions
before finally presenting a command line.


\section{Running Simulations: PTLctl}

At this point, we would like to start an actual simulation run. For
purposes of illustration, this run is composed of three actions:

\begin{itemize}
\item Simulate 100 million x86 instructions using PTLsim's out of order
superscalar model
\item Simulate another 100 million using PTLsim's sequential model. The
sequential model is much faster than the out of order superscalar
model, so it's useful for testing and debugging functional issues,
as well as simply interacting with the domain. However, it does not
collect any cycle accurate timing data. Section \ref{sec:SequentialMode}
gives more information on the sequential model.
\item Return to native mode
\end{itemize}
In the first example, we will start this run from within the running
domain using \texttt{\footnotesize ptlctl} (PTLsim controller), a
program supplied with PTLsim. PTLctl is actually an example program
showing the use of PTLsim hypercalls ({}``PTL calls''), special
x86 instructions that can be used to control a domain's own simulation.
More information on the PTLcall API is in Section \ref{sec:PTLcallsFullSystem}.

To conduct this simulation, the \texttt{\footnotesize ptlctl}{\footnotesize{}
}command is used \emph{within} the running virtual machine (by typing
it at the domain's console); it is not run on the host system at all:

\begin{lyxcode}
{\footnotesize root~{[}ptlsim~/]~}\textbf{\footnotesize tar~zc~usr~lib~|~tar~ztv~>~/tmp/allfiles.txt~\&}{\footnotesize \par}

{\footnotesize {[}1]~775}{\footnotesize \par}

{\footnotesize root~{[}ptlsim~/]~}\textbf{\footnotesize ptlctl}{\footnotesize{}~}\textbf{\footnotesize -core~ooo~-stopinsns~100m~-run~:~-core~seq~-stopinsns~200m~-run~:~-native}{\footnotesize \par}

{\footnotesize Sending~flush~and~command~list~to~PTLsim~hypervisor:}{\footnotesize \par}

{\footnotesize{}~~-core~ooo~-stopinsns~100m~-run}{\footnotesize \par}

{\footnotesize{}~~-core~seq~-stopinsns~200m~-run}{\footnotesize \par}

{\footnotesize{}~~-native}{\footnotesize \par}

{\footnotesize PTLsim~returned~rc~0}{\footnotesize \par}

{\footnotesize root~{[}ptlsim~/]~}{\footnotesize \par}
\end{lyxcode}
The first command simply runs several CPU-intensive multi-threaded
processes in the background for simulation purposes (in this case,
compressing and uncompressing files in the virtual machine's filesystem).

The second \texttt{\footnotesize ptlctl} command submits the three
simulation actions to PTLsim, separated by colons ({}``\texttt{\textbf{\footnotesize :}}'').

At the PTLsim console, the following output is produced (the cycle
counters will update regularly):

\begin{lyxcode}
{\footnotesize ...}{\footnotesize \par}

{\footnotesize Breakout~request~received~from~native~mode}{\footnotesize \par}

{\footnotesize{}~~Switched~to~simulation~mode}{\footnotesize \par}

{\footnotesize Returned~from~switch~to~native:~now~back~in~sim}{\footnotesize \par}

{\footnotesize Processing~-core~ooo~-stopinsns~100m~-run}{\footnotesize \par}

{\footnotesize{}~~Completed~~~~~~~75258330~cycles,~~~~~~100000000~commits:~~~~461819~cycles/sec,~~~~795201,~insns/sec}{\footnotesize \par}

{\footnotesize Processing~-core~seq~-stopinsns~200m~-run}{\footnotesize \par}

{\footnotesize{}~~Completed~~~~~~200000000~cycles,~~~~~~200000000~commits:~~~6941302~cycles/sec,~~~6941302,~insns/sec}{\footnotesize \par}

{\footnotesize Processing~-native}{\footnotesize \par}

{\footnotesize{}~~Switched~to~native~mode}{\footnotesize \par}
\end{lyxcode}
Notice how the command list is always terminated by a final simulation
action (in this case, \texttt{\footnotesize -native}). If the command
list only had one simulation run with a fixed duration, once that
simulation ended, the domain would freeze, since PTLsim would pause
until another command arrived. However, since the domain is frozen,
the next command would \emph{never} arrive: there is no way to execute
the \texttt{\footnotesize ptlctl} program a second time if the domain
is stopped. To avoid this sort of deadlock, \texttt{\footnotesize ptlctl}{\footnotesize{}
}lets the user atomically submit batches of multiple commands as shown
ahove.

This powerful capability allows {}``self-directed'' simulation scripts
(i.e. standard shell scripts), in which \texttt{\footnotesize ptlctl}{\footnotesize{}
}is run immediately before starting a benchmark program, then \texttt{\footnotesize ptlctl}{\footnotesize{}
}is run again after the program exits to end the simulation and switch
back to native mode.


\section{PTLsim/X Options}

In Section \ref{sec:ConfigurationOptions}, the configuration options
common to both userspace PTLsim and full system PTLsim/X wer listed.
PTLsim/X also introduces a number of special options only applicable
to full system simulation:

Actions:

\begin{itemize}
\item \texttt{\footnotesize -run}{\footnotesize \par}


Start a simulation run, using the core model specified by the \texttt{\footnotesize -core}
option (the default core is {}``\texttt{\footnotesize ooo}'').

\item \texttt{\footnotesize -stop}{\footnotesize \par}


Stop the simulation run currently in progress, and wait for further
commands. This is generally issued from another console window.

\item \texttt{\footnotesize -native}{\footnotesize \par}


Switch the domain to native mode.

\item \texttt{\footnotesize -kill}{\footnotesize \par}


Kill the domain. This is equivalent to {}``\texttt{\footnotesize xm
destroy}'', but it also allows PTLsim to perform cleanup actions
and flush all files before exiting.

\end{itemize}

\section{\label{sec:LiveConfigurationUpdates}Live Updates of Configuration
Options}

PTLsim/X provides the ability to send commands and modify configuration
options in the running simulation from another console on the host
system. This is different from how the \texttt{\footnotesize ptlctl}
program is used inside the target domain to script simulations: in
this case, the commands are submitted asynchronously from the host
system.

For instance, 

\begin{lyxcode}
{\footnotesize sudo~ptlsim~-native~-domain~ptlvm}{\footnotesize \par}
\end{lyxcode}
will immediately switch the target domain back to native mode.

To reset the log level in the middle of a simulation run, use the
following:

\begin{lyxcode}
\textbf{\footnotesize sudo~ptlsim~-domain~ptlvm~-loglevel~99~:~-run}{\footnotesize \par}

{\footnotesize ptlsim:~Sending~request~'-domain~ptlvm~-loglevel~99~:~-run'~to~domain~12...OK}{\footnotesize \par}
\end{lyxcode}
(This is an example only! Using \texttt{\footnotesize -loglevel 99}
will create huge log files).

Most options (such as \texttt{\footnotesize -loglevel}, \texttt{\footnotesize -stoprip},
etc.) can be updated at any time in this manner.

To end a simulation currently in progress, use this:

\begin{lyxcode}
{\footnotesize sudo~ptlsim~-domain~ptlvm~-kill}{\footnotesize \par}
\end{lyxcode}
This will force PTLsim to cleanly exit.


\section{\label{sec:CommandScripts}Command Scripts}

PTLsim supports \emph{command scripts}, in which a file containing
a list of commands is passed on the PTLsim command line as follows:

\begin{lyxcode}
{\footnotesize sudo~./ptlsim~-domain~}\emph{\footnotesize name}{\footnotesize{}~@ptlvm.cmd}{\footnotesize \par}
\end{lyxcode}
where \texttt{\footnotesize ptlvm.cmd} (specified following the {}``\texttt{\footnotesize @}''
operator) contains the example lines:

\begin{lyxcode}
{\footnotesize \#~Configuration~options:}{\footnotesize \par}

{\footnotesize -logfile~ptlsim.log~-loglevel~4~-stats~ptlsim.stats~-snapshot-cycles~10m}{\footnotesize \par}

{\footnotesize \#~Run~the~simulation}{\footnotesize \par}

{\footnotesize -core~seq~-run~-stopinsns~20m}{\footnotesize \par}

{\footnotesize -core~ooo~-run~-stopinsns~100m}{\footnotesize \par}

{\footnotesize -native~~~~~~\#~All~done~(switch~to~native~mode)}{\footnotesize \par}
\end{lyxcode}
These commands are executed by PTLsim one at a time, waiting until
the previous command completes before starting the next. Notice the
use of comments (starting with {}``\texttt{\footnotesize \#}''),
and how configuration options can be spread across lines if desired.
This mode is very useful for specifying breakpoints using \texttt{\footnotesize -stoprip}
and similar options; when the target RIP is reached, the simulation
stops and the next command in the command list is executed.

Command scripts can be nested (i.e. a script can itself include other
scripts using \texttt{\footnotesize @scriptname}). When multiple commands
are given on the command line separated by colons ({}``\texttt{\textbf{\footnotesize :}}''),
any \texttt{\footnotesize @scriptname} clauses are processed after
all other commands on the command line.


\section{Working with Checkpoints}

%
\shadowbox{\begin{minipage}[t][1\totalheight]{1\columnwidth}%
We maintain a tutorial on how to set up checkpoints and perform advanced
checkpointing techniques at \texttt{\footnotesize http://www.ptlsim.org/capswiki/index.php/SPEC\_2006}.
Note that this address is subject to change.%
\end{minipage}}%


Xen provides the ability to capture the state of a domain into a \emph{checkpoint
file} stored on disk. PTLsim can leverage this capability to start
simulation from a checkpoint, avoiding the need to go through the
entire boot process, and allowing precisely reproducable results across
multiple simulation runs.

To create a checkpoint, boot the domain in native mode without PTLsim
running, and bring the domain to the point where you would like to
begin simulation. Then, in another console, run:

\begin{lyxcode}
{\footnotesize sudo~xm~save~ptlvm~/tmp/ptlvm.img}{\footnotesize \par}
\end{lyxcode}
If you're using our sample disk images, this command will pause until
you do the following from \emph{within} the domain:

\begin{lyxcode}
{\footnotesize echo~checkpoint~>~/proc/xen/checkpoint}{\footnotesize \par}
\end{lyxcode}
This facility allows very precise checkpoint placement, even by writing
to this special file from within a benchmark.

To restore the domain to that checkpoint, run:

\begin{lyxcode}
{\footnotesize sudo~xm~restore~/tmp/ptlvm.img~-{}-paused}{\footnotesize \par}

{\footnotesize sudo~xm~list}{\footnotesize \par}

{\footnotesize sudo~xm~console~ptlvm}{\footnotesize \par}
\end{lyxcode}
PTLsim can then be started in the normal manner, by specifying \texttt{\footnotesize -domain
domainname.} If the checkpoint was made while the domain waited for
input (e.g. at a shell command line), you may have to press a few
keys to get any response from its console.

To exit PTLsim, use {}``\texttt{\footnotesize sudo ptlsim -kill -domain
X}'' from another console. To abort PTLsim immediately, use Ctrl+C
on the ptlsim process, then type {}``\texttt{\footnotesize xm kill
ptlvm}'' to destroy the domain.


\section{\label{sec:TimeDilation}The Nature of Time}

Full system simulation poses some difficult philosophical questions
about the nature of time itself and the relativistic phenomenon of
{}``time dilation''. Specifically, if a simulator runs X times slower
than the native CPU, both external interrupts and timer interrupts
should theoretically be generated X times slower than in the real
world. This is critical for obtaining accurate simulation results:
for events like network traffic, if a real network device fed interrupts
into the domain in realtime, and the simulator injected these interrupts
into the simulation at the same rate, they would appear to arrive
thousands of times faster than any physical network interface could
deliver them. This can easily result in a livelock situation not possible
in a real machine; at the very least it will deliver misleading performance
results.

On the other hand, interacting with a domain running at the {}``correct''
rate according to its own simulated clock can be unpleasant for users.
For instance, if the {}``\texttt{\footnotesize sleep 1}'' command
is run in a Linux domain under PTLsim, instead of sleeping for 1 second
of wall clock time (as perceived by the user), the domain will wait
until 1 billion cycles have been fully simulated (assuming the simulated
processor frequency is 1 GHz). This is because PTLsim keys interrupt
delivery and all timers to the simulated cycle number in which the
interrupt should arrive (based on the core clock frequency). In addition
to being annoying, this behavior will massively confuse network applications
that rely on precise timing information: a TCP/IP endpoint outside
the domain will not expect packets to arrive thousands of times slower
than its own realtime clock expects, resulting in retransmissions
and timeouts that would never occur if both endpoints were inside
the same {}``time dilated'' domain.

Rather than attempt to solve this philosophical dilemma, PTLsim allows
users to choose the options that best suit their simulation accuracy
needs. The following options control the notion of time inside the
simulation:

\begin{itemize}
\item \texttt{\footnotesize -corefreq}{\footnotesize{} }\texttt{\emph{\footnotesize Hz}}{\footnotesize \par}


Specify the CPU core frequency (in Hz) reported to the domain. To
specify a 2.4 GHz core, use {}``\texttt{\footnotesize -corefreq 2400m}''.
This option is used to calculate the number of cycles between timer
interrupts, as described below.

\emph{NOTE:} If you plan on switching the domain between simulation
and native mode, we strongly recommend avoiding this option, to allow
the host machine frequency to match the simulated frequency.

\item \texttt{\footnotesize -timerfreq}{\footnotesize{} }\texttt{\emph{\footnotesize Hz}}{\footnotesize \par}


Specify the timer interrupt frequency in interrupts per second. By
default, 100 interrupts per second are used, since this is the standard
for Linux kernels.

\textbf{\emph{Hint:}} if keyboard interaction with the domain seems
slow or sluggish, this is because Linux only flushes console buffers
to the screen at every clock tick. Specifying \texttt{\footnotesize -timerfreq
1000} will greatly improve interactive response at the expense of
more interrupt overhead.

\item \texttt{\footnotesize -pseudo-rtc}{\footnotesize \par}


By default, the realtime clock reported to the domain is the current
time of day. This option forces the clock to reset to whatever time
the domain's checkpoint (if any) was created. This may allow better
cycle accurate reproducibility of random number generators, for instance.

\item \texttt{\footnotesize -realtime}{\footnotesize \par}


PTLsim normally delivers all interrupts at the time dilated rate,
as described above. While this provides the most realistic simulation
accuracy, it may be undesirable for some applications, particularly
in networking. The \texttt{\footnotesize -realtime} option delivers
external interrupts to the domain as soon as they arrive at PTLsim's
interrupt handler; they are not deferred. The realtime clock reported
to the domain is also not dilated; it is locked to the current wall
clock time. This option does not affect the timer interrupt frequency;
use the \texttt{\footnotesize -timerfreq} option to directly manipulate
this.

\item \texttt{\footnotesize -maskints}{\footnotesize \par}


Do not allow \emph{any} external interrupts or events to reach the
domain; only the timer interrupt is delivered at the specified rate
by PTLsim. This mode is necessary to provide guaranteed reproducable
cycle accurate behavior across runs; it eliminates almost all non-deterministic
events (like outside device interrupts) from the simulation. However,
it is not very practical, since disk and network access is impossible
in this mode (since the Xen disk and network drivers could never wake
up the domain when data arrives). This mode is most useful for debugging
starting at a checkpoint, or when using a ramdisk with pre-scripted
boot actions.

\end{itemize}

\section{Other Options}

PTLsim/X has a few additional options related to full system simulation:

\begin{itemize}
\item \texttt{\footnotesize -reservemem}{\footnotesize{} }\texttt{\emph{\footnotesize M}}{\footnotesize \par}


Reserves \emph{M} megabytes of physical memory for PTLsim and its
translation cache. The default is 32 MB; the valid range is from 16
MB to 512 MB. See Chapter \ref{sec:PTLsimXArchitectureDetails} for
details.

\end{itemize}
All other options in Section \ref{sec:ConfigurationOptions} (unless
otherwise noted) are common to both userspace PTLsim and full system
PTLsim/X.


\chapter{\label{sec:PTLsimXArchitectureDetails}PTLsim/X Architecture Details}

The following sections provide insight into the internal architecture
of full system PTLsim/X, and how a simulator is built to run on the
bare hardware. It is not necessary to understand this information
to work with or customize machine models in PTLsim, but it may still
be fascinating to those working with the low level infrastructure
components.


\section{Basic PTLsim/X Components}

PTLsim/X works in a conceptually similar manner to the normal userspace
PTLsim: the simulator is {}``injected'' into the target user process
address space and effectively becomes the CPU executing the process.
PTLsim/X extends this concept, but instead of a process, the core
PTLsim code runs on the bare hardware and accesses the same physical
memory pages owned by the guest domain. Similarly, each VCPU is {}``collapsed''
into a context structure within PTLsim when simulation begins; each
context is then copied back onto the corresponding physical CPU context(s)
when native mode is entered.

PTLsim/X consists of three primary components: the modified Xen hypervisor,
the PTLsim monitor process, and the PTLsim core.


\subsection{Xen Modifications}

The Xen hypervisor requires some modifications to work with PTLsim.
Specifically, several new major hypercalls were added:

\begin{itemize}
\item \texttt{\footnotesize XEN\_DOMCTL\_contextswap}{\footnotesize{} }atomically
swaps all context information in all VCPUs of the target domain, saving
the old context and writing in a new context. In addition to per-VCPU
data (including all registers and page tables), the shared info page
is also swapped. This is done as a hypercall so as to eliminate race
conditions between the hypervisor, PTLsim monitor process in domain
0 and the target domain. The domain is first de-scheduled from all
physical CPUs in the host system, the old context is saved, the new
context is validated and written, and finally the paused domain wakes
up to the new context.
\item \texttt{\footnotesize MMUEXT\_GET\_GDT\_TEMPLATE} gets the x86 global
descriptor table (GDT) page Xen transparently maps into the \texttt{\footnotesize FIRST\_RESERVED\_GDT\_PAGE
gdt\_frames{[}]} slot. PTLsim needs this data to properly resolve
segment references.
\item \texttt{\footnotesize MMUEXT\_QUERY\_PAGES} queries the page type
and reference count of a given guest MFN.
\item \texttt{\footnotesize VCPUOP\_set\_breakout\_insn\_action} tells the
hypervisor about a special \emph{breakout instruction}. This is a
normally undefined x86 instruction that the \texttt{\footnotesize ptlctl}
program (and PTL calls from user code) can use to request services
from PTLsim. The hypervisor uses the x86 invalid opcode trap to intercept
this instruction, and in response it may perform several actions,
including pausing the domain and sending an interrupt to domain 0
for the PTLsim monitor process to receive. This is the mechanism by
which a domain operating in native mode can request a switch back
into simulation mode.
\item \texttt{\footnotesize VCPUOP\_set\_timestamp\_bias}{\footnotesize{}
}is used to virtualize the processor timestamp counter (TSC) read
by the x86 \texttt{\footnotesize rdtsc} instruction. This support
is needed to ensure a seamless transition between simulation mode
and native mode without the target domain noticing any cycles are
missing. Since PTLsim runs much slower than the native CPU, a negative
bias must be applied to the TSC to provide timing continuity when
returning to native mode. The hypervisor will trap \texttt{\footnotesize rdtsc}
instructions and emulate them when a bias is in effect.
\end{itemize}
These changes are provided by \texttt{\footnotesize ptlsim-xen-hypervisor.diff}
as described in the installation instructions.


\subsection{PTLsim Monitor (PTLmon)}

The PTLsim monitor (\emph{ptlmon.cpp}) is a normal Linux program that
runs in domain 0 with root privileges. After connecting to the specified
domain, it increases the domain's memory reservation so as to reserve
a range of physical pages for PTLsim (by default, 32 MB of physical
memory). PTLmon maps all these reserved pages into its own address
space, and loads the real PTLsim core code into these pages. The PTLsim
core is linked separately as \emph{ptlxen.bin}, but is then linked
as a binary object into the final self-contained \emph{ptlsim} executable.
PTLmon then builds page tables to map PTLsim space into the target
domain. Finally, PTLmon fills in various other fields in the boot
info page (including a pointer to the \emph{Context} structures (a
modified version of Xen's \emph{vcpu\_context\_t}) holding the interrupted
guest's state for each of its VCPUs), prepares the initial registers
and page tables to map PTLsim's code, then unmaps all PTLsim reserved
pages except for the first few pages (as shown in Table \ref{table:MemLayout}).
This is required since the monitor process cannot have writable references
to any of PTLsim's pages or PTLsim may not be able to pin those pages
as page table pages. At this point, PTLmon atomically restarts the
domain inside PTLsim using the new \texttt{\footnotesize contextswap}
hypercall. The old context of the domain is thus available for PTLsim
to use and update via simulation.

PTLmon also sets up two event channels: the \emph{hostcall} channel
and the \emph{upcall} channel. PTLsim notifies the monitor process
in domain 0 via the \emph{hostcall} event channel whenever it needs
to access the outside world. Specifically, PTLsim will fill in the
\emph{bootpage.hostreq} structure with parameters to a standard Linux
system calls, and will place any larger buffers in the \emph{transfer
page} (see Table \ref{table:MemLayout}) visible to both PTLmon and
PTLsim itself. PTLsim will then notify the \emph{hostcall} channel's
port. The domain 0 kernel will then forward this notification to PTLmon,
which will do the system call on PTLsim's behalf (while PTLsim remains
blocked in the \texttt{\footnotesize synchronous\_host\_call()} function).
PTLmon will then notify the hostcall port in the opposite direction
(waking up PTLsim) when the system call is complete. This is very
similar to a remote procedure call, but using shared memory. It allows
PTLsim to use standard system calls (e.g. for reading and writing
log files) without modification, yet remains suitable for a bare-metal
embedded environment.

PTLmon can also use the \emph{upcall} channel to interrupt PTLsim,
for instance to switch between native and simulation mode, trigger
a snapshot, or request that PTLsim update its internal parameters.
The PTLmon process sets up a socket in \texttt{\footnotesize /tmp/ptlsim-domain-XXX}
and waits for requests on this socket. The user can then run the \texttt{\footnotesize ptlsim}
command again, which will connect to this socket and tell the main
monitor process for the domain to enqueue a text string (usually the
command line parameters to \texttt{\footnotesize ptlsim}) and send
an interrupt to PTLsim on the \emph{upcall} channel. In response,
PTLsim uses the \texttt{\footnotesize ACCEPT\_UPCALL} hostcall to
read the enqueued command line, then parses it and acts on any listed
actions or parameter updates.

It should be noted that this design allows live configuration updates,
as described in Section \ref{sec:LiveConfigurationUpdates}.


\section{PTLsim Core}

PTLsim runs directly on the {}``bare metal'' and has no access to
traditional OS services except through the DMA and interrupt based
host call requests described above. Execution begins in \emph{ptlsim\_init()}
in \emph{ptlxen.cpp}. PTLsim first sets up its internal memory management
(page pool, slab allocator, extent allocator in \emph{mm.cpp} as described
in Section \ref{sec:MemoryManager}) using the initial page tables
created by PTLmon in conjunction with the modified Xen hypervisor.
PTLsim owns the virtual address space range starting at \texttt{\footnotesize 0xffffff0000000000}
(i.e. x86-64 PML4 slot 510, of $2^{39}$ bytes). This memory is mapped
to the physical pages reserved for PTLsim. The layout is shown in
Table \ref{table:MemLayout} (assuming 32 MB is allocated for PTLsim):

\begin{center}
%
\begin{table}
\caption{\label{table:MemLayout}Memory Layout for PTLsim Space}


\begin{tabular}{|r|r|l|}
\hline 
Page & Size & Description\tabularnewline
\hline
\hline 
\multicolumn{3}{|c|}{(Pages below this point are shared by PTLmon in domain 0 and PTLsim
in the target domain)}\tabularnewline
\hline 
0 & 4K & Boot info page and ptlxen.bin ELF header (see \emph{xc\_ptlsim.h}
and \emph{ptlxen.h} for the structures)\tabularnewline
\hline 
1 & 4K & Hypercall entry points (filled in by Xen)\tabularnewline
\hline 
2 & 4K & Shared info page for the domain\tabularnewline
\hline 
3 & 4K & Shadow shared info page (as seen by guest)\tabularnewline
\hline 
4 & 4K & Transfer page (for DMA between PTLmon in dom0 and target)\tabularnewline
\hline 
5 & 128K & 32 VCPU Context structure pages\tabularnewline
\hline
\hline 
\multicolumn{3}{|c|}{(Pages below this point are private to PTLsim in the target domain)}\tabularnewline
\hline 
37 & \textasciitilde{}2M & PTLsim binary\tabularnewline
\hline 
- & \textasciitilde{}28M & PTLsim heap (page pool, slab allocator, extent allocator)\tabularnewline
\hline 
- & \textasciitilde{}256K & PTLsim stack\tabularnewline
\hline 
... & \textasciitilde{}64K & Page tables mapping 32 MB PTLsim space\tabularnewline
\hline 
... & \textasciitilde{}1MB & Page tables (level 1) mapping all physical pages (reserved but not
filled in)\tabularnewline
\hline 
(32MB) & \textasciitilde{}64K & Higher level page tables (levels 4/3/2) pointing to other tables\tabularnewline
\hline
\end{tabular}
\end{table}

\par\end{center}

Starting at virtual address \texttt{\footnotesize 0xfffffe0000000000}
(i.e. x86-64 PML4 slot 508, of $2^{40}$ bytes), space is reserved
to map all physical memory pages (MFNs) belonging to the guest domain.
This mapping is sparse, since only a subset of the physical pages
are accessible by the guest. When PTLsim is first injected into a
domain, this space starts out empty. As various parts of PTLsim attempt
to access physical addresses, PTLsim's internal page fault handler
will map physical pages into this space. Normally all pages are mapped
as writable, however Xen may not allow writable mappings to some types
of pinned pages (L1/L2/L3/L4 page table pages, GDT pages, etc.). Therefore,
if the writable mapping fails, PTLsim tries to map the page as read
only. PTLsim monitors memory management related hypercalls as they
are simulated and remaps physical pages as read-only or writable if
and when they are pinned or unpinned, respectively. When PTLsim switches
back to native mode, it quickly unmaps all guest pages, since we cannot
hold writable references to any pages the guest kernel may later attempt
to pin as page table pages. This unmapping is done very quickly by
simply clearing all present bits in the physical map's L2 page table
page; the PTLsim page fault handler will re-establish the L2 entries
as needed.


\section{Implementation Details}


\subsection{\label{sub:FullSystemPageTranslation}Page Translation}

The Xen-x86 architecture always has paging enabled, so PTLsim uses
a simulated TLB for all virtual-to-physical translations. Each TLB
entry has x86 accessed and dirty bits; whenever these bits transition
from 0 to 1, PTLsim must walk the page table tree and actually update
the corresponding PTE's accessed and/or dirty bit. Since page table
pages are mapped read-only, our modified \emph{update\_mmu} hypercall
is used to do this. TLB misses are serviced in the normal x86 way:
the page tables are walked starting from the MFN in CR3 until the
page is resolved. This is done by the \texttt{\footnotesize Context.virt\_to\_pte()}
method, which returns the L1 page table entry (PTE) providing the
physical address and accumulated permissions (x86 has specific rules
for deriving the effective writable/executable/supervisor permissions
for each page). Internally, the \texttt{\footnotesize page\_table\_walk()}
function actually follows the page table tree, but PTLsim maintains
a small 16-entry direct mapped cache (like a TLB) to accelerate repeated
translations (this is not related to any true TLB maintained by specific
cores). The \texttt{\footnotesize pte\_to\_ptl\_virt()} function then
translates the PTE and original virtual address into a pointer PTLsim
can actually access (inside PTLsim's mapping of the domain's physical
memory pages). The software TLB is also flushed under the normal x86
conditions (\texttt{\footnotesize MOV CR3}, \texttt{\footnotesize WBINVD},
\texttt{\footnotesize INVLPG}, and Xen hypercalls like \texttt{\footnotesize MMUEXT\_NEW\_BASE\_PTR}).
Presently TLB support is in \texttt{\footnotesize dcache.cpp}; the
features above are incorporated into this TLB. In addition, \texttt{\footnotesize Context.copy\_from\_user()}
and \texttt{\footnotesize Context.copy\_to\_user()} functions are
provided to walk the page tables and copy user data to or from a buffer
inside PTLsim.

In 32-bit versions of Xen, the x86 protection ring mechanism is used
to allow the guest kernel to run at ring 1 while guest userspace runs
in ring 3; this allows the {}``supervisor'' bit in PTEs to retain
its traditional meaning. However, in its effort to clean up legacy
ISA features, x86-64 has no concept of privilege rings (other than
user/supervisor) or segmentation. This means the supervisor bit in
PTEs is never set (only Xen internal pages not accessible to guest
domains have this bit set). Instead, Xen puts the kernel in a separate
address space from user mode; the top-level L4 page table page for
kernel mode points to both kernel-only and user pages. Fortunately,
Xen uses TLB global bits and other x86-64 features to avoid much of
the context switch overhead from this approach. PTLsim does not have
to worry about this detail during virtual-to-PTE translations: it
just follows the currently active page table based on physical addresses
only.


\subsection{Exceptions}

Under Xen, the \texttt{\footnotesize set\_trap\_table()} hypercall
is used to specify an array of pointers to exception handlers; this
is equivalent to the x86 LIDT (load interrupt descriptor table) instruction.
Whenever we switch from native mode to simulation mode, PTLmon copies
this array back into the \texttt{\footnotesize Context.idt{[}]} array.
Whenever PTLsim detects an exception during simulation, it accesses
\texttt{\footnotesize Context.idt{[}vector\_id]} to determine where
the pipeline should be restarted (CS:RIP). In the case of page faults,
the simulated CR2 is loaded with the faulting virtual address. It
then constructs a stack frame equivalent to Xen's structure (i.e.
\texttt{\footnotesize iret\_context}) at the stack segment and pointer
stored in \texttt{\footnotesize Context.kernel\_sp} (previously set
by the \texttt{\footnotesize stack\_switch()} hypercall, which replaces
the legacy x86 TSS structure update). Finally, PTLsim propagates the
page fault to the selected guest handler by redirecting the pipeline.
This is essentially the same work performed within Xen by the \texttt{\footnotesize create\_bounce\_frame()}
function, \texttt{\footnotesize do\_page\_fault()} (or its equivalent)
and \texttt{\footnotesize propagate\_page\_fault()} (or its equivalent);
all the same boundary conditions must be handled.


\subsection{System Calls and Hypercalls}

On 64-bit x86-64, the \texttt{\footnotesize syscall} instruction has
a different meaning depending on the context in which it is executed.
If executed from userspace, \emph{syscall} arranges for execution
to proceed directly to the guest kernel system call handler (in \texttt{\footnotesize Context.syscall\_rip}).
This is done by the \texttt{\footnotesize assist\_syscall()} microcode
handler. A similar process occurs when a 32-bit application uses {}``\texttt{\footnotesize int
0x80}'' to make system calls, but in this case, \texttt{\footnotesize Context.propagate\_x86\_exception()}
is used to redirect execution to the trap handler registered for that
virtual software interrupt.

If executed from kernel space, the \texttt{\footnotesize syscall}
instruction is interpreted as a hypercall into Xen itself. PTLsim
emulates all Xen hypercalls. In many simple cases, PTLsim handles
the hypercall all by itself, for instance by simply updating its internal
tables. In other cases, the hypercall can safely be passed down to
Xen without corrupting PTLsim's internal state. We must be very careful
as to which hypercalls are passed through: for instance, before updating
the page table base, we must ensure the new page table still maps
PTLsim and the physical address space before we allow Xen to update
the hardware page table base. These cases are all documented in the
comments of \texttt{\footnotesize handle\_xen\_hypercall()}\emph{.}

Note that the definition of {}``user mode'' and {}``kernel mode''
is maintained by Xen itself: from the CPU's viewpoint, both modes
are technically userspace and run in ring 3.

An interesting issue arises when PTLsim passes hypercalls through
to Xen: some buffers provided by the guest kernel may reside in virtual
memory not mapped by PTLsim. Normally PTLsim avoids this problem by
copying any guest buffers into its own address space using \texttt{\footnotesize Context.copy\_from\_user()},
then copying the results back after the hypercall. However, to avoid
future complexity, PTLsim currently switches its own page tables every
time the guest requests a page table switch, such that Xen can see
all guest kernel virtual memory as well as PTLsim itself. Obviously
this means PTLsim injects its two top-level page table slots into
every guest top level page table. For multi-processor simulation,
PTLsim needs to swap in the target VCPU's page table base whenever
it forwards a hypercall that depends on virtual addresses.


\subsection{Event Channels}

Xen delivers outside events, virtual interrupts, IPIs, etc. to the
domain just like normal, except they are redirected to a special PTLsim
upcall handler stub (in \texttt{\footnotesize lowlevel-64bit-xen.S}).
The handler checks which events are pending, and if any events (other
than the PTLsim hostcall and upcall events) are pending, it sets a
flag so the guest's event handler is invoked the next time through
the main loop. This process is equivalent to exception handling in
terms of the stack frame setup and call/return sequence: the simulated
pipeline is simply redirected to the handler address. It should be
noted that the PTLsim handler does not set or clear any mask bits
in the shared info page, since it's the (emulated) guest OS code that
should actually be doing this, not PTLsim. The only exception is when
the event in question is on the hostcall port or the upcall port;
then PTLsim handles the event itself and never notifies the guest.


\subsection{Privileged Instruction Emulation}

Xen lets the guest kernel execute various privileged instructions,
which it then traps and emulates with internal hypercalls. These are
the same as in Xen's arch/x86/traps.c: CLTS (FPU task switches), MOV
from CR0-CR4 (easy to emulate), MOV to and from DR0-DR7 (get or set
debug registers), RDMSR and WRMSR (mainly to set segment bases). PTLsim
decodes and executes these instructions on its own, just like any
other x86 instruction.


\section{\label{sec:PTLcallsFullSystem}PTLcalls}

PTLsim defines the special x86 opcode \texttt{\footnotesize 0x0f37}
as a breakout opcode. It is undefined in the normal x86 instruction
set, but when executed by any code running under PTLsim, it can be
used to enqueue commands for PTLsim to execute.

The \texttt{\footnotesize ptlctl} program uses this facility to switch
from native mode to simulation mode as follows. Whenever PTLsim is
about to switch back to native mode, it uses the \texttt{\footnotesize VCPUOP\_set\_breakout\_insn\_action}
to specify the opcode bytes to intercept. When the hypervisor sees
an invalid instruction matching \texttt{\footnotesize 0x0f37}, it
freezes the domain and sends an event channel notification to domain
0. This event channel is read by PTLmon, which then uses the \texttt{\footnotesize contextswap}
hypercall to switch back into PTLsim inside the domain. PTLsim then
processes whatever command caused the switch back into simulation
mode.

While executing \emph{within} simulation mode, this is not necessary:
since PTLsim is in complete control of the execution of each x86 instruction,
it simply defines microcode to handle \texttt{\footnotesize 0x0f37}
instead of triggering an invalid opcode exception. This microcode
branches into PTLsim, which uses the \texttt{\footnotesize PTLSIM\_HOST\_INJECT\_UPCALL}
hostcall to add the command(s) to the command queue. The queue is
maintained inside PTLmon so as to ensure synchronization between commands
coming from the host and commands from within the domain arriving
via PTLcalls. The queue is typically flushed before adding new commands
in this manner: otherwise, it would be impossible to get immediate
results using \texttt{\footnotesize ptlctl}.

All PTL calls are defined in \texttt{\footnotesize ptlcalls.h}, which
simply collects the call's arguments and executes opcode \texttt{\footnotesize 0x0f37}
as if it were a normal x86 \texttt{\footnotesize syscall} instruction:

\begin{itemize}
\item \texttt{\footnotesize ptlcall\_multi\_enqueue(const char{*} list{[}],
size\_t length)} enqueues a list of commands to process in sequence
\item \texttt{\footnotesize ptlcall\_multi\_flush(const char{*} list{[}],
size\_t length)} flushes the queue before adding the commands
\item \texttt{\footnotesize ptlcall\_single\_enqueue(const char{*} command)}
adds one command to the end of the queue
\item \texttt{\footnotesize ptlcall\_single\_flush(const char{*} command)}
immediately flushes the queue and processes the specified command
\item \texttt{\footnotesize ptlcall\_nop()} is a simple no-operation command
used to get PTLsim's attention
\item \texttt{\footnotesize ptlcall\_version()} returns version information
about the running PTLsim hypervisor.
\end{itemize}
The \texttt{\footnotesize ptlcall} opcode \texttt{\footnotesize 0x0f37}
can be executed from both user mode and kernel mode, since it may
be desirable to switch simulation options from a userspace program.
This would be impossible if \texttt{\footnotesize wrmsr}{\footnotesize{}
}(the traditional method) were used to effect PTLsim operations.


\section{Event Trace Mode}

In Section \ref{sec:TimeDilation}, we discussed the philosophical
question of how to accurately model the timing of external events
when cycle accurate simulation runs thousands of times slower than
the outside world expects. To solve this problem, PTLsim/X offers
\emph{event trace} mode.

First, the user saves a checkpoint of the target domain, then instructs
PTLsim to enter \emph{event record} mode. The domain is then used
interactively in native mode at full speed, for instance by starting
benchmarks and waiting for their completion. In the background, PTLsim
taps Xen's trace buffers to write any events delivered to the domain
into an event trace file. {}``Events'' refer to any time-dependent
outside stimulus delivered to the domain, such as interrupts (i.e.
Xen event channel notifications) and DMA traffic (i.e. the full contents
of any grant pages from network or disk I/O transferred into the domain).
Each event is timestamped with the relative cycle number (timestamp
counter) in which it was delivered, rather than the wall clock time.
When the benchmarks are done, the trace mode is terminated and recording
stops.

The user then restores the domain from the checkpoint and re-injects
PTLsim, but this time PTLsim reads the event trace file, rather than
responding to any outside events Xen may deliver to the domain while
in simulation mode. Whenever the timestamp of the event at the head
of the trace file matches the current simulation cycle, that event
is injected into the domain. PTLsim does this by setting the appropriate
pending bits in the shared info page, and then simulates an upcall
to the domain's shared info handler (i.e. by restarting the simulated
pipeline at that RIP). Since the event channels used by PTLsim and
those of the target domain may interfere, PTLsim maintains a shadow
shared info page that's updated instead; whenever the simulated load/store
pipeline accesses the real shared info page's physical address, the
shadow page is used in its place. In addition, the wall clock time
fields in the shadow shared info page are regularly updated by dividing
the simulated cycle number by the native CPU clock frequency active
during the record mode (since the guest OS will have recorded this
internally in many places).

This scheme does require some extra software support, since we need
to be able to identify which pages the outside source has overwritten
with incoming data (i.e. as in a virtual DMA). The console I/O page
is actually a guest page that domain 0 maps in \emph{xenconsoled};
this is easy to identify and capture. The network and block device
pages are typically grant pages; the domain 0 Linux device drivers
must be modified to let PTLsim know what pages will be overwritten
by outside sources.


\section{Multiprocessor Support}

PTLsim/X is designed from the ground up to support multiple VCPUs
per domain. The \texttt{\footnotesize contextof(vcpuid)} function
returns the Context structure allocated for each VCPU; this structure
is passed to all functions and assists dealing with the domain. It
is the responsibility of each core (e.g. sequential core, out of order
core, user-designed cores, etc.) to update the appropriate context
structure according to its own design.

VCPUs may choose to block by executing an appropriate hypercall (\texttt{\footnotesize sched\_block},
\texttt{\footnotesize sched\_yield}, etc.), typically suspending execution
until an event arrives. PTLsim cores can simulate this by checking
the \texttt{\footnotesize Context.running} field; if zero, the corresponding
VCPU is blocked and no instructions should be processed until the
\texttt{\footnotesize running} flag becomes set, such as when an event
arrives. In realtime mode (where Xen relays real events like timer
interrupts back to the simulated CPU), events and upcalls may be delivered
to other VCPUs than the first VCPU which runs PTLsim; in this case,
PTLsim must check the pending bitmap in the shared info page and simulate
upcalls within the appropriate VCPU context (i.e. whichever VCPU context
has its \texttt{\footnotesize upcall\_pending} bit set).

Some Xen hypercalls must only be executed on the VCPU to which the
hypercall applies. In cases where PTLsim cannot emulate the hypercall
on its own internal state (and defer the actual hypercall until switching
back to native mode), the Xen hypervisor has been modified to support
an explicit \emph{vcpu} parameter, allowing the first VCPU (which
always runs PTLsim itself) to execute the required action on behalf
of other VCPUs.

For simultaneous multithreading support, PTLsim is designed to run
the simulation entirely on the first VCPU, while putting the other
VCPUs in an idle loop. This is required because there's no easy way
to parallelize an SMT core model across multiple simulation threads.
In theory, a multi-core simulator could in fact be parallelized in
this way, but it would be very difficult to reproduce cycle accurate
behavior and debug deadlocks with asynchronous simulations running
in different threads. For these reasons, currently PTLsim itself is
single threaded in simulation mode, even though it simulates multiple
virtual cores or threads.

Cache coherence is the responsibility of each core model. By default,
PTLsim uses the {}``instant visibility'' model, in which all VCPUs
can have read/write copies of cache lines and all stores appear on
all other VCPUs the instant they commit. More complex MOESI-compliant
policies can be implemented on top of this basic framework, by stalling
simulated VCPUs until cache lines travel across an interconnect network.


\part{\label{part:OutOfOrderModel}Out of Order Processor Model}


\chapter{\label{sec:OutOfOrderFeatures}Introduction}


\section{Out Of Order Core Features}

PTLsim completely models a modern out of order x86-64 compatible processor,
cache hierarchy and key devices with true cycle accurate simulation.
The basic microarchitecture of this model is a combination of design
features from the Intel Pentium 4, AMD K8 and Intel Core 2, but incorporates
some ideas from IBM Power4/Power5 and Alpha EV8. The following is
a summary of the characteristics of this processor model:

\begin{itemize}
\item The simulator directly fetches pre-decoded micro-operations (Section
\ref{sec:FetchStage}) but can simulate cache accesses as if x86 instructions
were being decoded on fetch
\item Branch prediction is configurable; PTLsim currently includes various
models including a hybrid g-share based predictor, bimodal predictors,
saturating counters, etc.
\item Register renaming takes into account x86 quirks such as flags renaming
(Section \ref{sub:FlagsManagement})
\item Front end pipeline has configurable number of cycles to simulate x86
decoding or other tasks; this is used for adjusting the branch mispredict
penalty
\item Unified physical and architectural register file maps both in-flight
uops as well as committed architectural register values. Two rename
tables (speculative and committed register rename tables) are used
to track which physical registers are currently mapped to architectural
registers.
\item Unified physical register file for both integer and floating point
values.
\item Operands are read from the physical register file immediately before
issue. Unlike in some microprocessors, PTLsim does not do speculative
scheduling: the schedule and register read loop is assumed to take
one cycle.
\item Issue queues based on a collapsing design use broadcast based matching
to wake up instructions.
\item Clustered microarchitecture is highly configurable, allowing multi-cycle
latencies between clusters and multiple issue queues within the same
logical cluster.
\item Functional units, mapping of functional units to clusters, issue ports
and issue queues and uop latencies are all configurable.
\item Speculation recovery from branch mispredictions and load/store aliasing
uses the forward walk method to recover the rename tables, then annuls
all uops after and optionally including the mis-speculated uop.
\item Replay of loads and stores after store to load forwarding and store
to store merging dependencies are discovered.
\item Stores may issue even before data to store is known; the store uop
is replayed when all operands arrive.
\item Load and store queues use partial chunk address matching and store
merging for high performance and easy circuit implementation.
\item Prediction of load/store aliasing to avoid mis-speculation recovery
overhead.
\item Prediction and splitting of unaligned loads and stores to avoid mis-speculation
overhead
\item Commit unit supports stalling until all uops in an x86 instruction
are complete, to make x86 instruction commitment atomic
\end{itemize}
The PTLsim model is fully configurable in terms of the sizes of key
structures, pipeline widths, latency and bandwidth and numerous other
features.


\section{Processor Contexts}

PTLsim uses the concept of a \emph{VCPU} (virtual CPU) to represent
one user-visible microprocessor core (or a hardware thread if a SMT
machine is being modeled). The \texttt{\footnotesize Context} structure
(defined in \texttt{\footnotesize ptlhwdef.h}) maintains all per-VCPU
state in PTLsim: this includes both user-visible architectural registers
(in the \texttt{\footnotesize Context.commitarf{[}]} array) as well
as all per-core control registers and internal state information.
\texttt{\footnotesize Context} only contains general x86-visible context
information; specific machine models must maintain microarchitectural
state (like physical registers and so forth) in their own internal
structures.

The \texttt{\footnotesize contextof(}\texttt{\emph{\footnotesize N}}\texttt{\footnotesize )}
macro is used to return the \texttt{\footnotesize Context} object
for a specific VCPU, numbered 0 to \texttt{\footnotesize contextcount}-1.
In userspace-only PTLsim, there is only one context, \texttt{\footnotesize contextof(0)}.
In full system PTLsim/X, there may be up to 32 (i.e. \texttt{\footnotesize MAX\_CONTEXTS})
separate contexts (VCPUs).


\section{\label{sec:MachineCoreClassHierarchy}PTLsim Machine/Core/Thread
Class Hierarchy}

PTLsim easily supports user defined plug-in machine models. Two of
these models, the out of order core ({}``\texttt{\footnotesize ooo}'')
and the sequential in-order core ({}``\texttt{\footnotesize seq}'')
ship with PTLsim; others can be easily added by users. PTLsim implements
several C++ classes used to build simulation models by dividing a
virtual machine into CPU sockets, cores and threads.

The \texttt{\footnotesize PTLsimMachine} class is at the root of the
hierarchy. Every simulation model must subclass \texttt{\footnotesize PTLsimMachine}
and define its virtual methods. Adding a machine model to PTLsim is
very simple: simply define one instance of your machine class in a
source file included in the Makefile. For instance, assuming \texttt{\footnotesize XYZMachine}
subclasses \texttt{\footnotesize PTLsimMachine} and will be called
{}``xyz'':

\begin{lyxcode}
{\footnotesize XyzMachine~xyzmodel({}``xyz'');}{\footnotesize \par}
\end{lyxcode}
The constructor for \texttt{\footnotesize XyzMachine} will be called
by PTLsim after all other subsystems are brought up. It should use
the \texttt{\footnotesize addmachine({}``}\texttt{\emph{\footnotesize name}}\texttt{\footnotesize '')}
static method to register the core model's name with PTLsim, so it
can be specified using the {}``\texttt{\footnotesize -core}{\footnotesize{}
}\texttt{\emph{\footnotesize xyz}}'' option.

The machine models included with PTLsim (namely, \texttt{\footnotesize OutOfOrderMachine}
and \texttt{\footnotesize SequentialMachine}) have been placed in
their own C++ namespace. When adding your own core, copy the example
source file(s) to new names and adjust the namespace specifiers to
a new name to avoid clashes. You should be able to link any number
of machine models defined in this manner into PTLsim all at once.

The \texttt{\footnotesize PTLsimMachine::init()} method is called
to initialize each machine model the first time it is used. This function
is responsible for dividing the \texttt{\emph{\footnotesize contextcount}}
contexts up into sockets, cores and threads, depending entirely on
the machine model's design and any configuration options specified
by the \texttt{\footnotesize config} parameter.

\texttt{\footnotesize PTLsimMachine::run()} is called to actually
run the simulation; more details will be given on this later.

\texttt{\footnotesize PTLsimMachine::update\_stats()} is described
in Section \ref{sec:StatisticsInfrastructure}.

\texttt{\footnotesize PTLsimMachine::dump\_state()} is called to aid
debugging whenever an assertion fails, the simulator accesses a null
pointer or invalid address, or from anywhere else it may be useful.


\chapter{Out Of Order Core Overview}

The out of order core is spread across several source files:

\begin{itemize}
\item \texttt{\footnotesize ooocore.cpp} contains control logic, the definition
of the \texttt{\footnotesize OutOfOrderMachine} class and its functions
(see Section \ref{sec:MachineCoreClassHierarchy}), the top-level
pipeline control functions, all event printing logic (Section \ref{sec:OutOfOrderCoreEventLogRingBuffer})
and miscellaneous code.
\item \texttt{\footnotesize ooopipe.cpp} contains all pipeline stages, except
for execution stages and functional units.
\item \texttt{\footnotesize oooexec.cpp} contains the functional units,
load/store unit, issue queues, replay control and exception handling.
\item \texttt{\footnotesize ooocore.h} defines all structures and lists
easy to configure parameters.
\end{itemize}
The \texttt{\footnotesize OutOfOrderMachine} structure is divided
into an array of one or more \texttt{\footnotesize OutOfOrderCore}
structures (by default, one per VCPU). The \texttt{\footnotesize OutOfOrderMachine::init()}
function creates \texttt{\emph{\footnotesize contextcount}} cores
and binds one per-VCPU \texttt{\footnotesize Context} structure to
each core. The \texttt{\footnotesize init()} function is declared
in \texttt{\footnotesize ooocore.h}, since some user configurable
state is set up at this point.

The \texttt{\footnotesize OutOfOrderMachine::run()} function first
flushes the pipeline in each core, using \texttt{\footnotesize core.flush\_pipeline()}
to copy state from the corresponding \texttt{\footnotesize Context}
structure into the physical register file and other per-core structures
(see Section \ref{sec:PipelineFlushesAndBarriers} for details).

The \texttt{\footnotesize run()} function then enters a loop with
one iteration per simulated cycle:

\begin{itemize}
\item \texttt{\footnotesize update\_progress()} prints the current performance
information (cycles, committed instructions and simulated cycles/second)
to the console and/or log file.
\item \texttt{\footnotesize inject\_events()} injects any pending interrupts
and outside events into the processor; these will be processed at
the next x86 instruction boundary. This function only applies to full
system PTLsim/X.
\item The \texttt{\footnotesize OutOfOrderCore::runcycle()} function is
called for each core in sequence, to step its entire state machine
forward by one cycle (see below for details). If a given core is blocked
(i.e. paused while waiting for some outside event), its Context.running
field is zero; in this case, the core's \texttt{\footnotesize handle\_interrupt()}
method may be called to wake it up (see below).
\item Any global structures (like memory controllers or interconnect networks)
are clocked by one cycle using their respective \texttt{\footnotesize clock()}
methods.
\item \texttt{\footnotesize check\_for\_async\_sim\_break()} checks if the
user has requested the simulation stop or switch back to native mode.
This function only applies to full system PTLsim/X.
\item The global cycle counter and other counters are incremented.
\end{itemize}
The \texttt{\footnotesize OutOfOrderCore::runcycle()} function is
where the majority of the work in PTLsim's out of order model occurs.
This function, in ooocore.cpp, runs one cycle in the core by calling
functions to implement each pipeline stage, the per-core data caches
and other clockable structure. If the core's commit stage just encountered
a special event (e.g. barrier, microcode assist request, exception,
interrupt, etc.), the appropriate action is taken at the cycle boundary.

In the following chapters, we describe every pipeline stage and structure
in detail.

Every structure in the out of order model can obtain a reference to
its parent \texttt{\footnotesize OutOfOrderCore} structure by calling
its own \texttt{\footnotesize getcore()}{\footnotesize{} }method. Similarly,
\texttt{\footnotesize getcore().ctx} returns a reference to the \texttt{\footnotesize Context}{\footnotesize{}
}structure for that core.


\section{\label{sec:OutOfOrderCoreEventLogRingBuffer}Event Log Ring Buffer}

Section \ref{sec:EventLogRingBuffer} describes PTLsim's event log
ring buffer system, in which the simulator can log all per-cycle events
to a circular ring buffer when the \texttt{\footnotesize -ringbuf}
option is given. The ring buffer can help developers look backwards
in time from when an undesirable event occurs (for instance, as specified
by \texttt{\footnotesize -ringbuf-trigger-rip}), allowing much easier
debugging and experimentation.

In the out of order core, the \texttt{\footnotesize EventLog} structure
provides this ring buffer. The buffer consists of an array of \texttt{\footnotesize OutOfOrderCoreEvent}
structures (in \texttt{\footnotesize ooocore.h}); each structure contains
a fixed header with subject information common to all events (e.g.
the cycle, uuid, RIP, uop, ROB slot, and so forth), plus a union with
sub-structures for each possible event type. The actual events are
listed in an enum above this structure.

The \texttt{\footnotesize EventLog} class has various functions for
quickly adding certain types of events and filling in their special
fields. Specifically, calling one of the \texttt{\footnotesize EventLog::add()}
functions allocates a new record in the ring buffer and returns a
pointer to it, allowing additional event-specific fields to be filled
in if needed. The usage of these functions is very straightforward
and documented by example in the various out of order core source
files.

In \texttt{\footnotesize ooocore.cpp}, the \texttt{\footnotesize OutOfOrderCoreEvent::print()}
method lists all event types and gives code to nicely format the recorded
event data. The \texttt{\footnotesize eventlog.print()} function prints
every event in the ring buffer; this function can be called from anywhere
an event backtrace is needed.


\chapter{Fetch Stage}


\section{\label{sec:FetchStage}Instruction Fetching and the Basic Block Cache}

As described in Section \ref{sec:UopIntro}, x86 instructions are
decoded into transops prior to actual execution by the out of order
core. Some processors do this translation as x86 instructions are
fetched from an L1 instruction cache, while others use a trace cache
to store pre-decoded uops. PTLsim takes a middle ground to allow maximum
simulation flexibility. Specifically, the Fetch stage accesses the
L1 instruction cache and stalls on cache misses as if it were fetching
several variable length x86 instructions per cycle. However, actually
decoding x86 instructions into uops over and over again during simulation
would be extraordinarily slow.

Therefore, for \emph{simulation purposes only}, the out of order model
uses the PTLsim \emph{basic block cache}. The basic block cache, described
in Chapter \ref{sec:BasicBlockCache}, stores pre-decoded uops for
each basic block, and is indexed using the \texttt{\footnotesize RIPVirtPhys}
structure, consisting of the RIP virtual address, several context-dependent
flags and the physical page(s) spanned by the basic block (in PTLsim/X
only).

During the fetch process (implemented in the \texttt{\footnotesize OutOfOrderCore::fetch()}
function in \texttt{\footnotesize ooopipe.cpp}), PTLsim looks up the
current RIP to fetch from (\texttt{\footnotesize fetchrip}), uses
the current context to construct a full \texttt{\footnotesize RIPVirtPhys}
key, then uses this key to query the basic block cache. If the basic
block has never been decoded before, \texttt{\footnotesize bbcache.translate()}
is used to do this now. This is all done by the \texttt{\footnotesize fetch\_or\_translate\_basic\_block()}
function.

Technically speaking, the cached basic blocks contain \emph{transops},
rather than uops: as explained in Section \ref{sec:UopIntro}, each
transop gets transformed into a true uop after it is renamed in the
rename stage. In the following discussion, the term uop is used interchangeably
with transop.


\section{Fetch Queue}

Each transop fetched into the pipeline is immediately assigned a monotonically
increasing \emph{uuid} (universally unique identifier) to uniquely
track it for debugging and statistical purposes. The fetch unit attaches
additional information to each transop (such as the uop's uuid and
the \texttt{\footnotesize RIPVirtPhys} of the corresponding x86 instruction)
to form a \texttt{\footnotesize FetchBufferEntry} structure. This
fetch buffer is then placed into the fetch queue (\texttt{\footnotesize fetchq})
assuming it isn't full (if it is, the fetch stage stalls). As the
fetch unit encounters transops with their EOM (end of macro-op) bit
set, the fetch RIP is advanced to the next x86 instruction according
to the instruction length stored in the SOM transop.

Branch uops trigger the branch prediction mechanism (Section \ref{sec:BranchPrediction})
used to select the next fetch RIP. Based on various information encoded
in the branch transop and the next RIP \emph{after} the x86 instruction
containing the branch, the \texttt{\footnotesize branchpred.predict()}
function is used to redirect fetching. If the branch is predicted
not taken, the sense of the branch's condition code is inverted and
the transop's \texttt{\footnotesize riptaken} and \texttt{\footnotesize ripseq}
fields are swapped; this ensures all branches are considered correct
only if taken. Indirect branches (jumps) have their \texttt{\footnotesize riptaken}
field overwritten by the predicted target address.

PTLsim models the instruction cache by using the \texttt{\footnotesize caches.probe\_icache()}
function to probe the cache with the physical address of the current
\emph{fetch window}. Most modern x86 processors fetch aligned 16-byte
or 32-byte blocks of bytes into the decoder and try to pick out 3
or 4 x86 instructions per cycle. Since PTLsim uses the basic block
cache, it does not actually decode anything at this point, but it
still attempts to pick out up to 4 uops (or whatever limit is specified
in \texttt{\footnotesize ooocore.h}) within the current 16-byte window
around the fetch RIP; switching to a new window must occur in the
next cycle. The instruction cache is only probed when switching fetch
windows.

If the instruction cache indicates a miss, or the ITLB misses, the
\texttt{\footnotesize waiting\_for\_icache\_fill} variable is set,
and the fetch unit remains stalled in subsequent cycles until the
cache subsystem calls the \texttt{\footnotesize OutOfOrderCoreCacheCallbacks::icache\_wakeup()}
callback registered by the core. The core's interactions with the
cache subsystem will be described in great detail later on.


\chapter{Frontend and Key Structures}


\section{Resource Allocation}

During the Allocate stage, PTLsim dequeues uops from the fetch queue,
ensures all resources needed by those uops are free, and assigns resources
to each uop as needed. These resources include Reorder Buffer (ROB)
slots, physical registers and load store queue (LSQ) entries. In the
event that the fetch queue is empty or any of the ROB, physical register
file, load queue or store queue is full, the allocation stage stalls
until some resources become available.


\section{Reorder Buffer Entries}

The Reorder Buffer (ROB) in the PTLsim out of order model works exactly
like a traditional ROB: as a queue, entries are allocated from the
tail and committed from the head. Each \texttt{\small ReorderBufferEntry}
structure is the central tracking structure for uops in the pipeline.
This structure contains a variety of fields including:

\begin{itemize}
\item The decoded uop (\texttt{\small uop} field). This is the fully decoded
\texttt{\small TransOp} augmented with fetch-related information like
the uop's UUID, RIP and branch predictor information as described
in the Fetch stage (Section \ref{sec:FetchStage}).
\item Current state of the ROB entry and uop (\texttt{\small current\_state\_list};
see below)
\item Pointers to the physical register (\texttt{\small physreg}), LSQ entry
(\texttt{\small lsq}) and other resources allocated to the uop
\item Pointers to the three physical register operands to the uop, as well
as a possible store dependency used in replay scheduling (described
later)
\item Various cycle counters and related fields for simulating progress
through the pipeline
\end{itemize}

\subsection{ROB States}

Each ROB entry and corresponding uop can be in one of a number of
states describing its progress through the simulator state machine.
ROBs are linked into linked lists according to their current state;
these lists are named \texttt{\footnotesize rob\_}\emph{statename}\texttt{\footnotesize \_list}.
The \texttt{\footnotesize current\_state\_list} field specifies the
list the ROB is currently on. ROBs can be moved between states using
the \texttt{\footnotesize ROB::changestate(}\texttt{\emph{\footnotesize statelist}}\texttt{\footnotesize )}
method. The specific states will be described below as they are encountered.

\textbf{\emph{NOTE:}} the terms {}``ROB entry'' (singular) and {}``uop''
are used interchangeably from now on unless otherwise stated, since
there is a 1:1 mapping between the two.


\section{\label{sec:PhysicalRegisters}Physical Registers}


\subsection{Physical Registers}

Physical registers are represented in PTLsim by the \texttt{\small PhysicalRegister}
structure. Physical registers store several components:

\begin{itemize}
\item Index of the physical register (\texttt{\small idx}) and the physical
register file id (\texttt{\small rfid}) to which it belongs
\item The actual 64-bit register data
\item x86 flags: Z, P, S, O, C. These are discussed below in Section \ref{sub:FlagsManagement}.
\item Waiting flag (\texttt{\small FLAG\_WAIT}) for results not yet ready
\item Invalid flag (\texttt{\small FLAG\_INVAL}) for ready results which
encountered an exception. The exception code is written to the data
field in lieu of the real result
\item Current state of the physical register (\texttt{\small state})
\item ROB currently owning this physical register, or architectural register
mapping this physical register
\item Reference counter for the physical register. This is required for
reasons described in Section \ref{sub:PhysicalRegisterRecyclingComplications}.
\end{itemize}

\subsection{Physical Register File}

PTLsim uses a flexible physical register file model in which multiple
physical register files with different sizes and properties can optionally
be defined. Each physical register file in the \texttt{\small OutOfOrderCore::physregfiles{[}]}
array can be made accessible from one or more clusters. For instance,
uops which execute on floating point clusters can be forced to always
allocate a register in the floating point register file, or each cluster
can have a dedicated register file.

Various heuristics can also be used for selecting the register file
into which a result is placed. The default heuristic simply finds
the first acceptable physical register file with a free register.
Acceptable physical register files are those register files in which
the uop being allocated is allowed to write its result; this is configurable
based on clustering as described below. Other allocation policies,
such as alternation between available register files and dependency
based register allocation, are all possible by modifying the \texttt{\small rename()}
function where physical registers are allocated..

In each physical register file, physical register number 0 is defined
as the \emph{null register:} it always contains the value zero and
is used as an operand anywhere the zero value (or no value at all)
is required.

Physical register files are configured in \texttt{\small ooocore.h}.
The \texttt{\small PhysicalRegisterFile{[}]} array is defined to declare
each register file by name, register file ID (RFID, from 0 to the
number of register files) and size. The \texttt{\small MAX\_PHYS\_REG\_FILE\_SIZE}
parameter must be greater than the largest physical register in the
processor.


\subsection{Physical Register States}

Each physical register can be in one of several states at any given
time. For each physical register file, PTLsim maintains linked lists
(the \texttt{\small PhysicalRegisterFile.states{[}}\emph{statename}\texttt{\small ]}
lists) to track which registers are in each state. The \texttt{\small state}
field in each physical register specifies its state, and implies that
the physical register is on the list \texttt{\small physregfiles{[}physreg.}\texttt{\textbf{\small rfid}}\texttt{\small ].states{[}physreg.}\texttt{\textbf{\small state}}\texttt{\small ]}.
The valid states are:

\begin{itemize}
\item \textbf{\emph{free:}} the register is not allocated to any uop.
\item \textbf{\emph{waiting:}} the register has been allocated to a uop
but that uop is waiting to issue.
\item \textbf{\emph{bypass:}} the uop associated with the register has issued
and produced a value (or encountered an exception), but that value
is only on the bypass network - it has not actually been written back
yet. For simulation purposes only, uops immediately write their results
into the physical register as soon as they issue, even though technically
the result is still only on the bypass network. This helps simplify
the simulator considerably without compromising accuracy.
\item \textbf{\emph{written:}} the uop associated with the register has
passed through the writeback stage and the value of the physical register
is now up to date; all future consumers will read the uop's result
from this physical register.
\item \textbf{\emph{arch:}} the physical register is currently mapped to
one of the architectural registers; it has no associated uop currently
in the pipeline
\item \textbf{\emph{pendingfree:}} this is a special state described in
Section \ref{sub:PhysicalRegisterRecyclingComplications}.
\end{itemize}
One physical register is allocated to each uop and moved into the
\emph{waiting} state, regardless of which type of uop it is. For integer,
floating point and load uops, the physical register holds the actual
numerical value generated by the corresponding uop. Branch uops place
the target RIP of the branch in a physical register. Store uops place
the merged data to store in the register. Technically branches and
stores do not need physical registers, but to keep the processor design
simple, they are allocated registers anyway.


\section{\label{sec:LoadStoreQueueEntry}Load Store Queue Entries}

Load Store Queue (LSQ) Entries (the \texttt{\small LoadStoreQueueEntry}
structure in PTLsim) are used to track additional information about
loads and stores in the pipeline that cannot be represented by a physical
register. Specifically, LSQ entries track:

\begin{itemize}
\item \textbf{Physical address} of the corresponding load or store
\item \textbf{Data} field (64 bits) stores the loaded value (for loads)
or the value to store (for stores)
\item \textbf{Address valid} bit flag indicates if the load or store knows
its effective physical address yet. If set, the physical address field
is valid.
\item \textbf{Data valid} bit flag indicates if the data field is valid.
For loads, this is set when the data has arrived from the cache. For
stores, this is set when the data to store becomes ready and is merged.
\item \textbf{Invalid} bit flag is set if an exception occurs in the corresponding
load or store.
\end{itemize}
The \texttt{\small LoadStoreQueueEntry} structure is technically a
superset of a structure known as an \emph{SFR} (Store Forwarding Register),
which completely represents any load or store and can be passed between
PTLsim subsystems easily. One LSQ entry is allocated to each load
or store during the Allocate stage.

In real processors, the load queue (LDQ) and store queue (STQ) are
physically separate for circuit complexity reasons. However, in PTLsim
a unified LSQ is used to make searching operations easier. One additional
bit flag (\texttt{\small store} bit) specifies whether an LSQ entry
is a load or store.


\subsection{\label{sub:RegisterRenaming}Register Renaming}

The basic register renaming process in the PTLsim x86 model is very
similar to classical register renaming, with the exception of the
flags complications described in Section \ref{sub:FlagsManagement}.
Two versions of the register rename table (RRT) are maintained: a
\emph{speculative RRT} which is updated as uops are renamed, and a
\emph{commit RRT}, which is only updated when uops successfully commit.
Since the simulator implements a unified physical and architectural
register file, the commit process does not actually involve any data
movement between physical and architectural registers: only the commit
RRT needs to be updated. The commit RRT is used only for exception
and branch mispredict recovery, since it holds the last known good
mapping of architectural to physical registers.

Each rename table contains 80 entries as shown in Table \ref{table:ArchitecturalRegisters}.
This table maps architectural registers and pseudo-registers to the
most up to date physical registers for the following:

\begin{itemize}
\item 16 x86-64 integer registers
\item 16 128-bit SSE registers (represented as separate 64-bit high and
low halves)
\item ZAPS, CF, OF flag sets described in Section \ref{sub:FlagsManagement}.
These rename table entries point to the physical register (with attached
flags) of the most recent uop in program order to update any or all
of the ZAPS, CF, OF flag sets, respectively.
\item Various integer and x87 status registers
\item Temporary pseudo-registers \texttt{\small temp0}-\texttt{\small temp7}
not visible to x86 code but required to hold temporaries (e.g. generated
addresses or value to swap in \texttt{\small xchg} instructions).
\item Special fixed values, e.g. \texttt{\small zero}, \texttt{\small imm}
(value is in immediate field), \texttt{\small mem} (destination of
stores)
\end{itemize}
%
\begin{table}
\caption{\label{table:ArchitecturalRegisters}Architectural registers and pseudo-registers
used for renaming.}


\noindent \begin{centering}
\begin{tabular}{|l|l|l|l|l|l|l|l|l|}
\hline 
\multicolumn{9}{|c|}{\textsf{\small Architectural Registers and Pseudo-Registers}}\tabularnewline
\hline
\hline 
\textsf{\small 0} & \texttt{\footnotesize rax} & \texttt{\footnotesize rcx} & \texttt{\footnotesize rdx} & \texttt{\footnotesize rbx} & \texttt{\footnotesize rsp} & \texttt{\footnotesize rbp} & \texttt{\footnotesize rsi} & \texttt{\footnotesize rdi}\tabularnewline
\hline 
\textsf{\small 8} & \texttt{\footnotesize r8} & \texttt{\footnotesize r9} & \texttt{\footnotesize r10} & \texttt{\footnotesize r11} & \texttt{\footnotesize r12} & \texttt{\footnotesize r13} & \texttt{\footnotesize r14} & \texttt{\footnotesize r15}\tabularnewline
\hline 
\textsf{\small 16} & \texttt{\footnotesize xmml0} & \texttt{\footnotesize xmmh0} & \texttt{\footnotesize xmml1} & \texttt{\footnotesize xmmh1} & \texttt{\footnotesize xmml2} & \texttt{\footnotesize xmmh2} & \texttt{\footnotesize xmml3} & \texttt{\footnotesize xmmh3}\tabularnewline
\hline 
\textsf{\small 24} & \texttt{\footnotesize xmml4} & \texttt{\footnotesize xmmh4} & \texttt{\footnotesize xmml5} & \texttt{\footnotesize xmmh5} & \texttt{\footnotesize xmml6} & \texttt{\footnotesize xmmh6} & \texttt{\footnotesize xmml7} & \texttt{\footnotesize xmmh7}\tabularnewline
\hline 
\textsf{\small 32} & \texttt{\footnotesize xmml8} & \texttt{\footnotesize xmmh8} & \texttt{\footnotesize xmml9} & \texttt{\footnotesize xmmh9} & \texttt{\footnotesize xmml10} & \texttt{\footnotesize xmmh10} & \texttt{\footnotesize xmml11} & \texttt{\footnotesize xmmh11}\tabularnewline
\hline 
\textsf{\small 40} & \texttt{\footnotesize xmml12} & \texttt{\footnotesize xmmh12} & \texttt{\footnotesize xmml13} & \texttt{\footnotesize xmmh13} & \texttt{\footnotesize xmml14} & \texttt{\footnotesize xmmh14} & \texttt{\footnotesize xmml15} & \texttt{\footnotesize xmmh15}\tabularnewline
\hline 
\textsf{\small 48} & \texttt{\footnotesize fptos} & \texttt{\footnotesize fpsw} & \texttt{\footnotesize fptags} & \texttt{\footnotesize fpstack} & \texttt{\footnotesize tr4} & \texttt{\footnotesize tr5} & \texttt{\footnotesize tr6} & \texttt{\footnotesize ctx}\tabularnewline
\hline 
\textsf{\small 56} & \texttt{\footnotesize rip} & \texttt{\footnotesize flags} & \texttt{\footnotesize iflags} & \texttt{\footnotesize selfrip} & \texttt{\footnotesize nextrip} & \texttt{\footnotesize ar1} & \texttt{\footnotesize ar2} & \texttt{\footnotesize zero}\tabularnewline
\hline
\hline 
\textsf{\small 64} & \texttt{\footnotesize temp0} & \texttt{\footnotesize temp1} & \texttt{\footnotesize temp2} & \texttt{\footnotesize temp3} & \texttt{\footnotesize temp4} & \texttt{\footnotesize temp5} & \texttt{\footnotesize temp6} & \texttt{\footnotesize temp7}\tabularnewline
\hline 
\textsf{\small 72} & \texttt{\footnotesize zf} & \texttt{\footnotesize cf} & \texttt{\footnotesize of} & \texttt{\footnotesize imm} & \texttt{\footnotesize mem} & \texttt{\footnotesize temp8} & \texttt{\footnotesize temp9} & \texttt{\footnotesize temp10}\tabularnewline
\hline
\end{tabular}
\par\end{centering}
\end{table}


Once the uop's three architectural register sources are mapped to
physical registers, these physical registers are placed in the \texttt{\small operands{[}}0,1,2\texttt{\small ]}
fields. The fourth operand field, \texttt{\small operands{[}}3\texttt{\small ]},
is used to hold a store buffer dependency for loads and stores; this
will be discussed later. The speculative RRT entries for both the
destination physical register and any modified flags are then overwritten.
Finally, the ROB is moved into the \textbf{\emph{frontend}} state.


\subsection{External State}

Since the rest of the simulator outside of the out of order core does
not know about the RRTs and expects architectural registers to be
in a standardized format, the per-core \texttt{\footnotesize Context}
structure is used to house the architectural register file. These
architectural registers, including \texttt{\footnotesize REG\_flags}
and \texttt{\footnotesize REG\_rip}, are directly updated in program
order by the out of order core as instructions commit.


\section{Frontend Stages}

To simulate various processor frontend pipeline depths, ROBs are placed
in the \emph{frontend} state for a user-selectable number of cycles.
In the \texttt{\footnotesize frontend()} function, the \texttt{\footnotesize cycles\_left}
field in each ROB is decremented until it becomes zero. At this point,
the uop is moved to the \textbf{\emph{ready\_to\_dispatch}} state.
This feature can be used to simulate various branch mispredict penalties
by setting the \texttt{\footnotesize FRONTEND\_STAGES} constant.


\chapter{\label{sec:ClusterDispatchScheduleIssue}Scheduling, Dispatch and
Issue}


\section{\label{sec:Clustering}Clustering and Issue Queue Configuration}

The PTLsim out of order model can simulate an arbitrarily complex
set of functional units grouped into \emph{clusters}. Clusters are
specified by the \texttt{\footnotesize Cluster} structure and are
defined by the \texttt{\footnotesize clusters{[}]} array in \texttt{\footnotesize ooocore.h}.
Each \texttt{\footnotesize Cluster} element defines the name of the
cluster, which functional units belong to the cluster (\texttt{\footnotesize fu\_mask}
field) and the maximum number of uops that can be issued in that cluster
each cycle (\texttt{\footnotesize issue\_width} field)

The \texttt{\footnotesize intercluster\_latency\_map} matrix defines
the forwarding latency, in cycles, between a given cluster and every
other cluster. If \texttt{\footnotesize intercluster\_latency\_map{[}}\emph{A}\texttt{\footnotesize ]{[}}\emph{B}\texttt{\footnotesize ]}
is \emph{L} cycles, this means that functional units in cluster \emph{B}
must wait \emph{L} cycles after a uop \emph{U} in cluster A completes
before cluster B's functional units can issue a uop dependent on \emph{U}'s
result. If the latency is zero between clusters \emph{A} and \emph{B},
producer and consumer uops in \emph{A} and \emph{B} can always be
issued back to back in subsequent cycles. Hence, the diagonal of the
forwarding latency matrix is always all zeros.

This clustering mechanism can be used to implement several features
of modern microprocessors. First, traditional clustering is possible,
in which it takes multiple additional cycles to forward results between
different clusters (for instance, one or more integer clusters and
a floating point unit). Second, several issue queues and corresponding
issue width limits can be defined within a given virtual cluster,
for instance to sort loads, stores and ALU operations into separate
issue queues with different policies. This is done by specifying an
inter-cluster latency of zero cycles between the relevant pseudo-clusters
with separate issue queues. Both of these uses are required to accurately
model most modern processors.

There is also an equivalent \texttt{\footnotesize intercluster\_bandwidth\_map}
matrix to specify the maximum number of values that can be routed
between any two clusters each cycle.

The \texttt{\footnotesize IssueQueue} template class is used to declare
issue queues; each cluster has its own issue queue. The syntax \texttt{\footnotesize IssueQueue<}\emph{size}\texttt{\footnotesize >}
\texttt{\footnotesize issueq\_}\emph{name}\texttt{\small ;} is used
to declare an issue queue with a specific size. In the current implementation,
the size can be from 1 to 64 slots. The macros \texttt{\footnotesize foreach\_issueq()},
\texttt{\footnotesize sched\_get\_all\_issueq\_free\_slots()} and
\texttt{\footnotesize issueq\_operation\_on\_cluster\_with\_result()}
macros must be modified if the cluster and issue queue configuration
is changed to reflect all available clusters; the modifications required
should be obvious from the example code. These macros with switch
statements are required instead of a simple array since the issue
queues can be of different template types and sizes.


\section{Cluster Selection}

The \texttt{\footnotesize ReorderBufferEntry::select\_cluster()} function
is responsible for routing a given uop into a specific cluster at
the time it is dispatched; uops do not switch between clusters after
this.

Various heuristics are employed to select which cluster a given uop
should be routed to. In the reference implementation provided in \texttt{\footnotesize ooopipe.cpp},
a weighted score is generated for each possible cluster by scanning
through the uop's operands to determine which cluster they will be
forwarded from. If a given operand's corresponding producer uop \emph{S}
is currently either dispatched to cluster \emph{C} but waiting to
execute or is still on the bypass network of cluster \emph{C}, then
cluster \emph{C}'s score is incremented. 

The final cluster is selected as the cluster with the highest score
out of the set of clusters which the uop can actually issue on (e.g.
a floating point uop cannot issue on a cluster with only integer units).
The \texttt{\footnotesize ReorderBufferEntry::executable\_on\_cluster\_mask}
bitmap can be used to further restrict which clusters a uop can be
dispatched to, for instance because certain clusters can only write
to certain physical register files. This mechanism is designed to
route each uop to the cluster in which the majority of its operands
will become available at the earliest time; in practice it works quite
well and variants of this technique are often used in real processors.


\section{\label{sec:Scheduling}Issue Queue Structure and Operation}

PTLsim implements issue queues in the \texttt{\footnotesize IssueQueue}
template class using the collapsing priority queue design used in
most modern processors. 

As each uop is dispatched, it is placed at the end of the issue queue
for its cluster and several associative arrays are updated to reflect
which operands the uop is still waiting for. In the IssueQueue class,
the \texttt{\small insert()} method takes the ROB index of the uop
(its \emph{tag} in issue queue terminology), the tags (ROB indices)
of its operands, and a map of which of the operands are ready versus
waiting. The ROB index is inserted into an associative array, and
the ROB index tags of any waiting operands are inserted into corresponding
slots in parallel arrays, one array per operand (in the current implementation,
up to 4 operands are tracked). If an operand was ready at dispatch
time, the slot for that operand in the corresponding array is marked
as invalid since there is no need to wake it up later. Notice that
the new slot is always at the end of the issue queue array; this is
made possible by the collapsing mechanism described below.

The issue queue maintains two bitmaps to track the state of each slot
in the queue. The \texttt{\small valid} bitmap indicates which slots
are occupied by uops, while the \texttt{\small issued} bitmap indicates
which of those uops have been issued. Together, these two bitmaps
form the state machine described in Table \ref{table:IssueQueueStateMachine}.

%
\begin{table}
\caption{\label{table:IssueQueueStateMachine}Issue Queue State Machine}


\noindent \begin{centering}
\begin{tabular}{|c|c|l|}
\hline 
Valid & Issued & Meaning\tabularnewline
\hline
\hline 
\texttt{\small 0} & \texttt{\small 0} & Unused slot\tabularnewline
\hline 
\texttt{\small 0} & \texttt{\small 1} & (invalid)\tabularnewline
\hline 
\texttt{\small 1} & \texttt{\small 0} & Dispatched but waiting for operands\tabularnewline
\hline 
\texttt{\small 1} & \texttt{\small 1} & Issued to a functional unit but not yet completed\tabularnewline
\hline
\end{tabular}
\par\end{centering}
\end{table}


After \texttt{\footnotesize insert()} is called, the slot is placed
in the dispatched state. As each uop completes, its tag (ROB index)
is broadcast using the \texttt{\footnotesize broadcast()} method to
one or more issue queues accessible in that cycle. Because of clustering,
some issue queues will receive the broadcast later than others; this
is discussed below. Each slot in each of the four operand arrays is
compared against the broadcast value. If the operand tag in that slot
is valid and matches the broadcast tag, the slot (in one of the operand
arrays only, not the entire issue queue) is invalidated to indicate
it is ready and no longer waiting for further broadcasts.

Every cycle, the \texttt{\footnotesize clock()} method uses the \texttt{\small valid}
and \texttt{\small issued} bitmaps together with the valid bitmaps
of each of the operand arrays to compute which issue queue slots in
the dispatched state are no longer waiting on any of their operands.
This bitmap of ready slots is then latched into the \texttt{\small allready}
bitmap.

The \texttt{\footnotesize issue()} method simply finds the index of
the first set bit in the \texttt{\small allready} bitmap (this is
the slot of the oldest ready uop in program order), marks the corresponding
slot as issued, and returns the slot. The processor then selects a
functional unit for the uop in that slot and executes it via the \texttt{\footnotesize ReorderBufferEntry::issue()}
method. After the uop has completed execution (i.e. it cannot possibly
be replayed), the \texttt{\footnotesize release()} method is called
to remove the slot from the issue queue, freeing it up for incoming
uops in the dispatch stage. The collapsing design of the issue queue
means that the slot is not simply marked as invalid - all slots after
it are physically shifted left by one, leaving a free slot at the
end of the array. This design is relatively simple to implement in
hardware and makes determining the oldest ready to issue uop very
trivial.

Because of the collapsing mechanism, it is critical to note that the
slot index returned by \texttt{\footnotesize issue()} will become
invalid after the next call to the \texttt{\footnotesize remove()}
method; hence, it should never be stored anywhere if a slot could
be removed from the issue queue in the meantime.

If a uop issues but determines that it cannot actually complete at
that time, it must be \emph{replayed}. The \texttt{\footnotesize replay()}
method clears the issued bit for the uop's issue queue slot, returning
it to the dispatched state. The replay mechanism can optionally add
additional dependencies such that the uop is only re-issued after
those dependencies are resolved. This is important for loads and stores,
which may need to add a dependency on a prior store queue entry after
finding a matching address in the load or store queues. In rare cases,
a replay may also be required when a uop is issued but no applicable
functional units are left for it to execute on. The \texttt{\footnotesize ReorderBufferEntry::replay()}
method is a wrapper around \texttt{\footnotesize IssueQueue::replay()}
used to collect the operands the uop is still waiting for.


\subsection{Implementation}

PTLsim uses a novel method of modeling the issue queue and other associative
structures with small tags. Specifically, the \texttt{\footnotesize FullyAssociativeArrayTags8bit}
template class declared in \texttt{\footnotesize logic.h} and used
to build the issue queue makes use of the host processor's 128-bit
vector (SSE) instructions to do massively parallel associative matching,
masking and bit scanning on up to 16 tags every clock cycle. This
makes it substantially faster than simulators using the naive approach
of scanning the issue queue entries linearly. Similar classes in \texttt{\small logic.h}
support O(1) associative searches of both 8-bit and 16-bit tags; tags
longer than this are generally more efficient if the generic \texttt{\footnotesize FullyAssociativeArrayTags}
using standard integer comparisons is used instead.

As a result of this high performance design, each issue queue is limited
to 64 entries and the tags to be matched must be between 0 and 255
to fit in 8 bits. The \texttt{\footnotesize FullyAssociativeArrayTags16bit}
class can be used instead if longer tags are required, at the cost
of reduced simulation performance. To enable this, \texttt{\small BIG\_ROB}
must be defined in \texttt{\small ooocore.h}.


\subsection{Other Designs}

It's important to remember that the issue queue design described above
is \emph{one} possible implemention out of the many designs currently
used in industry and research processors. For instance, in lieu of
the collapsing design (used by the Pentium 4 and Power4/5/970), the
AMD K8 uses a sequence number tag of the ROB and comparator logic
to select the earliest ready instruction. Similarly, the Pentium 4
uses a set of bit vectors (a \emph{dependency matrix}) instead of
tag broadcasts to wake up instructions. These other approaches may
be implemented by modifying the \texttt{\footnotesize IssueQueue}
class as appropriate.


\section{\label{sec:Issue}Issue}

The \texttt{\footnotesize issue()} top-level function issues one or
more instructions in each cluster from each issue queue every cycle.
This function consults the \texttt{\footnotesize clusters{[}}\emph{clusterid}\texttt{\footnotesize ].issue\_width}
field defined in \texttt{\footnotesize ooocore.h} to determine the
maximum number of uops to issue from each cluster. The \texttt{\footnotesize issueq\_operation\_on\_cluster\_with\_result(cluster,
iqslot, issue())} macro (Section \ref{sec:Clustering}) is used to
invoke the \texttt{\small issue()} method of the appropriate cluster
to select the earliest ready issue queue slot, as described in Section
\ref{sec:Scheduling}. 

The \texttt{\footnotesize ReorderBufferEntry::issue()} method of the
corresponding ROB entry is then called to actually execute the uop.
This method first makes sure a functional unit is available within
the cluster that's capable of executing the uop; if not, the uop is
replayed and re-issued again on the next cycle. At this point, the
uop's three operands (\texttt{\footnotesize ra}, \texttt{\footnotesize rb},
\texttt{\footnotesize rc}) are read from the physical register file.
If any of the operands are invalid, the entire uop is marked as invalid
with an \texttt{\small EXCEPTION\_Propagate} result and is not further
executed. Otherwise, the uop is executed by calling the synthesized
execute function for the uop (see Section \ref{sec:FetchStage}).

Loads and stores are handled specially by calling the \texttt{\footnotesize issueload()}
or \texttt{\footnotesize issuestore()} method. Since loads and stores
can encounter an mis-speculation (e.g. when a load is erroneously
issued before an earlier store to the same addresses), the \texttt{\footnotesize issueload()}
and \texttt{\footnotesize issuestore()} functions can return \texttt{\footnotesize ISSUE\_MISSPECULATED}{\footnotesize{}
}to force all uops in program order after the mis-speculated uop to
be annulled and sent through the pipeline again. Similarly, if \texttt{\footnotesize issueload()}
or \texttt{\footnotesize issuestore()} return \texttt{\footnotesize ISSUE\_NEEDS\_REPLAY},
issuing from that cluster is aborted since the uop has been replayed
in accordance with Section \ref{sec:Scheduling}. It is important
to note that loads which miss the cache are considered to complete
successfully and do \emph{not} require a replay; their physical register
is simply marked as waiting until the load arrives. In both the mis-speculation
and replay cases, no further uops from the cluster's issue queue are
dispatched until the next cycle.

Branches are handled similar to integer and floating point operations,
except that they may cause a mis-speculation in the event of a branch
misprediction; this is discussed below.

If the uop caused an exception, we force it directly to the commit
stage and not through writeback; this keeps dependencies waiting until
they can be properly annulled by the speculation recovery logic. The
commit stage will detect the exception and take appropriate action.
If the exceptional uop was speculatively executed beyond a branch,
it will never reach commit anyway since the bogus branch would have
to commit before the exception would even become visible.

\textbf{\emph{NOTE:}} In PTLsim, all issued uops put their result
in the uop's assigned physical register at the time of issue, even
though the data technically does not appear there until writeback
(i.e. the physical register enters the \emph{written} state). This
is done to simplify the simulator implementation; it is assumed that
any data {}``read'' from physical registers before writeback is
in fact being read from the bypass network instead.


\chapter{\label{sec:SpeculationAndRecovery}Speculation and Recovery}


\section{Misspeculation Cases}

PTLsim supports three speculative execution recovery mechanisms to
handle various types of speculation failures:

\begin{itemize}
\item \textbf{Replay} is for scheduling and dependency mis-predictions only.
Replayed uops remain in the issue queue so replay is very fast but
limited in scope. Replay is described extensively in Section \ref{sec:ClusterDispatchScheduleIssue}.
\item \textbf{Redispatch} finds the slice of uops in the ROB dependent on
a mis-speculated uop and sends only those dependent uops back to the
\emph{ready-to-dispatch} state. It is used for load-store aliasing
recovery, value mispredictions and other cases where the fetched uops
themselves are still valid, but their outputs are invalid.
\item \textbf{Annulment} removes any uops in program order after (or optionally
including) a given uop. It is used for branch mispredictions and misalignment
recovery.
\end{itemize}

\section{\label{sec:Redispatch}Redispatch }


\subsection{Redispatch Process}

Many types of mis-speculations do not require refetching a different
set of uops; instead, any uops dependent on a mis-speculated uop can
simply be recirculated through the pipeline so they can re-execute
and produce correct values. This process is known as \emph{redispatch};
in the baseline out of order core, it is used to recover from load-store
aliasing (Section \ref{sub:AliasCheck}).

When a mis-speculated ROB is detected, \texttt{\footnotesize ROB.redispatch\_dependents()}
is called. This function identifies the slice of uops that consumed
values (directly or indirectly) from the mis-speculated uop, using
dependency bitmaps similar to those used in real processors. \texttt{\footnotesize ROB.redispatch\_dependents(bool
inclusive)} has an \emph{inclusive} parameter: if false, only the
dependent uops are redispatched, not including the mis-speculated
uop. This is most useful for value prediction, where the correct value
can be directly reinjected into the mis-speculated uop's physical
register without re-executing it.

In \texttt{\footnotesize ROB.redispatch()}, each affected uop is placed
back into the \texttt{\footnotesize rob\_ready\_to\_dispatch}{\small{}
}state, lways in program order. This helps to avoid deadlocks, since
the redispatched slice is given priority for insertion back into the
issue queue. The resources associated with each uop (physical register,
LDQ/STQ slot, IQ slot, etc.) are also restored to the state they were
in immediately after renaming, so they can be properly recirculated
through the pipeline as if the uop never issued. Various other issues
must also be handled, such as making sure known store-to-load aliasing
constraints are preserved across the redispatch so as to avoid infinite
replay loops, and branch directions must be corrected if a mispredict
caused a fetch unit redirection but that mispredict was in fact based
on mis-speculated data.


\subsection{Deadlock Recovery}

Redispatch can create deadlocks in cases where other unrelated uops
occupy all the issue queue slots needed by the redispatched uops to
make forward progress, and there is a circular dependency loop (e.g.
on loads and stores not known at the time of the redispatch) that
creates a chicken-and-egg problem, thus blocking forward progress.

To recover from this situation, we detect the case where no uops have
been dispatched for 64 cycles, yet the \texttt{\footnotesize ready\_to\_dispatch}
queue still has valid uops. This situation very rarely happens in
practice unless there is a true deadlock. To break up the deadlock,
ideally we should only need to redispatch all uops occupying issue
queue slots or those already waiting for dispatch - all others have
produced a result and cannot block the issue queues again. However,
this does not always work in pathological cases, and can sometime
lead to repeated deadlocks. Since deadlocks are very infrequent, they
can be resolved by just flushing the entire pipeline. This has a negligible
impact on performance.


\subsection{Statistical Counters}

Several statistical counters are maintained in the PTLsim statistics
tree to measure redispatch overhead, in the \texttt{\footnotesize ooocore.dispatch.redispatch}
node:

\begin{itemize}
\item \texttt{\footnotesize deadlock-flushes} measures how many times the
pipeline must be flushed to resolve a deadlock.
\item \texttt{\footnotesize trigger-uops} measures how many uops triggered
redispatching because of a misspeculation. This number does not count
towards the statistics below.
\item \texttt{\footnotesize dependent-uops} is a histogram of how many uops
depended on each trigger uop, not including the trigger uop itself.
\end{itemize}

\section{Annulment}


\subsection{Branch Mispredictions}

Branch mispredictions form the bulk of all mis-speculated operations.
Whenever the actual RIP returned by a branch uop differs from the
\texttt{\footnotesize riptaken} field of the uop, the branch has been
mispredicted. This means all uops after (but \emph{not} including)
the branch must be annulled and removed from all processor structures.
The fetch queue (Section \ref{sec:FetchStage}) is then reset and
fetching is redirected to the correct branch target. However, all
uops in program order before the branch are still correct and may
continue executing.

Note that we do \emph{not} just reissue the branch: this would be
pointless, as we already know the correct RIP since the branch uop
itself has already executed once. Instead, we let it writeback and
commit as if it were predicted correctly.


\subsection{\label{sec:SpeculationRecovery}Annulment Process}

In PTLsim, the \texttt{\footnotesize ReorderBufferEntry::annul()}
method removes any and all ROBs that entered the pipeline after and
optionally including the misspeculated uop (depending on the \texttt{\small keep\_misspec\_uop}
argument). Because this method moves all affected ROBs to the free
state, they are instantly taken out of consideration for future pipeline
stages and will be dropped on the next cycle.

We must be extremely careful to annul all uops in an x86 macro-op;
otherwise half the x86 instruction could be executed twice once refetched.
Therefore, if the first uop to annul is not also the first uop in
the x86 macro-op, we may have to scan backwards in the ROB until we
find the first uop of the macro-op. In this way, we ensure that we
can annul the entire macro-op. All uops comprising the macro-op are
guaranteed to still be in the ROB since none of the uops can commit
until the entire macro-op can commit. Note that this does not apply
if the final uop in the macro-op is a branch and that branch uop itself
is being retained as occurs with mispredicted branches.

The first uop to annul is determined in the \texttt{\footnotesize annul()}
method by scanning backwards in time from the excepting uop until
a uop with its SOM (start of macro-op) bit is set, as described in
Section \ref{sec:UopIntro}. This SOM uop represents the boundary
between x86 instructions, and is where we start annulment. The end
of the range of uops to annul is at the tail of the reorder buffer.

We have to reconstruct the speculative RRT as it existed just before
the first uop to be annulled was renamed. This is done by calling
the \texttt{\footnotesize pseudocommit()} method of each annulled
uop to implement the {}``fast flush with pseudo-commit'' algorithm
as follows. First, we overwrite the speculative RRT with the committed
RRT. We then \emph{simulate} the commitment of all non-speculative
ROBs up to the first uop to be annulled by updating the speculative
RRT as if it were the commit RRT. This brings the speculative RRT
to the same state as if all in flight nonspeculative operations before
the first uop to be annulled had actually committed. Fetching is then
resumed at the correct RIP, where new uops are renamed using the recovered
speculative RRT.

Other methods of RRT reconstruction (like backwards walk with saved
checkpoint values) are difficult to impossible because of the requirement
that flag rename tables be restored even if some of the required physical
registers with attached flags have since been freed. Technically RRT
checkpointing could be used but due to the load/store replay mechanism
in use, this would require a checkpoint at every load and store as
well as branches. Hence, the forward walk method seems to offer the
best performance in practice and is quite simple. The Pentium 4 is
believed to use a similar method of recovering from some types of
mis-speculations.

After reconstructing the RRT, for each ROB to annul, we broadcast
the ROB index to the appropriate cluster's issue queue, allowing the
issue queue to purge the slot of the ROB being annulled. Finally,
for each annulled uop, we free any resources allocated to it (i.e.,
the ROB itself, the destination physical register, the load/store
queue entry (if any) and so on. Any updates to the branch predictor
and return address stack made during the speculative execution of
branches are also rolled back.

Finally, the fetch unit is restarted at the correct RIP and uops enter
the pipeline and are renamed according to the recovered rename tables
and allocated resource maps.


\chapter{\label{sec:IssuingLoads}Load Issue}


\section{\label{sec:AddressGeneration}Address Generation}

Loads and stores both have their physical addresses computed using
the \texttt{\footnotesize ReorderBufferEntry::addrgen()} method, by
adding the \texttt{\footnotesize ra} and \texttt{\footnotesize rb}
operands. If the load or store is one of the special unaligned fixup
forms (\texttt{\footnotesize ld.lo}, \texttt{\footnotesize ld.hi},
\texttt{\footnotesize st.lo}, \texttt{\footnotesize st.hi}) described
in Section \ref{sub:UnalignedLoadsAndStores}, the address is re-aligned
according to the type of instruction.

At this point, the \texttt{\footnotesize check\_and\_translate()}
method is used to translate the virtual address into a mapped physical
address using the page tables and TLB. The function of this method
varies significantly between userspace-only PTLsim and full system
PTLsim/X. In userspace-only PTLsim, the shadow page access tables
(Section \ref{sec:AddressSpaceSimulation}) are used to do access
checks; the same virtual address is then returned to use as a physical
address. In full system PTLsim/X, the real x86 page tables are used
to produce the physical address, significantly more involved checks
are done, and finally a pointer into PTLsim's mapping of all physical
pages is returned (see Section \ref{sub:FullSystemPageTranslation}).

If the virtual address is invalid or not present for the specified
access type, \texttt{\footnotesize check\_and\_translate()} will return
a null pointer. At this point, \texttt{\footnotesize handle\_common\_load\_store\_exceptions()}
is called to take action as follows.

If a given load or store accesses an unaligned address but is not
one of the special \texttt{\footnotesize ld.lo}/\texttt{\footnotesize ld.hi}/\texttt{\footnotesize st.lo}/\texttt{\footnotesize st.hi}
uops described in Section \ref{sub:UnalignedLoadsAndStores}, the
processor responds by first setting the {}``\texttt{\footnotesize unaligned}''
bit in the original \texttt{\footnotesize TransOp} in the basic block
cache, then it annuls all uops after and including the problem load,
and finally restarts the fetch unit at the RIP address of the load
or store itself. When the load or store uop is refetched, it is transformed
into a pair of \texttt{\footnotesize ld.lo}/\texttt{\footnotesize ld.hi}
or \texttt{\footnotesize st.lo}/\texttt{\footnotesize st.hi} uops
in accordance with Section \ref{sub:UnalignedLoadsAndStores}. This
refetch approach is required rather than a simple replay operation
since a replay would require allocating two entries in the issue queue
and potentially two ROBs, which is not possible with the PTLsim design
once uops have been renamed.

If a load or store would cause a page fault for any reason, the \texttt{\footnotesize check\_and\_translate()}
function will fill in the \texttt{\footnotesize exception} and \texttt{\footnotesize pfec}
(Page Fault Error Code) variables. These two variables are then placed
into the low and high 32 bits, respectively, of the 64-bit result
in the destination physical register or store buffer, in place of
the actual data. The load or store is then aborted and execution returns
to the \texttt{\footnotesize ReorderBufferEntry::issue()} method,
causing the result to be marked with an exception (\texttt{\footnotesize EXCEPTION\_PageFaultOnRead}
or \texttt{\footnotesize EXCEPTION\_PageFaultOnWrite}).

One x86-specific complication arises at this point. If a load (or
store) uop is the high part (\texttt{\footnotesize ld.hi} or \texttt{\footnotesize st.hi})
of an unaligned load or store pair, but the actual user address did
not overlap any of the high 64 bits accessed by the \texttt{\footnotesize ld.hi}
or \texttt{\footnotesize st.hi} uop, the load or store should be completely
ignored, even if the high part overlapped onto an invalid page. This
is because it is perfectly legal to do an unaligned load or store
at the very end of a page such that the next 64 bit chunk is not mapped
to a valid page; the x86 architecture mandates that the load or store
execute correctly as far as the user program is concerned.


\section{Store Queue Check and Store Dependencies}

After doing these exception checks, the load/store queue (LSQ) is
scanned backwards in time from the current load's entry to the LSQ's
head. If a given LSQ entry corresponds to a store, the store's address
has been resolved and the memory range needed by the load overlaps
the memory range touched by the store, the load effectively has a
dependency on the earlier store that must be resolved before the load
can issue. The meaning of {}``overlapping memory range'' is defined
more specifically in Section \ref{sec:StoreMerging}.

In some cases, the addresses of one or more prior stores that a load
may depend on may not have been resolved by the time the load issues.
Some processors will stall the load uop until \emph{all} prior store
addresses are known, but this can decrease performance by incorrectly
preventing independent loads from starting as soon as their address
is available. For this reason, the PTLsim processor model aggressively
issues loads as soon as possible unless the load is predicted to frequently
alias another store currently in the pipeline. This load/store aliasing
prediction technique is described in Section \ref{sub:AliasCheck}.

In either of the cases above, in which an overlapping store is identified
by address but that store's data is not yet available for forwarding
to the load, or where a prior store's address has not been resolved
but is \emph{predicted} to overlap the load, the load effectively
has a data flow dependency on the earlier store. This dependency is
represented by setting the load's fourth \texttt{\small rs} operand
(\texttt{\small operands{[}RS]} in the \texttt{\small ReorderBufferEntry})
to the store the load is waiting on. After adding this dependency,
the \texttt{\small replay()} method is used to force the load back
to the dispatched state, where it waits until the prior store is resolved.
After the load is re-issued for a second time, the store queue is
scanned again to make sure no intervening stores arrived in the meantime.
If a different match is found this time, the load is replayed a third
time. In practice, loads are rarely replayed more than once.


\section{Data Extraction}

Once the prior store a load depends on (if any) is ready and all the
exception checks above have passed, it is time to actually obtain
the load's data. This process can be complicated since some bytes
in the region accessed by the load could come from the data cache
while other bytes may be forwarded from a prior store. If one or more
bytes need to be obtained from the data cache, the L1 cache is probed
(via the \texttt{\footnotesize caches.probe\_cache\_and\_sfr()} function)
to see if the required line is present. If so, and the combination
of the forwarded store (if any) and the L1 line fills in all bytes
required by the load, the final data can be extracted.

To extract the data, the load unit creates a 64-bit temporary buffer
by overlaying the bytes touched by the prior store (if any) on top
of the bytes obtained from the cache (i.e., the bytes at the mapped
address returned by the \texttt{\footnotesize addrgen()} function).
The correct word is then extracted and sign extended (if required)
from this buffer to form the result of the load. Unaligned loads (described
in Section \ref{sub:UnalignedLoadsAndStores}) are somewhat more complex
in that both the low and high 64 bit chunks from the \texttt{\footnotesize ld.lo}
and \texttt{\footnotesize ld.hi} uops, respectively, are placed into
a 128-bit buffer from which the final result is extracted.

For simulation purposes only, the data to load is immediately accessed
and recorded by \texttt{\footnotesize issueload()} regardless of whether
or not there is a cache miss. This makes the loaded data significantly
easier to track. In a real processor, the data extraction process
obviously only happens after the missing line actually arrives, however
our implementation in no way affects performance.


\section{\label{sec:CacheMissHandling}Cache Miss Handling}

If no combination of the prior store's forwarded bytes and data present
in the L1 cache can fulfill a load, this is miss and lower cache levels
must be accessed. This process is described in Sections \ref{sec:InitiatingCacheMiss}
and \ref{sec:FillingCacheMiss}. As far as the core is concerned,
the load is completed at this point even if the data has not yet arrived.
The issue queue entry for the load can be released since the load
is now officially in progress and cannot be replayed. Once the loaded
data has arrived, the cache subsystem calls the \texttt{\footnotesize OutOfOrderCoreCacheCallbacks::dcache\_wakeup()}
function, which marks both the physical register and LSQ entry of
the load as ready, and places the load's ROB into the \emph{completed}
state. This allows the processor to wake up dependents of the load
on the next cycle.


\chapter{Stores}


\section{\label{sec:StoreMerging}Store to Store Forwarding and Merging}

In the PTLsim out of order model, a given store may merge its data
with that of a previous store in program order. This ensures that
loads which may need to forward data from a store always reference
exactly one store queue entry, rather than having to merge data from
multiple smaller prior stores to cover the entire byte range being
loaded. In this model, physical memory is divided up into 8 byte (64
bit) chunks. As each store issues, it scans the store queue backwards
in program order to find the most recent prior store to the same 8
byte aligned physical address. If there is a match, the current store
depends on the matching prior store, and cannot complete and forward
its data to other consuming loads and stores until the prior store
in question also completes. This ensures that the current store's
data can be composited on top of the older store's data to form a
single up to date 8-byte chunk. As described in Section \ref{sec:LoadStoreQueueEntry},
each store queue entry contains a byte mask to indicate which of the
8 bytes in each chunk are currently modified by stores in flight versus
those bytes which must come from the data cache.

Technically there are more efficient approaches, such as allowing
stores to issue in any order so long as they do not overlap on the
basis of individual bytes. However, no modern processor allows such
arbitrary forwarding since the circuit complexity involved with scanning
the store queue for partial address matches would be prohibitive and
slow. Instead, most processors only support store to load forwarding
when a single larger prior store covers the entire byte range accessed
by a smaller or same sized load; all other combinations stall the
load until the overlapping prior stores commit to the data cache. 

The store inheritance scheme used by PTLsim (described first) is an
improvement to the more common {}``stall on size mismatch'' scheme
above, but may incur more store dependency replays (since stores now
depend on other stores when they target the same 8-byte chunk) compared
to a stall on size mismatch scheme. As a case study, the Pentium 4
processor (Prescott core) implements a combination of these approaches.


\section{\label{sec:SplitPhaseStores}Split Phase Stores}

The \texttt{\footnotesize ReorderBufferEntry::issuestore()} function
is responsible for issuing all store uops. Stores are unusual in that
they can issue even if their \texttt{\footnotesize rc} operand (the
value to store) is not ready at the same time as the \texttt{\small ra}
and \texttt{\small rb} operands forming the effective address. This
property is useful since it allows a store to establish an entry in
the store queue as soon as the effective address can be generated,
even if the data to store is not ready. By establishing addresses
in the store queue as soon as possible, we can avoid performance losses
associated with the unnecessary replay of loads that may depend on
a store whose address is unavailable at the time the load issues.
In effect, this means that each store uop may actually issue twice.

In the first phase issue, which occurs as soon as the \texttt{\footnotesize ra}
and \texttt{\footnotesize rb} operands become ready, the store uop
computes its effective physical address, checks that address for all
exceptions (such as alignment problems and page faults) and writes
the address into the corresponding \texttt{\footnotesize LoadStoreQueueEntry}
structure before setting its the \texttt{\footnotesize addrvalid}
bit as described in Section \ref{sec:LoadStoreQueueEntry}. If an
exception is detected at this point, the \texttt{\footnotesize invalid}
bit in the store queue entry is set and the destination physical register's
\texttt{\footnotesize FLAG\_inv} flag is set so any attempt to commit
the store will fail.


\subsection{\label{sub:AliasCheck}Load Queue Search (Alias Check)}

The load queue is then searched to find any loads after the current
store in program order which have already issued but have done so
without forwarding data from the current store. These loads erroneously
issued before the current store (now known to overlap the load's address)
was able to forward the correct data to the offending load(s). This
situation is known as \emph{aliasing}, and is effectively a mis-speculation
requiring us to reissue any uops depending on the store. The redispatch
method (Section \ref{sec:Redispatch}) is used to re-execute only
those uops dependent (either directly or indirectly) on the store.

Since the redispatch process required to correct aliasing violations
is expensive and may result in infinite loops, it is desirable to
predict in advance which loads and stores are likely to alias each
other such that loads predicted to alias are never issued when prior
stores in the store queue still have unknown addresses. This works
because in most out of order processors, statistically speaking, very
few loads alias stores compared to normal loads from the cache. When
an aliasing mis-speculation occurs, an entry is added to a small fully
associative structure (typically $\le16$ entries) called the Load
Store Alias Predictor (LSAP). This structure is indexed by a portion
of the address of the load instruction that aliased. This allows the
load unit to avoid issuing any load uop that matches any address in
the LSAP if any prior store addresses are still unresolved; if this
is the case, a dependency is created on the first unresolved store
such that the load is replayed (and the load and store queues are
again scanned) once that store resolves. Similar methods of aliasing
prediction are used by the Pentium 4 (Prescott core only) and Alpha
21264.


\subsection{Store Queue Search (Merge Check)}

At this point the store queue is searched for prior stores to the
same 8-byte block as described above in Section \ref{sec:StoreMerging};
if the store depends on a prior store, the scheduler structures are
updated to add an additional dependency (in \texttt{\small operands{[}RS]})
on this prior store before the store is replayed in accordance with
Section \ref{sec:Scheduling} to wait for the prior store to complete.
If no prior store is found, or the prior store is ready, the current
store is marked as a second phase store by setting the \texttt{\small load\_store\_second\_phase}
flag in its ROB entry. Finally, the store is replayed in accordance
with Section \ref{sec:Scheduling}.

In the second phase of store uop scheduling, the store uop is only
re-issued when all four operands (\texttt{\small ra} + \texttt{\small rb}
address, \texttt{\small rc} data and \texttt{\small rs} source store
queue entry) are valid. The second phase repeats the scan of the load
and store queues described above to catch any loads and stores that
may have issued between the first and second phase issues; the store
is replayed a third time if necessary. Otherwise, the \texttt{\small rc}
operand data is merged with the data from the prior store (if any)
store queue entry, and the combined data and bytemask is written into
the current store's store queue entry. Finally, the entry's \texttt{\small dataready}
bit is set to make the entry available for forwarding to other waiting
loads and stores.

The first and second phases may be combined into a single issue without
replay if both the address and data operands of the store are all
ready at the same time and the prior store (if any) the current store
inherits from has already successfully issued.


\chapter{Forwarding, Wakeup and Writeback}


\section{Forwarding and the Clustered Bypass Network}

Immediately after each uop is issued and the \texttt{\footnotesize ReorderBufferEntry::issue()}
method actually generates its result, the \texttt{\footnotesize cycles\_left}
field of the ROB is set to the expected latency of the uop (e.g. between
1 and 5 cycles). The uop is then moved to the \emph{issued} state
and placed on the \texttt{\footnotesize rob\_issued\_list}. Every
cycle, the \texttt{\footnotesize complete()} method iterates through
each ROB in issued state and decrements its \texttt{\footnotesize cycles\_left}
field. If \texttt{\footnotesize cycles\_left} becomes zero, the corresponding
uop has completed execution. The ROB is moved to the \emph{completed}
state (on \texttt{\footnotesize rob\_completed\_list}) and its physical
register or store queue entry is moved to the \texttt{\small bypass}
state so newly dispatched uops do not try to wait for it.

The \texttt{\footnotesize transfer()} function is also called every
cycle. This function examines the list of ROBs in the \emph{completed}
state and is responsible for broadcasting the completed ROB's tag
(ROB index) to the issue queues. Because of clustering (Section \ref{sec:Clustering}),
some issue queues will receive the broadcast later than others. Specifically,
the ROB's \texttt{\footnotesize forward\_cycle} field determines which
issue queues and remote clusters are visible \texttt{\footnotesize forward\_cycle}
cycles after the uop completed. The \texttt{\footnotesize forward()}
method, called by \texttt{\footnotesize transfer()} for each uop in
the \emph{completed} state, indexes into a lookup table \texttt{\footnotesize forward\_at\_cycle\_lut{[}}\emph{cluster}\texttt{\footnotesize ]{[}}\emph{forward\_cycle}\texttt{\footnotesize ]}
to get a bitmap of which remote clusters are accessible \texttt{\footnotesize forward\_cycle}
cycles after he uop completed, relative to the original cluster.the
uop issued in. The \texttt{\small IssueQueue::broadcast()} method
(Section \ref{sec:Scheduling}) is then called for each applicable
cluster to wake up any operands of uops in that cluster waiting on
the newly completed uop.

The \texttt{\footnotesize MAX\_FORWARDING\_LATENCY} constant (in \texttt{\footnotesize ooocore.h})
specifies the maximum number of cycles between any two clusters. After
the ROB has progressed through \texttt{\small MAX\_FORWARDING\_LATENCY}
cycles in the \emph{completed} state, it is moved to the \texttt{\footnotesize ready-to-writeback}
state, effectively meaning the result has arrived at the physical
register file and is eligible for writeback in the next cycle.


\section{Writeback}

Every cycle, the \texttt{\footnotesize writeback()} function scans
the list of ROBs in the \emph{ready-to-writeback} state and selects
at most \texttt{\footnotesize WRITEBACK\_WIDTH} results to write to
the physical register file. The \texttt{\footnotesize forward()} method
is first called one final time to catch the corner case in which a
dependent uop was dispatched while producer uop was waiting in the
\emph{ready-to-writeback} state.

As mentioned in Section \ref{sec:Issue}, for simulation purposes
only, each uop puts its result directly into its assigned physical
register at the time of issue, even though the data technically does
not appear there until writeback. This is done to simplify the simulator
implementation; it is assumed that any data {}``read'' from physical
registers before writeback is in fact being read from the bypass network
instead. Therefore, no actual data movement occurs in the \texttt{\footnotesize writeback()}
function; its sole purpose is to place the uop's physical register
into the written state (via the \texttt{\footnotesize PhysicalRegister::writeback()}
method) and to move the ROB into its terminal state, \emph{ready-to-commit}.


\chapter{\label{sec:CommitStage}Commitment}


\section{Introduction}

The commit stage examines uops from the head of the ROB, blocks until
all uops comprising a given x86 instruction are ready to commit, commits
the results of those uops to the architectural state and finally frees
the resources associated with each uop.


\section{Atomicity of x86 instructions}

The x86 architecture specifies \emph{atomic execution} for all distinct
x86 instructions. This means that since each x86 instruction may be
comprised of multiple uops; none of these uops may commit until \emph{all}
uops in the instruction are ready to commit. In PTLsim, this is accomplished
by checking if the uop at the head of the ROB (next to commit) has
its SOM (start of macro-op) bit set. If so, the ROB is scanned forwards
from the SOM uop to the next uop in program order with its EOM (end
of macro-op) bit set. If all uops in this range are ready to commit
and exception-free, the SOM uop is allowed to commit, effectively
unlocking the ROB head pointer until the next uop with a SOM bit set
is encountered. However, any exception in any uop comprising the x86
instruction at the head of the ROB causes the pipeline to be flushed
and an exception to be taken. Similarly, external interrupts are only
acknowledged at the boundary between x86 instructions (i.e. after
the EOM uop of each instruction).


\section{Commitment}

As each uop commits, it may update several components of the architectural
state. 

Integer ALU and floating point uops obviously update their destination
architectural register (\emph{rd}). In PTLsim, this is done by simply
updating the committed register rename table (\texttt{\footnotesize commitrrt})
rather than actually copying register values. However, the old physical
register mapped to architectural register \emph{rd} will normally
become inaccessible after the Commit RRT mapping for \emph{rd} is
overwritten with the committing uop's physical register index. The
old physical register previously mapped to \emph{rd} can then be freed.
Technically physical registers allocated to intermediate uops (such
as those used to hold temporary values) can be immediately freed without
updating any Commit RRT entries, but for consistency we do not do
this.

In PTLsim, a physical register is freed by moving it to the \texttt{\footnotesize PHYSREG\_FREE}
state. Unfortunately for various reasons related to long pipelines
and the renaming of x86 flags, register reclamation is not so simple,
but this will be discussed below in Section \ref{sub:PhysicalRegisterRecyclingComplications}.

Some uops may also commit to a subset of the x86 flags, as specified
in the uop encoding. For these uops, in theory no rename tables need
updating, since the flags can be directly masked into the \texttt{\footnotesize REG\_flags}
architectural pseudo-register. Should the pipeline be flushed, the
rename table entries for the ZAPS, CF, OF flag sets will all be reset
to point to the \texttt{\footnotesize REG\_flags} pseudo-register
anyway. However, for the speculation recovery scheme described in
Section \ref{sec:SpeculationRecovery}, the \texttt{\footnotesize REG\_zf},
\texttt{\footnotesize REG\_cf}, and \texttt{\footnotesize REG\_of}
commit RRT entries are updated as well to match the updates done to
the speculative RRT.

Branches and jumps update the \texttt{\footnotesize REG\_rip} pseudo
architectural register, while all other uops simply increment \texttt{\footnotesize REG\_rip}
by the number of bytes in the x86 instruction being committed. The
number of bytes (1-15) is stored in a 4-bit \texttt{\footnotesize bytes}
field of each uop in each x86 instruction.

Stores commit to the architectural state by writing directly to the
data cache, which in PTLsim is equivalent to writing into real physical
memory. Remember that a series of stores into a given 64-bit chunk
of memory are merged within the store queue to the store uop's corresponding
STQ entry as the store uop issues, so the commit unit always writes
64 bits to the cache at a time. The byte mask associated with the
STQ entry of the store uop is used to only update the modified bytes
in each chunk of memory in program order.


\section{Additional Commit Actions for Full System Use}

In full system PTLsim/X, several additional actions must be taken
at commit time:

\begin{itemize}
\item Self modifying code checks must be done using \texttt{\footnotesize smc\_isdirty(mfn)},
as described in Section \ref{sec:SelfModifyingCode}.
\item Stores must set the dirty bit on the target physical page, using the
\texttt{\footnotesize smc\_setdirty(mfn)} function (so as to properly
notify subsequent instructions of self modifying code).
\item The x86 page table accessed and dirty bits must be updated whenever
a load or store commits, using the \texttt{\footnotesize Context.update\_pte\_acc\_dirty()}
function.
\item If an interrupt is pending, and we have just committed the last uop
in an atomic x86 instruction, we can now safely service it.
\end{itemize}

\section{\label{sub:PhysicalRegisterRecyclingComplications}Physical Register
Recycling Complications}


\subsection{Problem Scenarios}

In some processor designs, it is not always possible to immediately
free the physical register mapped to a given architectural register
when that old architectural register mapping is overwritten during
commit as described above. Out of order x86 processors must maintain
three separate rename table entries for the ZAPS, CF, OF flags in
addition to the register rename table entry, any or all of which may
be updated when uops rename and retire, depending on the uop's flag
renaming semantics (see Section \ref{sub:FlagsManagement}), For this
reason, even though a given physical register value may become inaccessible
and hence dead at commit time, the flags associated with that physical
register are frequently still referenced within the pipeline, so the
physical register itself must remain allocated.

Consider the following specific example, with uops listed in program
order:

\begin{itemize}
\item \texttt{\footnotesize sub rax = rax,rbx}\\
Assign RRT{[}\texttt{\footnotesize rax}] = phys reg r0\\
Assign RRT{[}\texttt{\footnotesize flags}] = \emph{r0} (since SUB
all updates flags)
\item \texttt{\footnotesize mov rax = rcx}\\
Assign RRT{[}\texttt{\footnotesize rax}] = phys reg r1\\
\emph{No flags renamed:} MOV never updates flags, so RRT{[}\texttt{\small flags}]
is still \emph{r0}.
\item \texttt{\footnotesize br.e target}\\
Depends on flags attached to \emph{r0}, even though actual architectural
register (\texttt{\footnotesize rax}) for \emph{r0} has already been
overwritten in the commit RRT by the MOV's commit. We cannot free
\emph{r0} since the BR uop might not have issued yet.
\end{itemize}
This situation only happens with instruction sets like x86 (and SPARC
or even PowerPC to some extent) which support writing flags (particularly
multiple independent flags) and data in a single instruction.


\subsection{Reference Counting}

For these reasons, we need to prevent U2's register from being freed
if it is still referenced by anything still in the pipeline; the normal
reorder buffer mechanism cannot always handle this situation in a
very long pipeline.

One solution (the one used by PTLsim) is to give each physical register
a reference counter. Physical registers can be referenced from three
structures: as operands to ROBs, from the speculative RRT, and from
the committed RRT. As each uop operand is renamed, the counter for
the corresponding physical register is incremented by calling the
\texttt{\footnotesize PhysicalRegister::addref()} method. As each
uop commits, the counter for each of its operands is decremented via
the \texttt{\footnotesize PhysicalRegister::unref()}{\footnotesize{}
}method. Similarly, \texttt{\footnotesize unref()}{\footnotesize{}
}and \texttt{\footnotesize addref()}{\footnotesize{} }are used whenever
an entry in the speculative RRT or commit RRT is updated. During mis-speculation
recovery (see Section \ref{sec:SpeculationRecovery}), \texttt{\footnotesize unref()}{\footnotesize{}
}is also used to unlock the operands of uops slated for annulment.
Finally, \texttt{\footnotesize unref()}{\footnotesize{} }and \texttt{\footnotesize addref()}{\footnotesize{}
}are used when loads and stores need to add a new dependency on a
waiting store queue entry (see Sections \ref{sec:IssuingLoads} and
\ref{sec:SplitPhaseStores}).

As we update the committed RRT during the commit stage, the old register
R mapped to the destination architectural register A of the uop being
committed is examined. The register R is only moved to the \emph{free}
state iff its reference counter is zero. Otherwise, it is moved to
the \emph{pendingfree} state. The hardware examines the counters of
\emph{pendingfree} physical registers every cycle and moves physical
registers to the \emph{free} state only when their counters become
zero and they are in the \emph{pendingfree} state.


\subsection{Hardware Implementation}

The hardware implementation of this scheme is straightforward and
low complexity. The counters can have a very small number of bits
since it is very unlikely a given physical register would be referenced
by all 100+ uops in the ROB; 3 bits should be enough to handle the
typical maximum of < 8 uops sharing a given operand. Counter overflows
can simply stall renaming or flush the pipeline since they are so
rare.

The counter table can be updated in bulk each cycle by adding/subtracting
the appropriate sum or just adding zero if the corresponding register
wasn't used. Since there are several stages between renaming and commit,
the same counter is never both incremented and decremented in the
same cycle, so race conditions are not an issue. 

In real processors, the Pentium 4 uses a scheme similar to this one
but uses bit vectors instead. For smaller physical register files,
this may be a better solution. Each physical register has a bit vector
with one bit per ROB entry. If a given physical register P is still
used by ROB entry E in the pipeline, P's bit vector bit R is set.
Register P cannot be freed until all bits in its vector are zero.


\section{\label{sec:PipelineFlushesAndBarriers}Pipeline Flushes and Barriers}

In some cases, the entire pipeline must be empty after a given uop
commits. For instance, a \emph{barrier} uop, represented by any \texttt{\footnotesize br.p}
(branch private) uop, will stall the frontend when first renamed,
and when committed (at which point it is the only uop in the pipeline),
it will call \texttt{\footnotesize flush\_pipeline()}{\footnotesize{}
}to restart fetching at the appropriate RIP. Exceptions have a similar
effect when they reach the commit stage. After doing this, the current
architectural registers must be copied into the externally visible
\texttt{\footnotesize ctx.commitarf{[}]} array, since normally the
architectural registers are scattered throughout the physical register
file. Fortunately, the commit stage also updates \texttt{\footnotesize ctx.commitarf{[}]}
in parallel with the commit RRT, even though the \texttt{\small commitarf}
array is never actually read by the out of order core. Interrupts
are a special case of barriers, the difference being they can be serviced
after \emph{any} x86 instruction commits its last uop.

At this point, the \texttt{\footnotesize handle\_barrier()}, \texttt{\footnotesize handle\_exception()}
or \texttt{\footnotesize handle\_interrupt()} function is called to
actually communicate with the world outside the out of order core.
In the case of \texttt{\footnotesize handle\_barrier()}, generally
this involves executing native code inside PTLsim to redirect execution
into or out of the kernel, or to service a very complex x86 instruction
(e.g. \texttt{\footnotesize cpuid}, floating point save or restore,
etc). For \texttt{\footnotesize handle\_exception()}, on userspace-only
PTLsim, the simulation is stopped and the user is notified that a
genuine user visible (non-speculative) exception reached the commit
stage. In contrast, on full system PTLsim/X, exceptions are little
more than jumps into kernel space; this is described in detail in
Chapter \ref{sec:PTLsimXArchitectureDetails}.

If execution can continue after handling the barrier or exception,
the \texttt{\footnotesize external\_to\_core\_state()} function is
called to completely reset the out of order core using the state stored
in \texttt{\footnotesize ctx.commitarf{[}]}. This involves allocating
a fixed physical register for each of the 64 architectural registers
in \texttt{\footnotesize ctx.commitarf{[}]}, setting the speculative
and committed rename tables to their proper cold start values, and
resetting all reference counts on physical registers as appropriate.
If the processor is configured with multiple physical register files
(Section \ref{sec:PhysicalRegisters}), the initial physical register
for each architectural register is allocated in the first physical
register file only (this is configurable by modifying \texttt{\footnotesize external\_to\_core\_state()}).
At this point, the main simulation loop can resume as if the processor
had just restarted from scratch.


\chapter{\label{sec:CacheHierarchy}Cache Hierarchy}

The PTLsim cache hierarchy model is highly flexible and can be used
to model a wide variety of contemporary cache structures. The cache
subsystem (defined in \texttt{\small dcache.h} and implemented by
\texttt{\small dcache.cpp}) by default consists of four levels:

\begin{itemize}
\item \textbf{L1 data cache} is directly probed by all loads and stores
\item \textbf{L1 instruction cache} services all instruction fetches
\item \textbf{L2 cache} is shared between data and instructions, with data
paths to both L1 caches
\item \textbf{L3 cache} is also shared and is optionally present
\item \textbf{Main memory} is considered infinite in size but still has
configurable characteristics
\end{itemize}
These cache levels are listed in order from highest level (closer
to the core) to lowest level (far away). The cache hierarchy is assumed
to be \emph{inclusive}, i.e. any data in higher levels is assumed
to always be present in lower levels. Additionally, the cache levels
are generally \emph{write-through}, meaning that every store updates
all cache levels, rather than waiting for a dirty line to be evicted.
PTLsim supports a 48-bit virtual address space and 40-bit physical
addresses (full system PTLsim/X only) in accordance with the x86-64
standard.


\section{General Configurable Parameters}

All caches support configuration of:

\begin{itemize}
\item Line size in bytes. Any power of two size is acceptable, however the
line size of a lower cache level must be the same or larger than any
line size of a higher level cache. For example, it is illegal to have
128 byte L1 lines with 64 byte L2 lines.
\item Set count may be any power of two number. The total cache size in
bytes is of course (line size) $\times$ (set count)$\times$ (way
count)
\item Way count (associativity) may be any number from 1 (direct mapped)
up to the set count (fully associative). Note that simulation performance
(and clock speed in a real processor) will suffer if the associativity
is too great, particularly for L1 caches.
\item Latency in cycles from a load request to the arrival of the data.
\end{itemize}
In \texttt{\footnotesize dcache.h}, the two base classes \texttt{\small CacheLine}
and \texttt{\small CacheLineWithValidMask} are interchangeable, depending
on the model being used. The \texttt{\small CacheLine} class is a
standard cache line with no actual data (since the bytes in each line
are simply held in memory for simulation purposes). 

The \texttt{\small CacheLineWithValidMask} class adds a bitmask specifying
which bytes within the cache line contain valid data and which are
unknown. This is useful for implementing {}``no stall on store''
semantics, in which stores simply allocate a new way in the appropriate
set but only set the valid bits for those bytes actually modified
by the store. The rest of the cache line not touched by the store
can be brought in later without stalling the processor (unless a load
tries to access them); this is PTLsim's default model. Additionally,
this technique may be used to implement sectored cache lines, in which
the line fill bus is smaller than the cache line size. This means
that groups of bytes within the line may be filled over subsequent
cycles rather than all at once.

The \texttt{\small AssociativeArray} template class in \texttt{\small logic.h}
forms the basis of all caches in PTLsim. To construct a cache in which
specific lines can be locked into place, the \texttt{\small LockableAssociativeArray}
template class may be used instead. Finally, the \texttt{\small CommitRollbackCache}
template class is useful for creating versions of PTLsim with cache
level commit/rollback support for out of order commit, fault recovery
and advanced speculation techniques.

The various caches are defined in \texttt{\small dcache.h} by specializations
of these template classes. The classes are \texttt{\small L1Cache},
\texttt{\small L1ICache}, \texttt{\small L2Cache} and \texttt{\small L3Cache}.


\section{\label{sec:InitiatingCacheMiss}Initiating a Cache Miss}

As described in Section \ref{sec:IssuingLoads}, in the out of order
core model, the \texttt{\small issueload()} function determines if
some combination of a prior store's forwarded bytes (if any) and data
present in the L1 cache can fulfill a load. If not, this is a miss
and lower cache levels must be accessed. In this case, a \texttt{\small LoadStoreInfo}
structure (defined in \texttt{\small dcache.h}) is prepared with various
metadata about the load, including which ROB entry and physical register
to wake up when the load arrives, its size, alignment, sign extension
properties, prefetch properties and so on. The \texttt{\small issueload\_slowpath()}
function (defined in \texttt{\small dcache.cpp}) is then called with
this information, the physical address to load and any data inherited
from a prior store still in the pipeline. The \texttt{\small issueload\_slowpath()}
function moves the load request out of the core pipeline and into
the cache hierarchy. 

The \emph{Load Fill Request Queue} (LFRQ) is a structure used to hold
information about any outstanding loads that have missed any cache
level. The LFRQ allows a configurable number of loads to be outstanding
at any time and provides a central control point between cache lines
arriving from the L2 cache or lower levels and the movement of the
requested load data into the processor core to dependent instructions.
The \texttt{\small LoadFillReq} structure, prepared by \texttt{\small issueload\_slowpath()},
contains all the data needed to return a filled load to the core:
the physical address of the load, the data and bytemask already known
so far (e.g. forwarded from a prior store) and the \texttt{\small LoadStoreInfo}
metadata described above.

The \emph{Miss Buffer} (MB) tracks all outstanding cache lines, rather
than individual loads. Each MB slot uses a bitmap to track one or
more LFRQ entries that need to be awakened when the missing cache
line arrives. After adding the newly created \texttt{\small LoadFillReq}
entry to the LFRQ, the \texttt{\small MissBuffer::initiate\_miss()}
method uses the missing line's physical address to allocate a new
slot in the miss buffer array (or simply uses an existing slot if
a miss was already in progress on a given line). In any case, the
MB's wakeup bitmap is updated to reflect the new LFRQ entry referring
to that line. Each MB entry contains a \texttt{\small cycles} field,
indicating the number of cycles remaining for that miss buffer before
it can be moved up the cache hierarchy until it reaches the core.
Each entry also contains two bits (\texttt{\small icache} and \texttt{\small dcache})
indicating which L1 caches to which the line should eventually be
delivered; this is required because a single L2 line (and corresponding
miss buffer) may be referenced by both the L1 data and instruction
caches. 

In \texttt{\small initiate\_miss()}, the L2 and L3 caches are probed
to see if they contain the required line. If the L2 has the line,
the miss buffer is placed into the \texttt{\small STATE\_DELIVER\_TO\_L1}
state, indicating that the line is now in progress to the L1 cache.
Similarly, an L2 miss but L3 hit results in the \texttt{\small STATE\_DELIVER\_TO\_L2}
state, and a miss all the way to main memory results in \texttt{\small STATE\_DELIVER\_TO\_L3}.

In the very unlikely event that either the LFRQ slot or miss buffer
are full, an exception is returned to out of order core, which typically
replays the affected load until space in these structures becomes
available. For prefetch requests, only a miss buffer is allocated;
no LFRQ slot is needed.


\section{\label{sec:FillingCacheMiss}Filling a Cache Miss}

The \texttt{\small MissBuffer::clock()} method implements all synchronous
state transitions. For each active miss buffer, the \texttt{\small cycles}
counter is decremented, and if it becomes zero, the MB's current state
is examined. If a given miss buffer was in the \texttt{\small STATE\_DELIVER\_TO\_L3}
state (i.e. in progress from main memory) and the cycle counter just
became zero, a line in the L3 cache is validated with the incoming
data (this may involve evicting another line in the same set to make
room). The MB is then moved to the next state up the cache hierarchy
(i.e. \texttt{\small STATE\_DELIVER\_TO\_L2} in this example) and
its cycles field is updated with the latency of the cache level it
is now leaving (e.g. \texttt{\small L3\_LATENCY} in this example). 

This process continues with successive levels until the MB is in the
\texttt{\small STATE\_DELIVER\_TO\_L1} state and its cycles field
has been decremented to zero. If the MB's \texttt{\small dcache} bit
is set, the L1 corresponding line is validated and the \texttt{\small lfrq.wakeup()}
method is called to invoke a new state machine to wake up any loads
waiting on the recently filled line (as known from the MB's \texttt{\small lfrqmap}
bitmap). If the MB's \texttt{\small icache} bit was set, the line
is validated in the L1 instruction cache, and the \texttt{\footnotesize PerCoreCacheCallbacks::icache\_wakeup()}
callback is used to notify the out of order core's fetch stage that
it may probe the cache for the missing line again. In any case, the
miss buffer is then returned to the unused state.

Each LFRQ slot can be in one of three states: \emph{free}, \emph{waiting}
and \emph{ready}. LFRQ slots remain in the \emph{waiting} state as
long as they are referenced by a miss buffer; once the \texttt{\small lfrq.wakeup()}
method is called, all slots affiliated with that miss buffer are moved
to the \emph{ready} state. The \texttt{\small LoadFillRequestQueue::clock()}
method finds up to \texttt{\small MAX\_WAKEUPS\_PER\_CYCLE} LFRQ slots
in the \emph{ready} state and wakes them up by calling the \texttt{\footnotesize PerCoreCacheCallbacks::dcache\_wakeup()}
callback with the saved \texttt{\small LoadStoreInfo} metadata. The
out of order core handles this callback as described in Section \ref{sec:CacheMissHandling}.

For simulation purposes only, the value to be loaded is immediately
recorded as soon as the load issues, independent of the cache hit
or miss status. In real hardware, the LFRQ entry data would be used
to extract the correct bytes from the newly arrived line and perform
sign extension and alignment. If the original load required bytes
from a mixture of its source store buffer and the data cache, the
SFR data and mask fields in the LFRQ entry would be used to perform
this merging operation. The data would then be written into the physical
register specified by the \texttt{\small LoadStoreInfo} metadata and
that register would be marked as ready before sending a signal to
the issue queues to wake up dependent operations.

In some cases, the out of order core may need to annul speculatively
executed loads. The cache subsystem is notified of this through the
\texttt{\small annul\_lfrq\_slot()} function called by the core. This
function clears the specified LFRQ slot in each miss buffer's lfrqmap
entry (since that slot should no longer be awakened now that it has
been annulled), and frees the LFRQ entry itself.


\section{\label{sec:TranslationLookasideBuffers}Translation Lookaside Buffers}

The following section applies to full system PTLsim/X only. The userspace
version of PTLsim does not model TLBs since doing so would be inaccurate:
it is physically impossible to model TLB miss delays without actually
walking real page tables and encountering the associated cache misses.
For more information, please see Section \ref{sub:FullSystemPageTranslation}
concerning page translation in PTLsim/X.


\chapter{\label{sec:BranchPrediction}Branch Prediction}


\section{Introduction}

PTLsim provides a variety of branch predictors in \texttt{\small branchpred.cpp}.
The branch prediction subsystem is relatively independent of the core
simulator and can be treated as a black box, so long as it implements
the interfaces in \texttt{\small branchpred.h}.

The branch prediction subsystem always contains at least three distinct
predictors for the three main classes of branches:

\begin{itemize}
\item \emph{Conditional Branch Predictor} returns a boolean (taken or not
taken) for each conditional branch (\texttt{\small br.cc} uop)
\item \emph{Branch Target Buffer} (BTB) predicts indirect branch (\texttt{\small jmp}
uop) targets
\item \emph{Return Address Stack} (RAS) predicts return instructions (i.e.
specially marked indirect \texttt{\small jmp} uops) based on prior
calls
\item Unconditional branches (\texttt{\small bru}) are never predicted since
their destination is explicitly encoded.
\end{itemize}
All these predictors are accessed by the core through the \texttt{\small BranchPredictorInterface}
object. Based on the opcode and other uop information, the core determines
the type flags of each branch uop:

\begin{itemize}
\item \texttt{\small BRANCH\_HINT\_UNCOND} for unconditional branches. These
are never predicted since the destination is implied.
\item \texttt{\small BRANCH\_HINT\_COND} for conditional branches.
\item \texttt{\small BRANCH\_HINT\_INDIRECT} for indirect branches, including
returns.
\item \texttt{\small BRANCH\_HINT\_CALL} for calls (both direct and indirect).
This implies that the return address of the call should be a should
be pushed on the RAS.
\item \texttt{\small BRANCH\_HINT\_RET} for returns (indirect branches).
This implies that the return address should be taken from the top
RAS stack entry, not the BTB.
\end{itemize}
Multiple flags may be present for each uop (for instance, \texttt{\small BRANCH\_HINT\_RET}
and \texttt{\small BRANCH\_HINT\_INDIRECT} are both used for the \texttt{\small jmp}
uop terminating an x86 \texttt{\small ret} instruction).

To make a prediction at fetch time, the core calls the \texttt{\small BranchPredictorInterface::predict()}
method, passing it a \texttt{\small PredictorUpdate} structure. This
structure is carried along with each uop until it retires, and contains
all the information needed to eventually update the branch predictor
at the end of the pipeline. The contents will vary depending on the
predictor chosen, but in general this structure contains pointers
into internal predictor counter tables and various flags. The \texttt{\small predict()}{\small{}
}method fills in this structure.

As each uop commits, the \texttt{\small BranchPredictorInterface::update()}
method is passed the uop's saved \texttt{\small PredictorUpdate} structure
and the branch outcome (expected target RIP versus real target RIP)
so the branch predictor can be updated. In PTLsim, predictor updates
only occur at retirement to avoid corruption caused by speculative
instructions.


\section{Conditional Branch Predictor}

The PTLsim conditional branch predictor is the most flexible predictor,
since it can be easily replaced. The default predictor implemented
in \texttt{\small branchpred.cpp} is a selection based predictor.
In essence, two separate predictors are maintained. The \emph{history
predictor} hashes a shift register of previously predicted branches
into a table slot; this slot returns whether or not the branch with
that history is predicted as taken. PTLsim supports various combinations
of the history and branch address to provide \emph{gshare} based semantics.
The \emph{bimodal predictor} is simpler; it uses 2-bit saturating
counters to predict if a given branch is likely to be taken. Finally,
a \emph{selection predictor} specifies which of the two predictors
is more accurate and should be used for future predictions. This style
of predictor, sometimes called a \emph{McFarling predictor}, has been
described extensively in the literature and variations are used by
most modern processors.

Through the \texttt{\small CombinedPredictor} template class, the
user can specify the sizes of all the tables (history, bimodal, selector),
the history depth, the method in which the global history and branch
address are combined and so on. Alternatively, the conditional branch
predictor can be replaced with something entirely different if desired.


\section{Branch Target Buffer}

The Branch Target Buffer (BTB) is essentially a small cache that maps
indirect branch RIP addresses (i.e., \texttt{\small jmp} uops) into
predicted target RIP addresses. It is set associative, with a user
configurable number of sets and ways. In PTLsim, the BTB does not
take into account any indirect branch history information. The BTB
is a nearly universal structure in branch prediction; see the literature
for more information.


\section{Return Address Stack}

The Return Address Stack (RAS) predicts the target address of indirect
jumps marked with the \texttt{\small BRANCH\_HINT\_RET} flag. Whenever
the \texttt{\small BRANCH\_HINT\_RET} flag is passed to the predict()
method, the top RAS stack entry is returned as the predicted target,
overriding anything in the BTB.

Unlike the conditional branch predictor and BTB, the RAS updated speculatively
in the frontend pipeline, before the outcome of calls and returns
are known. This allows better performance when closely spaced calls
and returns must be predicted as they are fetched, before either the
call or corresponding return have actually executed. However, when
called with the \texttt{\small BRANCH\_HINT\_RET} flag, the \texttt{\small predict()}
method only returns the RIP at the top of the RAS, but does not push
or pop the RAS. This must be done after the corresponding \texttt{\small bru}
or \texttt{\small jmp} (for direct and~or indirect calls, respectively)
or \texttt{\small jmp} (for returns) uop is actually allocated in
the ROB. 

This approach is required since the RAS is speculatively updated:
if uops must be annulled (because of branch mispredictions or mis-speculations),
the annulment occurs by walking backwards in the ROB until the excepting
uop is encountered. However, if the RAS were updated during the fetch
stage, some uops may not be in the ROB yet and hence the rollback
logic cannot undo speculative changes made to the RAS by these uops.
This causes the RAS to get out of alignment and performance suffers.

To solve this problem, the RAS is only updated in the allocate stage
immediately after fetch. In the out of order core's \texttt{\small rename()}
function, the \texttt{\small BranchPredictorInterface::updateras()}
method is called to either push or pop an entry from the RAS (calls
push entries, returns pop entries). Unlike the conditional branch
predictor and BTB, this is the only place the RAS is updated, rather
than performing updates at commit time.

If uops must be annulled, the \texttt{\small ReorderBufferEntry::annul()}
method calls the \texttt{\small BranchPredictorInterface::annulras()}
method with the \texttt{\small PredictorUpdate} structure for each
uop it encounters in reverse program order. This method effectively
undoes whatever change was made to the RAS when the \texttt{\small updateras()}
method was called with the same \texttt{\small PredictorUpdate} information
during renaming and allocation. This is possible because \texttt{\small updateras()}
saves checkpoint information (namely, the old RAS top of stack and
the value at that stack slot) before updating the RAS; this allows
the RAS state to be rolled backwards in time as uops are annulled
in reverse program order. At the end of the annulment process when
fetching is restarted at the correct RIP, the RAS state should be
identical to the state that existed before the last uop to be annulled
was originally fetched.


\part{\label{part:Appendices}Appendices}


\chapter{\label{sec:UopReference}PTLsim uop Reference}

The following sections document the semantics and encoding of each
micro-operation (uop) supported by the PTLsim processor core. The
\texttt{\small opinfo{[}]} table in \texttt{\small ptlhwdef.cpp} and
constants in \texttt{\small ptlhwdef.h} give actual numerical values
for the opcodes and other fields described below.

\newpage{}\texttt{\textbf{\large ~}}\textsf{}\\
\textsf{\Large Merging Rules}{\Large \lyxline{\Large}}{\Large \par}

\begin{tabular}{lll}
\textsf{\textbf{Mnemonic}} & \textsf{\textbf{Syntax}} & \textsf{\textbf{Operation}}\tabularnewline
\hline
\texttt{\textbf{\small op}} & \texttt{\small rd = ra,rb} & \textsf{rd = ra $\leftarrow$ (ra} \textsf{\emph{op}} \textsf{rb)}\tabularnewline
\end{tabular}

\medskip{}
\textsf{\textbf{Merging Rules:}}

\textsf{The x86 compatible ALUs implement operations on 1, 2, 4 or
8 byte quantities. Unless otherwise indicated, all operations take
a 2-bit size shift field (}\texttt{\small sz}\textsf{) used to determine
the effective size in bytes of the operation as follows:}

\begin{itemize}
\item \textsf{\textbf{sz = 0:}} \textsf{Low byte of} \textsf{\emph{rd}}
\textsf{is set to the 8-bit result; high 7 bytes of} \textsf{\emph{rd}}
\textsf{are set to corresponding bytes of} \textsf{\emph{ra}}\textsf{.}
\item \textsf{\textbf{sz = 1:}} \textsf{Low two bytes of} \textsf{\emph{rd}}
\textsf{is set to the 16-bit result; high 6 bytes of} \textsf{\emph{rd}}
\textsf{are set to corresponding bytes of} \textsf{\emph{ra}}\textsf{.}
\item \textsf{\textbf{sz = 2:}} \textsf{Low four bytes of} \textsf{\emph{rd}}
\textsf{is set to the 32-bit result; high 4 bytes of} \textsf{\emph{rd}}
\textsf{are cleared to zero in accordance with x86-64 zero extension
semantics. The} \textsf{\emph{ra}} \textsf{operand is unused and should
be} \texttt{\small REG\_zero}\textsf{.}
\item \textsf{\textbf{sz = 3:}} \textsf{All 8 bytes of} \textsf{\emph{rd}}
\textsf{are set to the 64-bit result.} \textsf{\emph{ra}} \textsf{is
unused and should be} \texttt{\small REG\_zero}\textsf{.}
\end{itemize}
\textsf{Flags are calculated based on the} \textsf{\emph{sz}}\textsf{-byte
value produced by the ALU, not the final 64-bit result in} \textsf{\emph{rd}}\textsf{.}

\bigskip{}
\textsf{\Large Other Pseudo-Operators}{\Large \lyxline{\Large}}{\Large \par}

\medskip{}
\textsf{The descriptions in this reference use various pseudo-operators
to describe the semantics of each uop. These operators are described
below.}

\textsf{\textbf{EvalFlags(}}\textsf{\textbf{\emph{ra}}}\textsf{\textbf{)}}

\textsf{The} \textsf{\emph{EvalFlags}} \textsf{pseudo-operator evaluates
the ZAPS, CF, OF flags attached to the source operand} \textsf{\emph{ra}}
\textsf{in accordance with the type of condition code evaluation specified
by the uop. The operator returns 1 if the evaluation is true; otherwise
0 is returned.}

\textsf{\textbf{SignExt(}}\textsf{\textbf{\emph{ra}}}\textsf{\textbf{,
N)}}

\textsf{The} \textsf{\emph{SignExt}} \textsf{operator sign extends
the ra operand by the number of bits specified by N. Specifically,
bit} \textsf{\emph{ra}}\textsf{{[}N] is copied to all high order bits
from bit 63 down to bit} \textsf{\emph{N}}\textsf{. If N is not specified,
it is assumed to mean the number of bits in the effective size of
the uop's result (as described under Merging Rules).}

\textsf{\textbf{MergeWithSFR(mem, sfr)}}

\textsf{The} \textsf{\emph{MergeWithSFR}} \textsf{pseudo-operator
is described in the reference page for load uops.}

\textsf{\textbf{MergeAlign(mem, sfr)}}

\textsf{The} \textsf{\emph{MergeAlign}} \textsf{pseudo-operator is
described in the reference page for load uops.}

\newpage{}\texttt{\textbf{\large mov and or xor andnot ornot nand
nor eqv}}\textsf{}\\
\textsf{\Large Logical Operations}{\Large \lyxline{\Large}}{\Large \par}

\begin{tabular}{llc}
\textsf{\textbf{Mnemonic}} & \textsf{\textbf{Syntax}} & \textsf{\textbf{Operation}}\tabularnewline
\hline
\texttt{\textbf{\small mov}} & \texttt{\small rd = ra,rb} & \textsf{rd = ra $\leftarrow$ rb}\tabularnewline
\texttt{\textbf{\small and}} & \texttt{\small rd = ra,rb} & \textsf{rd = ra $\leftarrow$ ra \& rb}\tabularnewline
\texttt{\textbf{\small or}} & \texttt{\small rd = ra,rb} & \textsf{rd = ra $\leftarrow$ ra | rb}\tabularnewline
\texttt{\textbf{\small xor}} & \texttt{\small rd = ra,rb} & \textsf{rd = ra $\leftarrow$ ra \textasciicircum{} rb}\tabularnewline
\texttt{\textbf{\small andnot}} & \texttt{\small rd = ra,rb} & \textsf{rd = ra $\leftarrow$ (\textasciitilde{}ra) \& rb}\tabularnewline
\texttt{\textbf{\small ornot}} & \texttt{\small rd = ra,rb} & \textsf{rd = ra $\leftarrow$ (\textasciitilde{}ra) | rb}\tabularnewline
\texttt{\textbf{\small nand}} & \texttt{\small rd = ra,rb} & \textsf{rd = ra $\leftarrow$ \textasciitilde{}(ra \& rb)}\tabularnewline
\texttt{\textbf{\small nor}} & \texttt{\small rd = ra,rb} & \textsf{rd = ra $\leftarrow$ \textasciitilde{}(ra | rb)}\tabularnewline
\texttt{\textbf{\small eqv}} & \texttt{\small rd = ra,rb} & \textsf{rd = ra $\leftarrow$ \textasciitilde{}(ra \textasciicircum{}
rb)}\tabularnewline
\end{tabular}

\medskip{}
\textsf{\textbf{Notes:}}

\begin{itemize}
\item \textsf{All operations merge the ALU result with} \textsf{\emph{ra}}
\textsf{and generate flags in accordance with the standard x86 merging
rules described previously.}
\end{itemize}
\newpage{}\texttt{\textbf{\large add sub addadd addsub subadd subsub
addm subm addc subc}}\textsf{}\\
\textsf{\Large Add and Subtract}{\Large \lyxline{\Large}}{\Large \par}

\begin{tabular}{lll}
\textsf{\textbf{Mnemonic}} & \textsf{\textbf{Syntax}} & \textsf{\textbf{Operation}}\tabularnewline
\hline
\texttt{\textbf{\small add}} & \texttt{\small rd = ra,rb} & \textsf{rd = ra $\leftarrow$ ra + rb}\tabularnewline
\texttt{\textbf{\small sub}} & \texttt{\small rd = ra,rb} & \textsf{rd = ra $\leftarrow$ ra - rb}\tabularnewline
\texttt{\textbf{\small adda}} & \texttt{\small rd = ra,rb,rc{*}S} & \textsf{rd = ra $\leftarrow$ ra + rb + (rc <\,{}< S)}\tabularnewline
\texttt{\textbf{\small adds}} & \texttt{\small rd = ra,rb,rc{*}S} & \textsf{rd = ra $\leftarrow$ ra - rb + (rc <\,{}< S)}\tabularnewline
\texttt{\textbf{\small addm}} & \texttt{\small rd = ra,rb,rc} & \textsf{rd = ra $\leftarrow$ (ra + rb) \& rc}\tabularnewline
\texttt{\textbf{\small subm}} & \texttt{\small rd = ra,rb,rc} & \textsf{rd = ra $\leftarrow$ (ra - rb) \& rc}\tabularnewline
\texttt{\textbf{\small addc}} & \texttt{\small rd = ra,rb,rc} & \textsf{rd = ra $\leftarrow$ (ra + rb) + rc.cf}\tabularnewline
\texttt{\textbf{\small subc}} & \texttt{\small rd = ra,rb,rc} & \textsf{rd = ra $\leftarrow$ (ra - rb) - rc.cf}\tabularnewline
\end{tabular}

\medskip{}
\textsf{\textbf{Notes:}}

\begin{itemize}
\item \textsf{All operations merge the ALU result with} \textsf{\emph{ra}}
\textsf{and generate flags in accordance with the standard x86 merging
rules described previously.}
\item \textsf{The} \texttt{\small adda}\textsf{ and} \texttt{\small adds}\textsf{
uops are useful for small shifts and x86 three-operand} \texttt{\small LEA}\textsf{-style
address generation.}
\item \textsf{The} \texttt{\small addc}\textsf{ and} \texttt{\small subc}\textsf{
uops use only the carry flag field of their rc operand; the value
is unused.}
\item \textsf{The} \texttt{\small addm}\textsf{ and} \texttt{\small subm}\textsf{
uops mask the result by the immediate in} \textsf{\emph{rc}}\textsf{.
They are used in microcode for modular stack arithmetic.}
\end{itemize}
\newpage{}\texttt{\textbf{\large sel}}\textsf{}\\
\textsf{\Large Conditional Select}{\Large \lyxline{\Large}}{\Large \par}

\begin{tabular}{lll}
\textsf{\textbf{Mnemonic}} & \textsf{\textbf{Syntax}} & \textsf{\textbf{Operation}}\tabularnewline
\hline
\texttt{\textbf{\small sel.}}\texttt{\textbf{\emph{\small cc}}} & \texttt{\small rd = ra,rb,(rc)} & \textsf{rd = ra $\leftarrow$ (EvalFlags(rc)) ? rb : ra}\tabularnewline
\end{tabular}

\medskip{}
\textsf{\textbf{Notes:}}

\begin{itemize}
\item \textsf{\textbf{\emph{cc}}} \textsf{is any valid condition code flag
evaluation}
\item \textsf{The} \texttt{\small sel}\textsf{ uop merges the selected operand
with} \textsf{\emph{ra}} \textsf{in accordance with the standard x86
merging rules described previously}
\item \textsf{The 64-bit result and all flags are treated as a single value
for selection purposes, i.e. the flags attached to the selected input
are passed to the output}
\item \textsf{If one of the (ra, rb) operands is not valid (has} \texttt{\small FLAG\_INV}\textsf{
set) but the selected operand is valid, the result is valid. This
is an exception to the invalid bit propagation rule only when the
selected input is valid. If the} \textsf{\emph{rc}} \textsf{operand
is invalid, the result is always invalid.}
\item \textsf{If any of the inputs are waiting (}\texttt{\small FLAG\_WAIT}\textsf{
is set), the uop does not issue, even if the selected input was ready.
This is a pipeline simplification.}
\item \textsf{set rd = (a),b}
\item \textsf{sel rd = b,0,1,c}
\end{itemize}
\newpage{}\texttt{\textbf{\large set}}\textsf{}\\
\textsf{\Large Conditional Set}{\Large \lyxline{\Large}}{\Large \par}

\begin{tabular}{lll}
\textsf{\textbf{Mnemonic}} & \textsf{\textbf{Syntax}} & \textsf{\textbf{Operation}}\tabularnewline
\hline
\texttt{\textbf{\small set.}}\texttt{\textbf{\emph{\small cc}}} & \texttt{\small rd = ra,rb,(rc)} & \textsf{rd = ra $\leftarrow$ EvalFlags(rc) ? rb : 0}\tabularnewline
\end{tabular}

\medskip{}
\textsf{\textbf{Notes:}}

\begin{itemize}
\item \textsf{\textbf{\emph{cc}}} \textsf{is any valid condition code flag
evaluation}
\item \textsf{The value 0 or 1 is zero extended to the operation size and
merged with} \textsf{\emph{rb}} \textsf{in accordance with the standard
x86 merging rules described previously (except that} \texttt{\small set}\textsf{
uses} \textsf{\emph{rb}} \textsf{as the merge target instead of} \textsf{\emph{ra}}\textsf{)}
\item \textsf{Flags attached to} \textsf{\emph{ra}} \textsf{(condition code)
are passed through to the output}
\end{itemize}
\newpage{}\texttt{\textbf{\large set.sub set.and}}\textsf{}\\
\textsf{\Large Conditional Compare and Set}{\Large{} \lyxline{\Large}}{\Large \par}

\begin{tabular}{lll}
\textsf{\textbf{Mnemonic}} & \textsf{\textbf{Syntax}} & \textsf{\textbf{Operation}}\tabularnewline
\hline
\texttt{\textbf{\small set.sub.}}\texttt{\textbf{\emph{\small cc}}} & \texttt{\small rd = ra,rb,rc} & \textsf{rd = rc $\leftarrow$ EvalFlags(ra - rb) ? 1 : 0}\tabularnewline
\texttt{\textbf{\small set.and.}}\texttt{\textbf{\emph{\small cc}}} & \texttt{\small rd = ra,rb,rc} & \textsf{rd = rc $\leftarrow$ EvalFlags(ra \& rb) ? 1 : 0}\tabularnewline
\end{tabular}

\medskip{}
\textsf{\textbf{Notes:}}

\begin{itemize}
\item \textsf{The} \texttt{\small set.sub}\textsf{ and} \texttt{\small set.and}\textsf{
uops take the place of a} \texttt{\small sub}\textsf{ or} \texttt{\small and}\textsf{
uop immediately consumed by a} \texttt{\small set}\textsf{ uop; this
is intended to shorten the critical path if uop merging is performed
by the processor}
\item \textsf{\textbf{\emph{cc}}} \textsf{is any valid condition code flag
evaluation}
\item \textsf{The value 0 or 1 is zero extended to the operation size and
then merged with} \textsf{\emph{rc}} \textsf{in accordance with the
standard x86 merging rules described previously (except that} \texttt{\small set.sub}\textsf{
and} \texttt{\small set.and}\textsf{ use} \textsf{\emph{rc}} \textsf{as
the merge target instead of} \textsf{\emph{ra}}\textsf{)}
\item \textsf{Flags generated as the result of the comparison are passed
through with the result}
\end{itemize}
\newpage{}\texttt{\textbf{\large br}}\textsf{}\\
\textsf{\Large Conditional Branch}{\Large \lyxline{\Large}}{\Large \par}

\begin{tabular}{lll}
\textsf{\textbf{Mnemonic}} & \textsf{\textbf{Syntax}} & \textsf{\textbf{Operation}}\tabularnewline
\hline
\texttt{\textbf{\small br}}\texttt{\textbf{\emph{\small .cc}}} & \texttt{\small rip = (ra,rb),riptaken,ripseq} & \textsf{rip = EvalFlags(ra) ? riptaken : ripseq}\tabularnewline
\end{tabular}

\medskip{}
\textsf{\textbf{Notes:}}

\begin{itemize}
\item \textsf{\textbf{\emph{cc}}} \textsf{is any valid condition code flag
evaluation}
\item \textsf{The} \texttt{\small rip}\textsf{ (user-visible instruction
pointer register) is reset to one of two immediates. If the flags
evaluation is true, the} \textsf{\emph{riptaken}} \textsf{immediate
is selected; otherwise the} \textsf{\emph{ripseq}} \textsf{immediate
is selected.}
\item \textsf{If the flag evaluation is false (i.e., ripseq is selected),
the} \texttt{\small BranchMispredict}\textsf{ internal exception is
raised. The processor should annul all uops after the branch and restart
fetching at the RIP specified by the result (in this case,} \textsf{\emph{ripseq}}\textsf{).}
\item \textsf{Branches are always assumed to be taken. If the branch is
predicted as not taken (i.e. future uops come from the next sequential
RIP after the branch), it is the responsibility of the decoder or
frontend to swap the} \textsf{\emph{riptaken}} \textsf{and} \textsf{\emph{ripseq}}
\textsf{immediates and invert the condition of the branch. All condition
encodings can be inverted by inverting bit 0 of the 4-bit condition
specifier.}
\item \textsf{The destination register should always be} \texttt{\small REG\_rip}\textsf{;
otherwise this uop is undefined.}
\item \textsf{If the target RIP falls within an unmapped page, not present
page or a page marked as no-execute (NX), the} \texttt{\small PageFaultOnExec}\textsf{
exception is taken.}
\item \textsf{No flags are generated by this uop}
\end{itemize}
\newpage{}\texttt{\textbf{\large br.sub br.and}}\textsf{}\\
\textsf{\Large Compare and Conditional Branch}{\Large \lyxline{\Large}}{\Large \par}

\begin{tabular}{lll}
\textsf{\textbf{Mnemonic}} & \textsf{\textbf{Syntax}} & \textsf{\textbf{Operation}}\tabularnewline
\hline
\texttt{\textbf{\small br}}\texttt{\textbf{\emph{\small .cc}}} & \texttt{\small rip = ra,rb,riptaken,ripseq} & \textsf{rip = EvalFlags(ra - rb) ? riptaken : ripseq}\tabularnewline
\texttt{\textbf{\small br}}\texttt{\textbf{\emph{\small .cc}}} & \texttt{\small rip = ra,rb,riptaken,ripseq} & \textsf{rip = EvalFlags(ra \& rb) ? riptaken : ripseq}\tabularnewline
\end{tabular}

\medskip{}
\textsf{\textbf{Notes:}}

\begin{itemize}
\item \textsf{The} \texttt{\small br.sub}\textsf{ and} \texttt{\small br.and}\textsf{
uops take the place of a} \texttt{\small sub}\textsf{ or} \texttt{\small and}\textsf{
uop immediately consumed by a} \texttt{\small br}\textsf{ uop; this
is intended to shorten the critical path if uop merging is performed
by the processor}
\item \textsf{\textbf{\emph{cc}}} \textsf{is any valid condition code flag
evaluation}
\item \textsf{The} \texttt{\small rip}\textsf{ (user-visible instruction
pointer register) is reset to one of two immediates. If the flags
evaluation is true, the} \textsf{\emph{riptaken}} \textsf{immediate
is selected; otherwise the} \textsf{\emph{ripseq}} \textsf{immediate
is selected}
\item \textsf{If the flag evaluation is false (i.e., ripseq is selected),
the} \texttt{\small BranchMispredict}\textsf{ internal exception is
raised. The processor should annul all uops after the branch and restart
fetching at the RIP specified by the result (in this case,} \textsf{\emph{ripseq}}\textsf{)}
\item \textsf{Branches are always assumed to be taken. If the branch is
predicted as not taken (i.e. future uops come from the next sequential
RIP after the branch), it is the responsibility of the decoder or
frontend to swap the} \textsf{\emph{riptaken}} \textsf{and} \textsf{\emph{ripseq}}
\textsf{immediates and invert the condition of the branch. All condition
encodings can be inverted by inverting bit 0 of the 4-bit condition
specifier.}
\item \textsf{The destination register should always be} \texttt{\small REG\_rip}\textsf{;
otherwise this uop is undefined}
\item \textsf{If the target RIP falls within an unmapped page, not present
page or a page marked as no-execute (NX), the} \texttt{\small PageFaultOnExec}\textsf{
exception is taken.}
\item \textsf{Flags generated as the result of the comparison are passed
through with the result}
\end{itemize}
\newpage{}\texttt{\textbf{\large jmp}}\textsf{}\\
\textsf{\Large Indirect Jump}{\Large \lyxline{\Large}}{\Large \par}

\begin{tabular}{lll}
\textsf{\textbf{Mnemonic}} & \textsf{\textbf{Syntax}} & \textsf{\textbf{Operation}}\tabularnewline
\hline
\texttt{\textbf{\small jmp}} & \texttt{\small rip = ra,riptaken} & \textsf{rip = ra}\tabularnewline
\end{tabular}

\medskip{}
\textsf{\textbf{Notes:}}

\begin{itemize}
\item \textsf{The} \texttt{\small rip}\textsf{ (user-visible instruction
pointer register) is reset to the target address specified by} \textsf{\emph{ra}}
\item \textsf{If the} \textsf{\emph{ra}} \textsf{operand does not match
the} \textsf{\emph{riptaken}} \textsf{immediate, the} \texttt{\small BranchMispredict}\textsf{
internal exception is raised. The processor should annul all uops
after the branch and restart fetching at the RIP specified by the
result (in this case,} \textsf{\emph{ra}}\textsf{)}
\item \textsf{Indirect jumps are always assumed to match the predicted target
in} \textsf{\emph{riptaken}}\textsf{. If some other target is predicted,
it is the responsibility of the decoder or frontend to set the} \textsf{\emph{riptaken}}
\textsf{immediate to that predicted target}
\item \textsf{The destination register should always be} \texttt{\small REG\_rip}\textsf{;
otherwise this uop is undefined}
\item \textsf{If the target RIP falls within an unmapped page, not present
page or a marked as no-execute (NX), the} \texttt{\small PageFaultOnExec}\textsf{
exception is taken.}
\item \textsf{No flags are generated by this uop}
\end{itemize}
\newpage{}\texttt{\textbf{\large jmpp}}\textsf{}\\
\textsf{\Large Indirect Jump Within Microcode}{\Large \lyxline{\Large}}{\Large \par}

\begin{tabular}{lll}
\textsf{\textbf{Mnemonic}} & \textsf{\textbf{Syntax}} & \textsf{\textbf{Operation}}\tabularnewline
\hline
\texttt{\textbf{\small jmpp}} & \texttt{\small null = ra,riptaken} & \textsf{internalrip = ra}\tabularnewline
\end{tabular}

\medskip{}
\textsf{\textbf{Notes:}}

\begin{itemize}
\item \textsf{The} \texttt{\small jmpp}\textsf{ uop redirects uop fetching
into microcode not accessible as x86 instructions. The target address
(inside PTLsim, not x86 space) is specified by} \textsf{\emph{ra}}
\item \textsf{If the} \textsf{\emph{ra}} \textsf{operand does not match
the} \textsf{\emph{riptaken}} \textsf{immediate, the} \texttt{\small BranchMispredict}\textsf{
internal exception is raised. The processor should annul all uops
after the branch and restart fetching at the RIP specified by the
result (in this case,} \textsf{\emph{ra}}\textsf{)}
\item \textsf{Indirect jumps are always assumed to match the predicted target
in} \textsf{\emph{riptaken}}\textsf{. If some other target is predicted,
it is the responsibility of the decoder or frontend to set the} \textsf{\emph{riptaken}}
\textsf{immediate to that predicted target}
\item \textsf{The destination register should always be} \texttt{\small REG\_rip}\textsf{;
otherwise this uop is undefined}
\item \textsf{The user visible rip register is not updated after this uop
issues; otherwise it would point into PTLsim space not accessible
to x86 code. Updating is resumed after a normal} \texttt{\small jmp}\textsf{
issues to return to user code. It is the responsibility of the decoder
to move the user address to return to into some temporary register
(traditionally} \texttt{\small REG\_sr2}\textsf{ but this is not required).}
\item \textsf{No flags are generated by this uop}
\end{itemize}
\newpage{}\texttt{\textbf{\large bru}}\textsf{}\\
\textsf{\Large Unconditional Branch}{\Large \lyxline{\Large}}{\Large \par}

\begin{tabular}{lll}
\textsf{\textbf{Mnemonic}} & \textsf{\textbf{Syntax}} & \textsf{\textbf{Operation}}\tabularnewline
\hline
\texttt{\textbf{\small bru}} & \texttt{\small rip = riptaken} & \textsf{rip = riptaken}\tabularnewline
\end{tabular}

\medskip{}
\textsf{\textbf{Notes:}}

\begin{itemize}
\item \textsf{The} \texttt{\small rip}\textsf{ (user-visible instruction
pointer register) is reset to the specified immediate. The processor
may redirect fetching from the new RIP}
\item \textsf{No exceptions are possible with unconditional branches}
\item \textsf{If the target RIP falls within an unmapped page, not present
page or a marked as no-execute (NX), the} \texttt{\small PageFaultOnExec}\textsf{
exception is taken.}
\item \textsf{No flags are generated by this uop}
\end{itemize}
\newpage{}\texttt{\textbf{\large brp}}\textsf{}\\
\textsf{\Large Unconditional Branch Within Microcode}{\Large \lyxline{\Large}}{\Large \par}

\begin{tabular}{lll}
\textsf{\textbf{Mnemonic}} & \textsf{\textbf{Syntax}} & \textsf{\textbf{Operation}}\tabularnewline
\hline
\texttt{\textbf{\small bru}} & \texttt{\small null = riptaken} & \textsf{internalrip = riptaken}\tabularnewline
\end{tabular}

\medskip{}
\textsf{\textbf{Notes:}}

\begin{itemize}
\item \textsf{The} \texttt{\small brp}\textsf{ uop redirects uop fetching
into microcode not accessible as x86 instructions. The target address
(inside PTLsim, not x86 space) is specified by the} \textsf{\emph{riptaken}}
\textsf{immediate}
\item \textsf{The} \texttt{\small rip}\textsf{ (user-visible instruction
pointer register) is reset to the specified} \textsf{\emph{riptaken}}
\textsf{immediate. The processor may redirect fetching from the new
RIP}
\item \textsf{No exceptions are possible with unconditional branches}
\item \textsf{The user visible rip register is not updated after this uop
issues; otherwise it would point into PTLsim space not accessible
to x86 code. Updating is resumed after a normal} \texttt{\small jmp}\textsf{
uop issues to return to user code. It is the responsibility of the
decoder to move the user address to return to into some temporary
register (traditionally} \texttt{\small REG\_sr2}\textsf{ but this
is not required).}
\item \textsf{No flags are generated by this uop}
\end{itemize}
\newpage{}\texttt{\textbf{\large chk}}\textsf{}\\
\textsf{\Large Check Speculation}{\Large \lyxline{\Large}}{\Large \par}

\begin{tabular}{lll}
\textsf{\textbf{Mnemonic}} & \textsf{\textbf{Syntax}} & \textsf{\textbf{Operation}}\tabularnewline
\hline
\texttt{\textbf{\small chk}}\texttt{\textbf{\emph{\small .cc}}} & \texttt{\small rd = ra,recrip,extype} & \textsf{rd = EvalCheck(ra) ? 0 : recrip}\tabularnewline
\end{tabular}

\medskip{}
\textsf{\textbf{Notes:}}

\begin{itemize}
\item \textsf{The} \texttt{\small chk}\textsf{ uop verifies} \textsf{\emph{certain}}
\textsf{properties about ra. If this verification check passes, no
action is taken. If the check fails,} \texttt{\small chk}\textsf{
signals an exception of the user specified type in the} \textsf{\emph{rc}}
\textsf{immediate. The result of the} \texttt{\small chk}\textsf{
uop in this case is the user specified RIP to recover at after the
check failure is handled in microcode. This recovery RIP is saved
in the} \texttt{\small recoveryrip}\textsf{ internal register.}
\item \textsf{This mechanism is intended to allow simple inlined uop sequences
to branch into microcode if certain conditions fail, since normally
inlined uop sequences cannot contain embedded branches. One example
use is in the} \texttt{\small REP}\textsf{ series of instructions
to ensure that the count is not zero on entry (a special corner case).}
\item \textsf{Unlike most conditional uops, the} \texttt{\small chk}\textsf{
uop directly checks the numerical value of} \textsf{\emph{ra}} \textsf{against
zero, and ignores any attached flags. Therefore, the} \textsf{\textbf{\emph{cc}}}
\textsf{condition code flag evaluation type is restricted to the subset
(e, ne, be, nbe, l, nl, le, nle).}
\item \textsf{No flags are generated by this uop}
\end{itemize}
\newpage{}\texttt{\textbf{\large ld ld.lo ld.hi ldx ldx.lo ldx.hi}}\textsf{}\\
\textsf{\Large Load}{\Large \lyxline{\Large}}{\Large \par}

\begin{tabular}{llp{0.5\columnwidth}}
\textsf{\textbf{Mnemonic}} & \textsf{\textbf{Syntax}} & \textsf{\textbf{Operation}}\tabularnewline
\hline
\texttt{\textbf{\small ld}} & \texttt{\small rd = {[}ra,rb],sfra} & \textsf{rd = MergeWithSFR(mem{[}ra + rb], sfra)}\tabularnewline
\texttt{\textbf{\small ld.lo}} & \texttt{\small rd = {[}ra+rb],sfra} & \textsf{rd = MergeWithSFR(mem{[}floor(ra + rb), 8], sfra)}\tabularnewline
\texttt{\textbf{\small ld.hi}} & \texttt{\small rd = {[}ra+rb],rc,sfra} & \parbox[t]{0.5\columnwidth}{\textsf{rd = MergeAlign(}\\
\textsf{~~MergeWithSFR(mem{[}(floor(ra + rb), 8) + 8], sfra), rc)}}\tabularnewline
\end{tabular}

\medskip{}
\textsf{\textbf{Notes:}}

\begin{itemize}
\item \textsf{\emph{The PTLsim load unit model is described in substantial
detail in Section \ref{sec:IssuingLoads}; this section only gives
an overview of the load uop semantics.}}
\item \textsf{The} \texttt{\small ld}\textsf{ family of uops loads values
from the virtual address specified by the sum} \textsf{\emph{ra}}
\textsf{+} \textsf{\emph{rb}}\textsf{. The} \texttt{\small ld}\textsf{
form zero extends the loaded value, while the} \texttt{\small ldx}\textsf{
form sign extends the loaded value to 64 bits.}
\item \textsf{All values are zero or sign extended to 64 bits; no subword
merging takes place as with ALU uops. The decoder is responsible for
following the load with an explicit} \texttt{\small mov}\textsf{ uop
to merge 8-bit and 16-bit loads with their old destination register.}
\item \textsf{The} \textsf{\emph{sfra}} \textsf{operand specifies the store
forwarding register (a.k.a. store buffer) to merge with data from
the cache to form the final result. The inherited SFR may be determined
dynamically by querying a store queue or can be predicted statically.}
\item \textsf{If the load misses the cache, the} \texttt{\small FLAG\_WAIT}\textsf{
flag of the result is set.}
\item \textsf{Load uops do not generate any other condition code flags}
\end{itemize}
\textsf{\textbf{Unaligned Load Support:}}

\begin{itemize}
\item \textsf{The processor supports unaligned loads via a pair of} \texttt{\small ld.lo}\textsf{
and} \texttt{\small ld.hi}\textsf{ uops; an overview can be found
in Section \ref{sub:UnalignedLoadsAndStores}. The alignment type
of the load is stored in the uop's cond field (0 =} \texttt{\small ld}\textsf{,
1 =} \texttt{\small ld.lo}\textsf{, 2 =} \texttt{\small ld.hi}\textsf{).}
\item \textsf{The} \texttt{\small ld.lo}\textsf{ uop rounds down its effective
address $\left\lfloor ra+rb\right\rfloor $ to the nearest 64-bit
boundary and performs the load. The} \texttt{\small ld.hi}\textsf{
uop rounds $\left\lceil ra+rb+8\right\rceil $ up to the next 64-bit
boundary, performs a load at that address, then takes as its third
rc operand the first (}\texttt{\small ld.lo}\textsf{) load's result.
The two loads are concatenated into a 128-bit word and the final unaligned
data is extracted (and sign extended if the} \texttt{\small ldx}\textsf{
form was used). }
\item \textsf{Special corner case for when the actual user address (}\textsf{\emph{ra}}
\textsf{+} \textsf{\emph{rb}}\textsf{) did not actually require any
bytes in the 8-byte range loaded by the} \texttt{\small ld.hi}\textsf{
uop (i.e. the load was contained entirely within the low 64-bit aligned
chunk). Since it is perfectly legal to do an unaligned load to the
very end of the page such that the next 64 bit chunk is not mapped
to a valid page, the} \texttt{\small ld.hi}\textsf{ uop does not actually
access memory; the entire result is extracted from the prior} \texttt{\small ld.lo}\textsf{
result in the} \textsf{\emph{rc}} \textsf{operand.}
\end{itemize}
\textsf{\textbf{Exceptions:}}

\begin{itemize}
\item \texttt{\small UnalignedAccess}\textsf{ if the address (}\textsf{\emph{ra}}
\textsf{+} \textsf{\emph{rb}}\textsf{) is not aligned to an integral
multiple of the size in bytes of the load. Unaligned loads (}\texttt{\small ld.lo}\textsf{
and} \texttt{\small ld.hi}\textsf{) do not generate this exception.
Since x86 automatically corrects alignment problems, microcode must
handle this exception as described in Section \ref{sub:UnalignedLoadsAndStores}.}
\item \texttt{\small PageFaultOnRead}\textsf{ if the virtual address (}\textsf{\emph{ra}}
\textsf{+} \textsf{\emph{rb}}\textsf{) falls on a page not accessible
to the caller in the current operating mode, or a page marked as not
present.}
\item \textsf{Various other exceptions and replay conditions may exist depending
on the specific processor core model.}
\end{itemize}
\newpage{}\texttt{\textbf{\large st}}\textsf{}\\
\textsf{\Large Store}{\Large \lyxline{\Large}}{\Large \par}

\begin{tabular}{lll}
\textsf{\textbf{Mnemonic}} & \textsf{\textbf{Syntax}} & \textsf{\textbf{Operation}}\tabularnewline
\hline
\texttt{\textbf{\small st}} & \texttt{\small sfrd = {[}ra,rb],rc,sfra} & \textsf{sfrd = MergeWithSFR((ra + rb), sfra, rc)}\tabularnewline
\texttt{\textbf{\small st.lo}} & \texttt{\small sfrd = {[}ra+rb],rc,sfra} & \textsf{sfrd = MergeWithSFR(floor(ra + rb, 8), sfra, rc)}\tabularnewline
\texttt{\textbf{\small st.hi}} & \texttt{\small sfrd = {[}ra+rb],rc,sfra} & \textsf{sfrd = MergeWithSFR(floor(ra + rb, 8) + 8, sfra, rc)}\tabularnewline
\end{tabular}

\medskip{}
\textsf{\textbf{Notes:}}

\begin{itemize}
\item \textsf{\emph{The PTLsim store unit model is described in substantial
detail in Section \ref{sec:StoreMerging}; this section only gives
an overview of the store uop semantics.}}
\item \textsf{The} \texttt{\small st}\textsf{ family of uops prepares values
to be stored to the virtual address specified by the sum} \textsf{\emph{ra}}
\textsf{+} \textsf{\emph{rb}}\textsf{.}
\item \textsf{The} \textsf{\emph{sfra}} \textsf{operand specifies the store
forwarding register (a.k.a. store buffer) to merge the data to be
stored (the} \textsf{\emph{rc}} \textsf{operand) into. The inherited
SFR may be determined dynamically by querying a store queue or can
be predicted statically, as described in} \textsf{\emph{\ref{sec:StoreMerging}.}}
\item \textsf{Store uops only generate the SFR for tracking purposes; the
cache is only written when the SFR is committed.}
\item \textsf{The store uop may issue as soon as the} \textsf{\emph{ra}}
\textsf{and} \textsf{\emph{rb}} \textsf{operands are ready, even if
the} \textsf{\emph{rc}} \textsf{and} \textsf{\emph{sfra}} \textsf{operands
are not known. The store must be replayed once these operands become
known, in accordance with Section \ref{sec:SplitPhaseStores}.}
\item \textsf{Store uops do not generate any other condition code flags}
\end{itemize}
\textsf{\textbf{Unaligned Store Support:}}

\begin{itemize}
\item \textsf{The processor supports unaligned stores via a pair of} \texttt{\small st.lo}\textsf{
and} \texttt{\small st.hi}\textsf{ uops; an overview can be found
in Section \ref{sub:UnalignedLoadsAndStores}. The alignment type
of the load is stored in the uop's cond field (0 =} \texttt{\small st}\textsf{,
1 =} \texttt{\small st.lo}\textsf{, 2 =} \texttt{\small st.hi}\textsf{).}
\item Stores are handled in a similar manner, with \texttt{\small st.lo}
and \texttt{\small st.hi} rounding down and up to store parts of the
unaligned value in adjacent 64-bit blocks. 
\item \textsf{The} \texttt{\small st.lo}\textsf{ uop rounds down its effective
address $\left\lfloor ra+rb\right\rfloor $ to the nearest 64-bit
boundary and stores the appropriately aligned portion of the} \texttt{\small rc}\textsf{
operand that actually falls within that range of 8 bytes. The} \texttt{\small ld.hi}\textsf{
uop rounds $\left\lceil ra+rb+8\right\rceil $ up to the next 64-bit
boundary and similarly stores the appropriately aligned portion of
the} \texttt{\small rc}\textsf{ operand that actually falls within
that high range of 8 bytes.}
\item \textsf{Special corner case for when the actual user address (}\textsf{\emph{ra}}
\textsf{+} \textsf{\emph{rb}}\textsf{) did not actually touch any
bytes in the 8-byte range normally written by the} \texttt{\small st.hi}\textsf{
uop (i.e. the store was contained entirely within the low 64-bit aligned
chunk). Since it is perfectly legal to do an unaligned store to the
very end of the page such that the next 64 bit chunk is not mapped
to a valid page, the} \texttt{\small st.hi}\textsf{ uop does not actually
do anything in this case (the bytemask of the generated SFR is set
to zero and no exceptions are checked).}
\end{itemize}
\textsf{\textbf{Exceptions:}}

\begin{itemize}
\item \texttt{\small UnalignedAccess}\textsf{ if the address (}\textsf{\emph{ra}}
\textsf{+} \textsf{\emph{rb}}\textsf{) is not aligned to an integral
multiple of the size in bytes of the store. Unaligned stores (}\texttt{\small st.lo}\textsf{
and} \texttt{\small st.hi}\textsf{) do not generate this exception.
Since x86 automatically corrects alignment problems, microcode must
handle this exception as described in Section \ref{sub:UnalignedLoadsAndStores}.}
\item \texttt{\small PageFaultOnWrite}\textsf{ if the virtual address (}\textsf{\emph{ra}}
\textsf{+} \textsf{\emph{rb}}\textsf{) falls on a write protected
page, a page not accessible to the caller in the current operating
mode, or a page marked as not present.}
\item \texttt{\small LoadStoreAliasing}\textsf{ if a prior load is found
to alias the store (see Section \ref{sub:AliasCheck}).}
\item \textsf{Various other exceptions and replay conditions may exist depending
on the specific processor core model.}
\end{itemize}
\newpage{}\texttt{\textbf{\large ldp ldxp}}\textsf{}\\
\textsf{\Large Load from Internal Microcode Space}{\Large \lyxline{\Large}}{\Large \par}

\begin{tabular}{lll}
\textsf{\textbf{Mnemonic}} & \textsf{\textbf{Syntax}} & \textsf{\textbf{Operation}}\tabularnewline
\hline
\texttt{\textbf{\small ldp}} & \texttt{\small rd = {[}ra,rb]} & \textsf{rd = MSR{[}ra+rb]}\tabularnewline
\texttt{\textbf{\small ldxp}} & \texttt{\small rd = {[}ra+rb]} & \textsf{rd = SignExt(MSR{[}ra+rb])}\tabularnewline
\end{tabular}

\medskip{}
\textsf{\textbf{Notes:}}

\begin{itemize}
\item \textsf{The} \texttt{\small ldp}\textsf{ and} \texttt{\small ldxp}\textsf{
uops load values from the internal PTLsim address space not accessible
to x86 code. Typically this address space is mapped to internal machine
state registers (MSRs) and microcode scratch space. The internal address
to access is specified by the sum} \textsf{\emph{ra}} \textsf{+} \textsf{\emph{rb}}\textsf{.
The} \texttt{\small ldp}\textsf{ form zero extends the loaded value,
while the} \texttt{\small ldxp}\textsf{ form sign extends the loaded
value to 64 bits.}
\item \textsf{Load uops do not generate any other condition code flags}
\item \textsf{Internal loads may not be unaligned, and never stall or generate
exceptions.}
\end{itemize}
\newpage{}\texttt{\textbf{\large stp}}\textsf{}\\
\textsf{\Large Store to Internal Microcode Space}{\Large \lyxline{\Large}}{\Large \par}

\begin{tabular}{lll}
\textsf{\textbf{Mnemonic}} & \textsf{\textbf{Syntax}} & \textsf{\textbf{Operation}}\tabularnewline
\hline
\texttt{\textbf{\small stp}} & \texttt{\small null = {[}ra,rb],rc} & \textsf{MSR{[}ra+rb] = rc}\tabularnewline
\end{tabular}

\medskip{}
\textsf{\textbf{Notes:}}

\begin{itemize}
\item \textsf{The} \texttt{\small stp}\textsf{ uop stores a value to the
internal PTLsim address space not accessible to x86 code. Typically
this address space is mapped to internal machine state registers (MSRs)
and microcode scratch space. The internal address to store is specified
by the sum} \textsf{\emph{ra}} \textsf{+} \textsf{\emph{rb}} \textsf{and
the value to store is specified by} \textsf{\emph{rc}}\textsf{.}
\item \textsf{Store uops do not generate any other condition code flags}
\item \textsf{Internal stores may not be unaligned, and never stall or generate
exceptions.}
\end{itemize}
\newpage{}\texttt{\textbf{\large shl shr sar rotl rotr rotcl rotcr}}\textsf{}\\
\textsf{\Large Shifts and Rotates}{\Large \lyxline{\Large}}{\Large \par}

\begin{tabular}{lll}
\textsf{\textbf{Mnemonic}} & \textsf{\textbf{Syntax}} & \textsf{\textbf{Operation}}\tabularnewline
\hline
\texttt{\textbf{\small shl}} & \texttt{\small rd = ra,rb,rc} & \textsf{rd = ra $\leftarrow$ (ra <\,{}< rb)}\tabularnewline
\texttt{\textbf{\small shr}} & \texttt{\small rd = ra,rb,rc} & \textsf{rd = ra $\leftarrow$ (ra >\,{}> rb)}\tabularnewline
\texttt{\textbf{\small sar}} & \texttt{\small rd = ra,rb,rc} & \textsf{rd = ra $\leftarrow$ SignExt(ra >\,{}> rb)}\tabularnewline
\texttt{\textbf{\small rotl}} & \texttt{\small rd = ra,rb,rc} & \textsf{rd = ra $\leftarrow$ (ra} \textsf{\emph{rotateleft}} \textsf{rb)}\tabularnewline
\texttt{\textbf{\small rotr}} & \texttt{\small rd = ra,rb,rc} & \textsf{rd = ra $\leftarrow$ (ra} \textsf{\emph{rotateright}} \textsf{rb)}\tabularnewline
\texttt{\textbf{\small rotcl}} & \texttt{\small rd = ra,rb,rc} & \textsf{rd = ra $\leftarrow$ (\{rc.cf, ra\}} \textsf{\emph{rotateleft}}
\textsf{rb)}\tabularnewline
\texttt{\textbf{\small rotcr}} & \texttt{\small rd = ra,rb,rc} & \textsf{rd = ra $\leftarrow$ (\{rc.cf, ra\}} \textsf{\emph{rotateright}}
\textsf{rb)}\tabularnewline
\end{tabular}

\medskip{}
\textsf{\textbf{Notes:}}

\begin{itemize}
\item \textsf{The shift and rotate instructions have some of the most bizarre
semantics in the entire x86 instruction set: they may or may not modify
flags depending on the rotation count operand, which we may not even
know until the instruction issues. This is introduced in Section \ref{sec:ShiftRotateProblems}.}
\item \textsf{The specific rules are as follows:}

\begin{itemize}
\item \textsf{If the count $rb=0$ is zero, no flags are modified}
\item \textsf{If the count $rb=1$, both OF and CF are modified, but ZAPS
is preserved}
\item \textsf{If the count $rb>1$, only the CF is modified. (Technically
the value in OF is undefined, but on K8 and P4, it retains the old
value, so we try to be compatible).}
\item \textsf{Shifts also alter the ZAPS flags while rotates do not.}
\end{itemize}
\item \textsf{For constant counts (immediate} \textsf{\emph{rb}} \textsf{values),
the semantics are easy to determine in advance.}
\item \textsf{For variable counts (}\textsf{\emph{rb}} \textsf{comes from
register), things are more complex. Since the shift needs to determine
its output flags at runtime based on both the shift count and the
input flags (CF, OF, ZAPS), we need to specify the latest versions
in program order of all the existing flags. However, this would require
three operands to the shift uop not even counting the value and count
operands. Therefore, we use a} \texttt{\small collcc}\textsf{ (collect
condition code flags, see Section \ref{sub:FlagsManagement}) uop
to get all the most up to date flags into one result, using three
operands for ZAPS, CF, OF. This forms a zero word with all the correct
flags attached, which is then forwarded as the} \textsf{\emph{rc}}
\textsf{operand to the shift. This may add additional scheduling constraints
in the case that one of the operands to the shift itself sets the
flags, but this is fairly rare. Conveniently, this also lets us directly
implement the 65-bit} \texttt{\small rotcl}\textsf{/}\texttt{\small rotcr}\textsf{
uops in hardware with little additional complexity.}
\item \textsf{All operations merge the ALU result with} \textsf{\emph{ra}}
\textsf{and generate flags in accordance with the standard x86 merging
rules described previously.}
\item \textsf{The specific flags attached to the result depend on the input
conditions described above. The user should always assume these uops
always produce the latest version of each of the ZAPS, CF, OF flag
sets.}
\end{itemize}
\newpage{}\texttt{\textbf{\large mask}}\textsf{}\\
\textsf{\Large Masking, Insertion and Extraction}{\Large \lyxline{\Large}}{\Large \par}

\begin{tabular}{lll}
\textsf{\textbf{Mnemonic}} & \textsf{\textbf{Syntax}} & \textsf{\textbf{Operation}}\tabularnewline
\hline
\texttt{\textbf{\small mask}}\texttt{\textbf{\emph{\small .x|z}}} & \texttt{\small rd = ra,rb,{[}ms,mc,ds]} & \textsf{See semantics below}\tabularnewline
\end{tabular}

\medskip{}
\textsf{\textbf{Notes:}}

\begin{itemize}
\item \textsf{The} \texttt{\small mask}\textsf{ uop and its variants are
used for generalized bit field extraction, insertion, sign and zero
extension using the 18-bit control field in the immediate}
\item \textsf{These uops are used extensively within PTLsim microcode, but
are also useful if the processor supports dynamically merging a chain
of} \texttt{\small shr}\textsf{,} \texttt{\small and}\textsf{,} \texttt{\small or}\textsf{
uops.}
\item \textsf{The condition code flags (ZAPS, CF, OF) are the flags logically
generated by the final AND operation.}
\end{itemize}
\medskip{}
\textsf{\textbf{Control Field Format}}

\textsf{The 18-bit} \textsf{\emph{rc}} \textsf{immediate has the following
three 6-bit fields:}

\noindent \begin{center}
\textsf{}\begin{tabular}{|c|c|c|}
\hline 
\textsf{\textbf{DS}} & \textsf{\textbf{MC}} & \textsf{\textbf{MS}}\tabularnewline
\hline 
\textsf{12} & \textsf{6} & \textsf{0}\tabularnewline
\hline
\end{tabular}
\par\end{center}

\begin{itemize}
\item \textsf{The} \texttt{\small mask}\textsf{ uop and its variants are
used for generalized bit field extraction, insertion, sign and zero
extension using the 18-bit control field in the immediate}
\end{itemize}
\medskip{}
\textsf{\textbf{Operation:}}

\begin{lyxcode}
{\small M~=~1'{[}(ms+mc-1):ms]}{\small \par}

{\small T~=~(ra~\&~\textasciitilde{}M)~|~((rb~>\,{}>\,{}>~ds)~\&~M)}{\small \par}

{\small if~(Z)~\{}{\small \par}

{\small{}~~\#~Zero~extend}{\small \par}

{\small{}~~rd~=~ra~$\leftarrow$~(T~\&~1'{[}(ms+mc-1):0])}{\small \par}

{\small else~if~(X)~\{}{\small \par}

{\small{}~~\#~Sign~extend}{\small \par}

{\small{}~~rd~=~ra~$\leftarrow$~(T{[}ms+mc-1])~?~(T~|~1'{[}63:(ms+mc)])~:~(T~\&~1'{[}(ms+mc-1):0])}{\small \par}

{\small \}~else~\{}{\small \par}

{\small{}~~rd~=~ra~$\leftarrow$~T}{\small \par}

{\small \}}{\small \par}


\end{lyxcode}
\newpage{}\texttt{\textbf{\large bswap}}\textsf{}\\
\textsf{\Large Byte Swap}{\Large \lyxline{\Large}}{\Large \par}

\begin{tabular}{lll}
\textsf{\textbf{Mnemonic}} & \textsf{\textbf{Syntax}} & \textsf{\textbf{Operation}}\tabularnewline
\hline
\texttt{\textbf{\small bswap}} & \texttt{\small rd = ra} & \textsf{rd = ra $\leftarrow$ ByteSwap(rb)}\tabularnewline
\end{tabular}

\medskip{}
\textsf{\textbf{Notes:}}

\begin{itemize}
\item \textsf{The} \texttt{\small bswap}\textsf{ uop reverses the endianness
of the} \textsf{\emph{rb}} \textsf{operand. The uop's effective result
size determines the range of bytes which are reversed.}
\item \textsf{This uop's semantics are identical to the x86} \texttt{\small bswap}\textsf{
instruction.}
\item \textsf{This uop does not generate any condition code flags.}
\end{itemize}
\newpage{}\texttt{\textbf{\large collcc}}\textsf{}\\
\textsf{\Large Collect Condition Codes}{\Large \lyxline{\Large}}{\Large \par}

\begin{tabular}{llp{0.5\columnwidth}}
\textsf{\textbf{Mnemonic}} & \textsf{\textbf{Syntax}} & \textsf{\textbf{Operation}}\tabularnewline
\hline
\texttt{\textbf{\small collcc}} & \texttt{\small rd = ra,rb,rc} & \parbox[t]{0.5\columnwidth}{\textsf{rd.zaps = ra.zaps }\\
\textsf{rd.cf = rb.cf}\\
\textsf{rd.of = rc.of}\\
\textsf{rd = rd.flags}}\tabularnewline
\end{tabular}

\medskip{}
\textsf{\textbf{Notes:}}

\begin{itemize}
\item \textsf{The} \texttt{\small collcc}\textsf{ uop collects the condition
code flags from three potentially distinct source operands into a
single output with the combined condition code flags in both its appended
flags and data.}
\item \textsf{This uop is useful for collecting all flags before passing
them as input to another uop which only supports one source of flags
(for instance, the shift and rotate uops).}
\end{itemize}
\newpage{}\texttt{\textbf{\large movccr movrcc}}\textsf{}\\
\textsf{\Large Move Condition Code Flags Between Register Value and
Flag Parts}{\Large \lyxline{\Large}}{\Large \par}

\begin{tabular}{llp{0.5\columnwidth}}
\textsf{\textbf{Mnemonic}} & \textsf{\textbf{Syntax}} & \textsf{\textbf{Operation}}\tabularnewline
\hline
\texttt{\textbf{\small movccr}} & \texttt{\small rd = ra} & \parbox[t]{0.5\columnwidth}{\textsf{rd = ra.flags}\\
\textsf{rd.flags = 0}}\tabularnewline
\texttt{\textbf{\small movrcc}} & \texttt{\small rd = ra} & \parbox[t]{0.5\columnwidth}{\textsf{rd.flags = ra}\\
\textsf{rd = ra}}\tabularnewline
\end{tabular}

\medskip{}
\textsf{\textbf{Notes:}}

\begin{itemize}
\item \textsf{The} \texttt{\small movccr}\textsf{ uop takes the condition
code flag bits attached to} \textsf{\emph{ra}} \textsf{and copies
them into the 64-bit register part of the result.}
\item \textsf{The} \texttt{\small movrcc}\textsf{ uop takes the low bits
of the} \textsf{\emph{ra}} \textsf{operand and moves those bits into
the condition code flag bits attached to the result.}
\item \textsf{The bits moved consist of the ZF, PF, SF, CF, OF flags}
\item \textsf{The WAIT and INV flags of the result are always cleared since
the uop would not even issue if these were set in} \textsf{\emph{ra}}\textsf{.}
\end{itemize}
\newpage{}\texttt{\textbf{\large andcc orcc ornotcc xorcc}}\textsf{}\\
\textsf{\Large Logical Operations on Condition Codes}{\Large \lyxline{\Large}}{\Large \par}

\begin{tabular}{lll}
\textsf{\textbf{Mnemonic}} & \textsf{\textbf{Syntax}} & \textsf{\textbf{Operation}}\tabularnewline
\hline
\texttt{\textbf{\small andcc}} & \texttt{\small rd = ra,rb} & \textsf{rd.flags = ra.flags \& rb.flags}\tabularnewline
\texttt{\textbf{\small orcc}} & \texttt{\small rd = ra,rb} & \textsf{rd.flags = ra.flags | rb.flags}\tabularnewline
\texttt{\textbf{\small ornotcc}} & \texttt{\small rd = ra,rb} & \textsf{rd.flags = ra.flags | (\textasciitilde{}rb.flags)}\tabularnewline
\texttt{\textbf{\small xorcc}} & \texttt{\small rd = ra,rb} & \textsf{rd.flags = ra.flags \textasciicircum{} rb.flags}\tabularnewline
\end{tabular}

\medskip{}
\textsf{\textbf{Notes:}}

\begin{itemize}
\item \textsf{These uops are used to perform logical operations on the condition
code flags attached to} \textsf{\emph{ra}} \textsf{and} \textsf{\emph{rb}}\textsf{.}
\item \textsf{If the} \textsf{\emph{rb}} \textsf{operand is an immediate,
the immediate data is used instead of the flags normally attached
to a register operand.}
\item \textsf{The 64-bit value of the output is always set to zero.}
\end{itemize}
\newpage{}\texttt{\textbf{\large mull mulh}}\textsf{}\\
\textsf{\Large Integer Multiplication}{\Large \lyxline{\Large}}{\Large \par}

\begin{tabular}{lll}
\textsf{\textbf{Mnemonic}} & \textsf{\textbf{Syntax}} & \textsf{\textbf{Operation}}\tabularnewline
\hline
\texttt{\textbf{\small mull}} & \texttt{\small rd = ra,rb} & \textsf{rd = ra $\leftarrow$ lowbits(ra $\times$ rb)}\tabularnewline
\texttt{\textbf{\small mulh}} & \texttt{\small rd = ra,rb} & \textsf{rd = ra $\leftarrow$ highbits(ra $\times$ rb)}\tabularnewline
\end{tabular}

\medskip{}
\textsf{\textbf{Notes:}}

\begin{itemize}
\item \textsf{These uops multiply} \textsf{\emph{ra}} \textsf{and} \textsf{\emph{rb}}\textsf{,
then retain only the low} \textsf{\emph{N}} \textsf{bits or high}
\textsf{\emph{N}} \textsf{bits of the result (where N is the uop's
effective result size in bits). This result is then merged into} \textsf{\emph{ra}}\textsf{.}
\item \textsf{The condition code flags generated by these uops correspond
to the normal x86 semantics for integer multiplication (}\texttt{\small imul}\textsf{);
the flags are calculated relative to the effective result size.}
\item \textsf{The} \textsf{\emph{rb}} \textsf{operand may be an immediate}
\end{itemize}
\newpage{}\texttt{\textbf{\large bt bts btr btc}}\textsf{}\\
\textsf{\Large Bit Testing and Manipulation}{\Large \lyxline{\Large}}{\Large \par}

\begin{tabular}{llp{0.5\columnwidth}}
\textsf{\textbf{Mnemonic}} & \textsf{\textbf{Syntax}} & \textsf{\textbf{Operation}}\tabularnewline
\hline
\texttt{\textbf{\small bt}} & \texttt{\small rd = ra,rb} & \parbox[t]{0.5\columnwidth}{\textsf{rd.cf = ra{[}rb] }\\
\textsf{rd = ra $\leftarrow$ (rd.cf) ? -1 : +1}}\tabularnewline
\texttt{\textbf{\small bts}} & \texttt{\small rd = ra,rb} & \parbox[t]{0.5\columnwidth}{\textsf{rd.cf = ra{[}rb]}\\
\textsf{rd = ra $\leftarrow$ ra | (1 <\,{}< rb)}}\tabularnewline
\texttt{\textbf{\small btr}} & \texttt{\small rd = ra,rb} & \parbox[t]{0.5\columnwidth}{\textsf{rd.cf = ra{[}rb]}\\
\textsf{rd = ra $\leftarrow$ ra \& (\textasciitilde{}(1 <\,{}< rb))}}\tabularnewline
\texttt{\textbf{\small btc}} & \texttt{\small rd = ra,rb} & \parbox[t]{0.5\columnwidth}{\textsf{rd.cf = ra{[}rb]}\\
\textsf{rd = ra $\leftarrow$ ra \textasciicircum{} (1 <\,{}< rb)}}\tabularnewline
\end{tabular}

\medskip{}
\textsf{\textbf{Notes:}}

\begin{itemize}
\item \textsf{These uops test a given bit in} \textsf{\emph{ra}} \textsf{and
then atomically modify (set, reset or complement) that bit in the
result.}
\item \textsf{The CF flag of the output is set to the original value in
bit position} \textsf{\emph{rb}} \textsf{of} \textsf{\emph{ra}}\textsf{.
Other condition code flag bits in the output are undefined.}
\item \textsf{The} \texttt{\small bt}\textsf{ (bit test) uop is special:
it generates a value of -1 or +1 if the tested bit is 1 or 0, respectively.
This is used in microcode for setting up an increment for the} \texttt{\small rep}\textsf{
x86 instructions.}
\end{itemize}
\newpage{}\texttt{\textbf{\large ctz clz}}\textsf{}\\
\textsf{\Large Count Trailing or Leading Zeros}{\Large \lyxline{\Large}}{\Large \par}

\begin{tabular}{llp{0.5\columnwidth}}
\textsf{\textbf{Mnemonic}} & \textsf{\textbf{Syntax}} & \textsf{\textbf{Operation}}\tabularnewline
\hline
\texttt{\textbf{\small ctz}} & \texttt{\small rd = ra} & \parbox[t]{0.5\columnwidth}{\textsf{rd.zf = (rb == 0)}\\
\textsf{rd = ra $\leftarrow$ (rb) ? LSBIndex(rb) : 0}}\tabularnewline
\texttt{\textbf{\small clz}} & \texttt{\small rd = ra} & \parbox[t]{0.5\columnwidth}{\textsf{rd.zf = (rb == 0)}\\
\textsf{rd = ra $\leftarrow$ (rb) ? MSBIndex(rb) : 0}}\tabularnewline
\end{tabular}

\medskip{}
\textsf{\textbf{Notes:}}

\begin{itemize}
\item \textsf{These uops find the bit index of the first '1' bit in} \textsf{\emph{rb}}\textsf{,
starting from the lowest bit 0 (for} \texttt{\small ctz}\textsf{)
or the highest bit of the data type (for} \texttt{\small clz}\textsf{).}
\item \textsf{The result is zero (technically, undefined) if ra is zero.}
\item \textsf{The ZF flag of the result is 1 if} \textsf{\emph{rb}} \textsf{was
zero, or 0 if} \textsf{\emph{rb}} \textsf{was nonzero. Other condition
code flags are undefined.}
\end{itemize}
\newpage{}\texttt{\textbf{\large ctpop}}\textsf{}\\
\textsf{\Large Count Population of '1' Bits}{\Large \lyxline{\Large}}{\Large \par}

\begin{tabular}{llp{0.5\columnwidth}}
\textsf{\textbf{Mnemonic}} & \textsf{\textbf{Syntax}} & \textsf{\textbf{Operation}}\tabularnewline
\hline
\texttt{\textbf{\small ctpop}} & \texttt{\small rd = ra} & \parbox[t]{0.5\columnwidth}{\textsf{rd.zf = (ra == 0)}\\
\textsf{rd = PopulationCount(ra)}}\tabularnewline
\end{tabular}

\medskip{}
\textsf{\textbf{Notes:}}

\begin{itemize}
\item \textsf{The} \texttt{\small ctpop}\textsf{ uop counts the number of
'1' bits in the} \textsf{\emph{ra}} \textsf{operand.}
\item \textsf{The ZF flag of the result is 1 if} \textsf{\emph{ra}} \textsf{was
zero, or 0 if} \textsf{\emph{ra}} \textsf{was nonzero. Other condition
code flags are undefined.}
\end{itemize}
\newpage{}\texttt{\textbf{\large ~}}\textsf{}\\
\textsf{\Large Floating Point Format and Merging}{\Large \lyxline{\Large}}{\Large \par}

\medskip{}
\textsf{All floating point uops use the same encoding to specify the
precision and vector format of the operands. The uop's} \textsf{\emph{size}}
\textsf{field is encoded as follows:}

\begin{itemize}
\item \texttt{\textbf{\small 00:}}\textsf{ Single precision scalar floating
point (}\texttt{\emph{\small op}}\texttt{\textbf{\small fp}}\textsf{
mnemonic). The operation is only performed on the low 32 bits (in
IEEE single precision format) of the 64-bit inputs; the high 32 bits
of the ra operand are copied to the high 32 bits of the output.}
\item \texttt{\textbf{\small 01:}}\textsf{ Single precision vector floating
point (}\texttt{\emph{\small op}}\texttt{\textbf{\small fv}}\textsf{
mnemonic). The operation is performed on both 32 bit halves (in IEEE
single precision format) of the 64-bit inputs in parallel}
\item \texttt{\textbf{\small 1x:}}\textsf{ Double precision scalar floating
point (}\texttt{\emph{\small op}}\texttt{\textbf{\small fd}}\textsf{
mnemonic). The operation is performed on the full 64 bit inputs (in
IEEE double precision format)}
\end{itemize}
\textsf{Most floating point operations merge the result with the}
\textsf{\emph{ra}} \textsf{operand to prepare the destination. Since
a full 64-bit result is generated with the vector and double formats,
the} \textsf{\emph{ra}} \textsf{operand is not needed and may be specified
as zero to reduce dependencies.}

\textsf{Exceptions to this encoding are listed where appropriate.}

\textsf{Unless otherwise noted, all operations update the internal
floating point status register (FPSR, equivalent to the MXCSR register
in x86 code) by ORing in any exceptions that occur. If the uop is
encoded to generate an actual exception on excepting conditions, the}
\texttt{\small FLAG\_INV}\textsf{ flag is attached to the output to
cause an exception at commit time.}

\textsf{No condition code flags are generated by floating point uops
unless otherwise noted.}

\newpage{}\texttt{\textbf{\large addf subf mulf divf minf maxf}}\textsf{}\\
\textsf{\Large Floating Point Arithmetic}{\Large \lyxline{\Large}}{\Large \par}

\begin{tabular}{lll}
\textsf{\textbf{Mnemonic}} & \textsf{\textbf{Syntax}} & \textsf{\textbf{Operation}}\tabularnewline
\hline
\texttt{\textbf{\small addf}} & \texttt{\small rd = ra,rb} & \textsf{rd = ra $\leftarrow$ ra + rb}\tabularnewline
\texttt{\textbf{\small subf}} & \texttt{\small rd = ra,rb} & \textsf{rd = ra $\leftarrow$ ra - rb}\tabularnewline
\texttt{\textbf{\small mulf}} & \texttt{\small rd = ra,rb} & \textsf{rd = ra $\leftarrow$ ra $\times$ rb}\tabularnewline
\texttt{\textbf{\small divf}} & \texttt{\small rd = ra,rb} & \textsf{rd = ra $\leftarrow$ ra / rb}\tabularnewline
\texttt{\textbf{\small minf}} & \texttt{\small rd = ra,rb} & \textsf{rd = ra $\leftarrow$ (ra < rb) ? ra : rb}\tabularnewline
\texttt{\textbf{\small maxf}} & \texttt{\small rd = ra,rb} & \textsf{rd = ra $\leftarrow$ (ra >= rb) ? ra : rb}\tabularnewline
\end{tabular}

\medskip{}
\textsf{\textbf{Notes:}}

\begin{itemize}
\item \textsf{These uops do arithmetic on floating point numbers in various
formats as specified in the} \textsf{\emph{Floating Point Format and
Merging}} \textsf{page.}
\end{itemize}
\newpage{}\texttt{\textbf{\large maddf msubf}}\textsf{}\\
\textsf{\Large Fused Multiply Add and Subtract}{\Large \lyxline{\Large}}{\Large \par}

\begin{tabular}{lll}
\textsf{\textbf{Mnemonic}} & \textsf{\textbf{Syntax}} & \textsf{\textbf{Operation}}\tabularnewline
\hline
\texttt{\textbf{\small maddf}} & \texttt{\small rd = ra,rb,rc} & \textsf{rd = ra $\leftarrow$ (ra $\times$ rb) + rc}\tabularnewline
\texttt{\textbf{\small msubf}} & \texttt{\small rd = ra,rb,rc} & \textsf{rd = ra $\leftarrow$ (ra $\times$ rb) - rc}\tabularnewline
\end{tabular}

\medskip{}
\textsf{\textbf{Notes:}}

\begin{itemize}
\item \textsf{The} \texttt{\small maddf}\textsf{ and} \texttt{\small msubf}\textsf{
uops perform fused multiply and accumulate operations on three operands.}
\item \textsf{The full internal precision is preserved between the multiply
and add operations; rounding only occurs at the end.}
\item \textsf{These uops are primarily used by microcode to calculate floating
point division, square root and reciprocal.}
\end{itemize}
\newpage{}\texttt{\textbf{\large sqrtf rcpf rsqrtf}}\textsf{}\\
\textsf{\Large Square Root, Reciprocal and Reciprocal Square Root}{\Large \lyxline{\Large}}{\Large \par}

\begin{tabular}{lll}
\textsf{\textbf{Mnemonic}} & \textsf{\textbf{Syntax}} & \textsf{\textbf{Operation}}\tabularnewline
\hline
\texttt{\textbf{\small sqrtf}} & \texttt{\small rd = ra,rb} & \textsf{rd = ra $\leftarrow$ sqrt(rb)}\tabularnewline
\texttt{\textbf{\small rcpf}} & \texttt{\small rd = ra,rb} & \textsf{rd = ra $\leftarrow$ 1 / rb}\tabularnewline
\texttt{\textbf{\small rsqrtf}} & \texttt{\small rd = ra,rb} & \textsf{rd = ra $\leftarrow$ 1 / sqrt(rb)}\tabularnewline
\end{tabular}

\medskip{}
\textsf{\textbf{Notes:}}

\begin{itemize}
\item \textsf{These uops perform the specified unary operation on rb and
merge the result into ra (for a single precision scalar mode only)}
\item \textsf{The} \texttt{\small rcpf}\textsf{ and} \texttt{\small rsqrtf}\textsf{
uops are approximates - they do not provide the full precision results.
These approximations are in accordance with the standard x86 SSE/SSE2
semantics.}
\end{itemize}
\newpage{}\texttt{\textbf{\large cmpf}}\textsf{}\\
\textsf{\Large Compare Floating Point}{\Large \lyxline{\Large}}{\Large \par}

\begin{tabular}{lll}
\textsf{\textbf{Mnemonic}} & \textsf{\textbf{Syntax}} & \textsf{\textbf{Operation}}\tabularnewline
\hline
\texttt{\textbf{\small cmpf}}\texttt{\textbf{\emph{\small .type}}} & \texttt{\small rd = ra,rb} & \textsf{rd = ra $\leftarrow$ CompareFP(ra, rb, type) ? -1 : 0}\tabularnewline
\end{tabular}

\medskip{}
\textsf{\textbf{Notes:}}

\begin{itemize}
\item \textsf{This uop performs the specified comparison of} \textsf{\emph{ra}}
\textsf{and} \textsf{\emph{rb}}\textsf{. If the comparison is true,
the result is set to all '1' bits; otherwise it is zero. The result
is then merged into ra.}
\item \textsf{The} \textsf{\emph{cond}} \textsf{field in the uop encoding
holds the comparison type. The set of compare types matches the x86
SSE/SSE2 CMPxx instructions.}
\end{itemize}
\newpage{}\texttt{\textbf{\large cmpccf}}\textsf{}\\
\textsf{\Large Compare Floating Point and Generate Condition Codes}{\Large \lyxline{\Large}}{\Large \par}

\begin{tabular}{lll}
\textsf{\textbf{Mnemonic}} & \textsf{\textbf{Syntax}} & \textsf{\textbf{Operation}}\tabularnewline
\hline
\texttt{\textbf{\small cmpccf}}\texttt{\textbf{\emph{\small .type}}} & \texttt{\small rd = ra,rb} & \textsf{rd.flags = CompareFPFlags(ra, rb)}\tabularnewline
\end{tabular}

\medskip{}
\textsf{\textbf{Notes:}}

\begin{itemize}
\item \textsf{This uop performs all comparisons of} \textsf{\emph{ra}} \textsf{and}
\textsf{\emph{rb}} \textsf{and produces x86 condition code flags (ZF,
PF, CF) to represent the result.}
\item \textsf{The semantics of the generated condition code flags exactly
matches the x86 SSE/SSE2 instructions} \texttt{\small COMISS}\textsf{/}\texttt{\small COMISD}\textsf{/}\texttt{\small UCOMISS}\textsf{/}\texttt{\small UCOMISD}\textsf{.}
\item \textsf{Unlike most encodings, the} \textsf{\emph{size}} \textsf{field
holds the comparison type of the two values as follows:}

\begin{itemize}
\item \texttt{\textbf{\small 00:}} \texttt{\small cmpccfp}\textsf{: single
precision ordered compare (same semantics as x86 SSE} \texttt{\small COMISS}\textsf{)}
\item \texttt{\textbf{\small 01:}} \texttt{\small cmpccfp.u}\textsf{: single
precision unordered compare (same semantics as x86 SSE} \texttt{\small UCOMISS}\textsf{)}
\item \texttt{\textbf{\small 10:}} \texttt{\small cmpccfd}\textsf{: double
precision ordered compare (same semantics as x86 SSE2} \texttt{\small COMISD}\textsf{)}
\item \texttt{\textbf{\small 11:}} \texttt{\small cmpccfd.u}\textsf{: double
precision ordered compare (same semantics as x86 SSE2} \texttt{\small UCOMISD}\textsf{)}
\end{itemize}
\end{itemize}
\newpage{}\texttt{\textbf{\large cvtf.i2s.ins cvtf.i2s.p cvtf.i2d.lo
cvtf.i2d.hi}}\textsf{}\\
\textsf{\Large Convert 32-bit Integer to Floating Point}{\Large \lyxline{\Large}}{\Large \par}

\begin{tabular}{llp{0.5\columnwidth}p{0.1\columnwidth}}
\textsf{\textbf{Mnemonic}} & \textsf{\textbf{Syntax}} & \textsf{\textbf{Operation}} & \textsf{\textbf{Used By}}\tabularnewline
\hline
\texttt{\textbf{\small cvtf.i2s.ins}} & \texttt{\small rd = ra,rb} & \textsf{rd = ra $\leftarrow$ Int32ToFloat(rb)} & \texttt{\small CVTSI2SS}\tabularnewline
\texttt{\textbf{\small cvtf.i2s.p}} & \texttt{\small rd = zero,rb} & \parbox[t]{0.5\columnwidth}{\textsf{rd{[}31:0] = Int32ToFloat(rb{[}31:0])}\\
\textsf{rd{[}63:32] = Int32ToFloat(rb{[}63:32])}} & \texttt{\small CVTPI2PS}\tabularnewline
\texttt{\textbf{\small cvtf.i2d.lo}} & \texttt{\small rd = zero,rb} & \textsf{rd = Int32ToDouble(rb{[}31:0])} & \parbox[t]{0.1\columnwidth}{\texttt{\small CVTSI2SD}~\\
\texttt{\small CVTPI2PD}}\tabularnewline
\texttt{\textbf{\small cvtf.i2d.hi}} & \texttt{\small rd = zero,rb} & \textsf{rd = Int32ToDouble(rb{[}63:32])} & \texttt{\small CVTPI2PD}\tabularnewline
\end{tabular}

\medskip{}
\textsf{\textbf{Notes:}}

\begin{itemize}
\item \textsf{These uops convert 32-bit integers to single or double precision
floating point}
\item \textsf{The semantics of these instructions are identical to the semantics
of the x86 SSE/SSE2 instructions shown in the table}
\item \textsf{The uop} \textsf{\emph{size}} \textsf{field is not used by
these uops}
\end{itemize}
\newpage{}\texttt{\textbf{\large cvtf.q2s.ins cvtf.q2d}}\textsf{}\\
\textsf{\Large Convert 64-bit Integer to Floating Point}{\Large \lyxline{\Large}}{\Large \par}

\begin{tabular}{llp{0.5\columnwidth}p{0.1\columnwidth}}
\textsf{\textbf{Mnemonic}} & \textsf{\textbf{Syntax}} & \textsf{\textbf{Operation}} & \textsf{\textbf{Used By}}\tabularnewline
\hline
\texttt{\textbf{\small cvtf.q2s.ins}} & \texttt{\small rd = ra,rb} & \textsf{rd = ra $\leftarrow$ Int64ToFloat(rb)} & \parbox[t]{0.1\columnwidth}{\texttt{\small CVTSI2SS}\textsf{}\\
\textsf{(x86-64)}}\tabularnewline
\texttt{\textbf{\small cvtf.q2d}} & \texttt{\small rd = ra} & \textsf{rd = Int64ToDouble(ra)} & \parbox[t]{0.1\columnwidth}{\texttt{\small CVTPI2PS}~\\
\textsf{(x86-64)}}\tabularnewline
\end{tabular}

\medskip{}
\textsf{\textbf{Notes:}}

\begin{itemize}
\item \textsf{These uops convert 64-bit integers to single or double precision
floating point}
\item \textsf{The semantics of these instructions are identical to the semantics
of the x86 SSE/SSE2 instructions shown in the table}
\item \textsf{The uop} \textsf{\emph{size}} \textsf{field is not used by
these uops}
\end{itemize}
\newpage{}\texttt{\textbf{\large cvtf.s2i cvt.s2q cvtf.s2i.p}}\textsf{}\\
\textsf{\Large Convert Single Precision Floating Point to Integer}{\Large \lyxline{\Large}}{\Large \par}

\begin{tabular}{llp{0.5\columnwidth}p{0.1\columnwidth}}
\textsf{\textbf{Mnemonic}} & \textsf{\textbf{Syntax}} & \textsf{\textbf{Operation}} & \textsf{\textbf{Used By}}\tabularnewline
\hline
\texttt{\textbf{\small cvtf.s2i}} & \texttt{\small rd = ra} & \textsf{rd = FloatToInt32(ra{[}31:0])} & \texttt{\small CVTSS2SI}\tabularnewline
\texttt{\textbf{\small cvtf.s2i.p}} & \texttt{\small rd = ra} & \parbox[t]{0.5\columnwidth}{\textsf{rd{[}31:0] = FloatToInt32(ra{[}31:0])}\\
\textsf{rd{[}63:32] = FloatToInt32(ra{[}63:32])}} & \parbox[t]{0.1\columnwidth}{\texttt{\small CVTPS2PI}~\\
\texttt{\small CVTPS2DQ}}\tabularnewline
\texttt{\textbf{\small cvtf.s2q}} & \texttt{\small rd = ra} & \textsf{rd = FloatToInt64(ra)} & \parbox[t]{0.1\columnwidth}{\texttt{\small CVTSS2SI}\\
\textsf{(x86-64)}}\tabularnewline
\end{tabular}

\medskip{}
\textsf{\textbf{Notes:}}

\begin{itemize}
\item \textsf{These uops convert single precision floating point values
to 32-bit or 64-bit integers}
\item \textsf{The semantics of these instructions are identical to the semantics
of the x86 SSE/SSE2 instructions shown in the table}
\item \textsf{Unlike most encodings, the} \textsf{\emph{size}} \textsf{field
holds the rounding type of the result as follows:}

\begin{itemize}
\item \texttt{\textbf{\small x0:}}\textsf{ normal IEEE rounding (as determined
by FPSR)}
\item \texttt{\textbf{\small x1:}}\textsf{ truncate to zero}
\end{itemize}
\end{itemize}
\newpage{}\texttt{\textbf{\large cvtf.d2i cvtf.d2q cvtf.d2i.p}}\textsf{}\\
\textsf{\Large Convert Double Precision Floating Point to Integer}{\Large \lyxline{\Large}}{\Large \par}

\begin{tabular}{llp{0.5\columnwidth}p{0.1\columnwidth}}
\textsf{\textbf{Mnemonic}} & \textsf{\textbf{Syntax}} & \textsf{\textbf{Operation}} & \textsf{\textbf{Used By}}\tabularnewline
\hline
\texttt{\textbf{\small cvtf.d2i}} & \texttt{\small rd = ra} & \textsf{rd = DoubleToInt32(ra)} & \texttt{\small CVTSD2SI}\tabularnewline
\texttt{\textbf{\small cvtf.d2i.p}} & \texttt{\small rd = ra,rb} & \parbox[t]{0.5\columnwidth}{\textsf{rd{[}63:32] = DoubleToInt32(ra)}\\
\textsf{rd{[}31:0] = DoubleToInt32(rb)}} & \parbox[t]{0.1\columnwidth}{\texttt{\small CVTPD2PI}~\\
\texttt{\small CVTPD2DQ}}\tabularnewline
\texttt{\textbf{\small cvtf.d2q}} & \texttt{\small rd = ra} & \textsf{rd = DoubleToInt64(ra)} & \parbox[t]{0.1\columnwidth}{\texttt{\small CVTSD2SI}\\
\textsf{(x86-64)}}\tabularnewline
\end{tabular}

\medskip{}
\textsf{\textbf{Notes:}}

\begin{itemize}
\item \textsf{These uops convert double precision floating point values
to 32-bit or 64-bit integers}
\item \textsf{The semantics of these instructions are identical to the semantics
of the x86 SSE/SSE2 instructions shown in the table}
\item \textsf{Unlike most encodings, the} \textsf{\emph{size}} \textsf{field
holds the rounding type of the result as follows:}

\begin{itemize}
\item \texttt{\textbf{\small x0:}}\textsf{ normal IEEE rounding (as determined
by FPSR)}
\item \texttt{\textbf{\small x1:}}\textsf{ truncate to zero}
\end{itemize}
\end{itemize}
\newpage{}\texttt{\textbf{\large cvtf.d2s.ins cvtf.d2s.p cvtf.s2d.lo
cvtf.s2d.hi}}\textsf{}\\
\textsf{\Large Convert Between Double Precision and Single Precision
Floating Point}{\Large \lyxline{\Large}}{\Large \par}

\begin{tabular}{llp{0.5\columnwidth}p{0.1\columnwidth}}
\textsf{\textbf{Mnemonic}} & \textsf{\textbf{Syntax}} & \textsf{\textbf{Operation}} & \textsf{\textbf{Used By}}\tabularnewline
\hline
\texttt{\textbf{\small cvtf.d2s.ins}} & \texttt{\small rd = ra,rb} & \textsf{rd = ra $\leftarrow$ DoubleToFloat(rb)} & \texttt{\small CVTSD2SS}\tabularnewline
\texttt{\textbf{\small cvtf.d2s.p}} & \texttt{\small rd = ra,rb} & \parbox[t]{0.5\columnwidth}{\textsf{rd{[}63:32] = DoubleToFloat(ra)}\\
\textsf{rd{[}31:0] = DoubleToFloat(rb)}} & \texttt{\small CVTPD2PS}\tabularnewline
\texttt{\textbf{\small cvtf.s2d.lo}} & \texttt{\small rd = zero,rb} & \textsf{rd = FloatToDouble(rb{[}31:0])} & \parbox[t]{0.1\columnwidth}{\texttt{\small CVTSS2SD}~\\
\texttt{\small CVTPS2PD}}\tabularnewline
\texttt{\textbf{\small cvtf.s2d.hi}} & \texttt{\small rd = zero,rb} & \textsf{rd = FloatToDouble(rb{[}63:32])} & \texttt{\small CVTPS2PD}\tabularnewline
\end{tabular}

\medskip{}
\textsf{\textbf{Notes:}}

\begin{itemize}
\item \textsf{These uops convert single precision floating point values
to double precision floating point values}
\item \textsf{The semantics of these instructions are identical to the semantics
of the x86 SSE/SSE2 instructions shown in the table}
\item \textsf{The uop} \textsf{\emph{size}} \textsf{field is not used by
these uops}
\end{itemize}

\chapter{\label{sec:PerformanceCounters}Performance Counters}

PTLsim maintains hundreds of performance and statistical counters
and data points as it simulates user code. In Section \ref{sec:StatisticsInfrastructure},
the basic mechanisms and data structures through which PTLsim collects
these data were disclosed, and a guide to extending the existing set
of collection points was presented.

This section is a reference listing of all the current performance
counters present in PTLsim by default. The sections below are arranged
in a hierarchical tree format, just as the data are represented in
PTLsim's data store. The types of data collected closely match the
performance counters available on modern Intel and AMD x86 processors,
as described in their respective reference manuals.


\section{General}

As described in Section \ref{sec:StatisticsInfrastructure}, PTLsim
maintains a hierarchical tree of statistical data, defined in \texttt{\footnotesize stats.h}.
The data store contains a potentially large number of snapshots of
this tree, numbered starting at 0. The final snapshot, taken just
before simulation completes, is labeled as {}``final''. Each snapshot
branch contains all of the data structures described in the next few
sections. Snapshots are enabled with the \texttt{\footnotesize -snapshot-cycles}
configuration option (Section \ref{sec:ConfigurationOptions}); if
they are disabled, only the {}``0'' and {}``final'' snapshots
are provided.


\section{Summary}

The \texttt{\footnotesize summary}{\footnotesize{} }toplevel branch
summarizes information about the simulation run across all cores:

\texttt{\textbf{\footnotesize summary:}} general information

\begin{itemize}
\item \texttt{\textbf{\footnotesize cycles:}}{\footnotesize{} }total number
of simulated cycles completed
\item \texttt{\textbf{\footnotesize insns:}}{\footnotesize{} }total number
of complete x86 instructions committed
\item \texttt{\textbf{\footnotesize uops:}}{\footnotesize{} }total number
of uops committed
\item \texttt{\textbf{\footnotesize basic\_blocks:}}{\footnotesize{} }total
number of basic blocks executed
\end{itemize}
\texttt{\textbf{\footnotesize snapshot\_uuid:}}{\footnotesize{} }the
universally unique ID (UUID) of this snapshot. This number starts
from 0 and increases to infinity.

\texttt{\textbf{\footnotesize snapshot\_name:}}{\footnotesize{} }name
of this snapshot, if any. Named snapshots can be taken by the \texttt{\footnotesize ptlcall\_snapshot()}
call within the virtual machine, or by the \texttt{\footnotesize -snapshot-now}
\emph{name} command.


\section{Simulator}

The \texttt{\footnotesize simulator}{\footnotesize{} }toplevel branch
represents information about PTLsim itself:

\texttt{\textbf{\footnotesize version:}}{\footnotesize{} }PTLsim version
information

\begin{itemize}
\item \texttt{\textbf{\footnotesize build\_timestamp:}}{\footnotesize{} }the
date and time PTLsim (specifically, \texttt{\footnotesize ptlsim.o})
was last built
\item \texttt{\textbf{\footnotesize svn\_revision:}}{\footnotesize{} }Subversion
revision number for this PTLsim version
\item \texttt{\textbf{\footnotesize svn\_timestamp:}}{\footnotesize{} }Date
and time of Subversion commit for this version
\item \texttt{\textbf{\footnotesize build\_hostname:}}{\footnotesize{} }machine
name on which PTLsim was compiled
\item \texttt{\textbf{\footnotesize build\_compiler:}}{\footnotesize{} }gcc
compiler version used to build PTLsim
\end{itemize}
\texttt{\textbf{\footnotesize run:}}{\footnotesize{} }runtime environment
information

\begin{itemize}
\item \texttt{\textbf{\footnotesize timestamp:}}{\footnotesize{} }time (in
POSIX seconds-since-epoch format) this instance of PTLsim was started
\item \texttt{\textbf{\footnotesize hostname:}}{\footnotesize{} }machine
name on which PTLsim is running
\item \texttt{\textbf{\footnotesize kernel\_version:}}{\footnotesize{} }Linux
kernel version PTLsim is running under. For PTLsim/X, this is the
domain 0 kernel version
\item \texttt{\textbf{\footnotesize hypervisor\_version:}}{\footnotesize{}
}PTLsim/X Xen hypervisor version
\item \texttt{\textbf{\footnotesize executable:}}{\footnotesize{} }the executable
file being run under simulation (userspace PTLsim only)
\item \texttt{\textbf{\footnotesize args:}}{\footnotesize{} }the arguments
to the executable file (userspace PTLsim only)
\item \texttt{\textbf{\footnotesize native\_cpuid:}}{\footnotesize{} }CPUID
(brand/model/revision) of the host machine running PTLsim
\item \texttt{\textbf{\footnotesize native\_hz:}}{\footnotesize{} }core frequency
(cycles per second) of the host machine
\end{itemize}
\texttt{\textbf{\footnotesize config:}}{\footnotesize{} }the configuration
options last passed to PTLsim for this run

\texttt{\textbf{\footnotesize performance:}}{\footnotesize{} }PTLsim
internal performance data

\begin{itemize}
\item \texttt{\textbf{\footnotesize rate:}}{\footnotesize{} }operations per
wall-clock second (i.e. in outside world, not inside the virtual machine),
averaged over entire run. These are the status lines PTLsim prints
on the console and in the log file as it runs.

\begin{itemize}
\item \texttt{\textbf{\footnotesize cycles\_per\_second:}}{\footnotesize{}
}simulated cycles completed per second
\item \texttt{\textbf{\footnotesize issues\_per\_second:}}{\footnotesize{}
}uops issued per second
\item \texttt{\textbf{\footnotesize user\_commits\_per\_second:}}{\footnotesize{}
}x86 instructions committed per second
\end{itemize}
\end{itemize}

\section{Decoder}

The \texttt{\footnotesize decoder} toplevel branch represents the
x86-to-uop decoder, basic block cache, code page cache and other common
structures:

\texttt{\textbf{\footnotesize throughput:}} total decoded entities

\begin{itemize}
\item \texttt{\textbf{\footnotesize basic\_blocks:}} total basic blocks
(uop sequence terminated by a branch) decoded
\item \texttt{\textbf{\footnotesize x86\_insns:}}{\footnotesize{} }total
x86 instructions decoded
\item \texttt{\textbf{\footnotesize uops:}}{\footnotesize{} }total uops produced
from all decoded x86 instructions
\item \texttt{\textbf{\footnotesize bytes:}}{\footnotesize{} }total bytes
in all decoded x86 instructions
\end{itemize}
\texttt{\textbf{\footnotesize bb\_decode\_type:}}{\footnotesize{} }predominant
decoder type used for each basic block

\begin{itemize}
\item \texttt{\textbf{\footnotesize all\_insns\_fast:}}{\footnotesize{} }number
of basic blocks all instructions in the basic block were in the simple
regular subset of x86 and could be decoded entirely by the fast decoder
(\texttt{\footnotesize decode-fast.cpp})
\item \texttt{\textbf{\footnotesize some\_insns\_complex:}}{\footnotesize{}
}number of basic blocks in which one or more instructions required
complex decoding
\end{itemize}
\texttt{\textbf{\footnotesize page\_crossings:}}{\footnotesize{} }alignment
of instructions within page

\begin{itemize}
\item \texttt{\textbf{\footnotesize within\_page:}}{\footnotesize{} }number
of basic blocks in which all bytes in the basic block fell within
a single page
\item \texttt{\textbf{\footnotesize crosses\_page:}}{\footnotesize{} }number
of basic blocks in which some bytes crossed a page boundary (i.e.
required two MFN invalidate locators)
\end{itemize}
\texttt{\textbf{\footnotesize bbcache:}}{\footnotesize{} }basic block
cache accesses

\begin{itemize}
\item \texttt{\textbf{\footnotesize count:}}{\footnotesize{} }basic blocks
currently in the cache (i.e. at the time the stats snapshot was made)
\item \texttt{\textbf{\footnotesize inserts:}}{\footnotesize{} }total insert
operations
\item \texttt{\textbf{\footnotesize invalidates:}}{\footnotesize{} }invalidation
operations by type

\begin{itemize}
\item \texttt{\textbf{\footnotesize smc:}}{\footnotesize{} }self modifying
code required page to be invalidated
\item \texttt{\textbf{\footnotesize dma:}}{\footnotesize{} }DMA into page
with existing translations required page to be invalidated
\item \texttt{\textbf{\footnotesize spurious:}}{\footnotesize{} }\texttt{\footnotesize exec\_page\_fault}{\footnotesize{}
}assist determined the page has now been made executable
\item \texttt{\textbf{\footnotesize reclaim:}}{\footnotesize{} }garbage collector
discarded unused LRU basic blocks
\item \texttt{\textbf{\footnotesize dirty:}}{\footnotesize{} }page was already
dirty when new translation was to be made
\item \texttt{\textbf{\footnotesize empty:}}{\footnotesize{} }page was empty
(has no basic blocks)
\end{itemize}
\end{itemize}
\texttt{\textbf{\footnotesize pagecache:}}{\footnotesize{} }physical
code page cache

\begin{itemize}
\item \texttt{\textbf{\footnotesize count:}}{\footnotesize{} }physical pages
currently in the cache (i.e. at the time the stats snapshot was made)
\item \texttt{\textbf{\footnotesize inserts:}}{\footnotesize{} }total physical
page insert operations
\item \texttt{\textbf{\footnotesize invalidates:}}{\footnotesize{} }invalidation
operations by type

\begin{itemize}
\item \texttt{\textbf{\footnotesize smc:}}{\footnotesize{} }self modifying
code required page to be invalidated
\item \texttt{\textbf{\footnotesize dma:}}{\footnotesize{} }DMA into page
with existing translations required page to be invalidated
\item \texttt{\textbf{\footnotesize spurious:}}{\footnotesize{} }\texttt{\footnotesize exec\_page\_fault}{\footnotesize{}
}assist determined the page has now been made executable
\item \texttt{\textbf{\footnotesize reclaim:}}{\footnotesize{} }garbage collector
discarded unused LRU basic blocks
\item \texttt{\textbf{\footnotesize dirty:}}{\footnotesize{} }page was already
dirty when new translation was to be made
\item \texttt{\textbf{\footnotesize empty:}}{\footnotesize{} }page was empty
(has no basic blocks)
\end{itemize}
\end{itemize}
\texttt{\textbf{\footnotesize reclaim\_rounds:}}{\footnotesize{} }number
of times the memory manager attempted to reclaim unused basic blocks
(possibly with several attempts until enough memory was available)


\section{Out of Order Core}

The out of order core is represented by the \texttt{\footnotesize ooocore}
toplevel branch of the statistics data store tree:

\texttt{\textbf{\footnotesize cycles:}} total number of processor
cycles simulated

\texttt{\textbf{\footnotesize fetch:}} fetch stage statistics

\begin{itemize}
\item \texttt{\textbf{\footnotesize stop:}} totals up the reasons why fetching
finally stopped in each cycle

\begin{itemize}
\item \texttt{\textbf{\footnotesize stalled:}}{\footnotesize{} }fetch unit
was already stalled in the previous cycle
\item \texttt{\textbf{\footnotesize icache\_miss:}}{\footnotesize{} }an instruction
cache miss prevented further fetches
\item \texttt{\textbf{\footnotesize fetchq\_full:}}{\footnotesize{} }the
uop fetch queue is full
\item \texttt{\textbf{\footnotesize bogus\_rip:}}{\footnotesize{} }speculative
execution redirected the fetch unit to an inaccessible (or non-executable)
page. The fetch unit remains stalled in this state until the mis-speculation
is resolved.
\item \texttt{\textbf{\footnotesize microcode\_assist:}}{\footnotesize{}
}microcode assist must wait for pipeline to empty
\item \texttt{\textbf{\footnotesize branch\_taken:}} taken branches to non-sequential
addresses always stop fetching
\item \texttt{\textbf{\footnotesize full\_width:}} the maximum fetch width
was utilized without encountering any of the events above
\end{itemize}
\item \texttt{\textbf{\footnotesize opclass:}}{\footnotesize{} }histogram
of how many uops of various operation classes passed through the fetch
unit. The operation classes are defined in \texttt{\footnotesize ptlhwdef.h}
and assigned to various opcodes in \texttt{\footnotesize ptlhwdef.cpp}.
\item \texttt{\textbf{\footnotesize width:}} histogram of the fetch width
actually used on each cycle
\item \texttt{\textbf{\footnotesize blocks:}} blocks of x86 instructions
fetched (typically the processor can read at most e.g. 16 bytes out
of a 64 byte instruction cache line per cycle)
\item \texttt{\textbf{\footnotesize uops:}} total number of uops fetched
\item \texttt{\textbf{\footnotesize user\_insns:}} total number of x86 instructions
fetched
\end{itemize}
\texttt{\textbf{\footnotesize frontend:}} frontend pipeline (decode,
allocate, rename) statistics

\begin{itemize}
\item \texttt{\textbf{\footnotesize status:}} totals up the reasons why
frontend processing finally stopped in each cycle

\begin{itemize}
\item \texttt{\textbf{\footnotesize complete:}} all uops were successfully
allocated and renamed
\item \texttt{\textbf{\footnotesize fetchq\_empty:}} no more uops were available
for allocation
\item \texttt{\textbf{\footnotesize rob\_full:}} reorder buffer (ROB) was
full
\item \texttt{\textbf{\footnotesize physregs\_full:}} physical register
file was full even though an ROB slot was free
\item \texttt{\textbf{\footnotesize ldq\_full:}} load queue was full (too
many loads in the pipeline) even though physical registers were available
\item \texttt{\textbf{\footnotesize stq\_full:}} store queue was full (too
many stores in the pipeline)
\end{itemize}
\item \texttt{\textbf{\footnotesize width:}} histogram of the frontend width
actually used on each cycle
\item \texttt{\textbf{\footnotesize renamed:}} summarizes the type of renaming
that occurred for each uop (of the destination, not the operands)

\begin{itemize}
\item \texttt{\textbf{\footnotesize none:}} uop did not rename its destination
(primarily for stores and branches)
\item \texttt{\textbf{\footnotesize reg:}} uop renamed destination architectural
register
\item \texttt{\textbf{\footnotesize flags:}} uop renamed one or more of
the ZAPS, CF, OF flag sets but had no destination architectural register
\item \texttt{\textbf{\footnotesize reg\_and\_flags:}} uop renamed one or
more of the ZAPS, CF, OF flag sets as well as a destination architectural
register
\end{itemize}
\item \texttt{\textbf{\footnotesize alloc:}} summarizes the type of resource
allocation that occurred for each uop (in addition to its ROB slot):

\begin{itemize}
\item \texttt{\textbf{\footnotesize reg:}} uop was allocated a physical
register
\item \texttt{\textbf{\footnotesize ldreg:}} uop was a load and was allocated
both a physical register and a load queue entry
\item \texttt{\textbf{\footnotesize sfr:}} uop was a store and was allocated
a store forwarding register (SFR), a.k.a. store queue entry
\item \texttt{\textbf{\footnotesize br:}} uop was a branch and was allocated
branch-related resources (possibly including a destination physical
register)
\end{itemize}
\end{itemize}
\texttt{\textbf{\footnotesize dispatch:}} dispatch unit statistics

\begin{itemize}
\item \texttt{\textbf{\footnotesize source:}} totals up where each operand
to each uop currently resided at the time the uop was dispatched.
These statistics are broken out by cluster.

\begin{itemize}
\item \texttt{\textbf{\footnotesize waiting:}} how many operands were waiting
(i.e. not yet ready)
\item \texttt{\textbf{\footnotesize bypass:}} how many operands would come
from the bypass network if the uop were immediately issued
\item \texttt{\textbf{\footnotesize physreg:}} how many operands were already
written back to physical registers
\item \texttt{\textbf{\footnotesize archreg:}} how many operands would be
obtained from architectural registers
\end{itemize}
\item \texttt{\textbf{\footnotesize cluster:}} tracks the number of uops
issued to each cluster (or issue queue) in the processor. This list
will vary depending on the processor configuration. The value \emph{none}
means that no cluster could accept the uop because all issue queues
were full.
\item \texttt{\textbf{\footnotesize redispatch:}} statistics on the redispatch
speculation recovery rmechanism (Section \ref{sec:SpeculationRecovery})

\begin{itemize}
\item \texttt{\textbf{\footnotesize trigger\_uops}} measures how many uops
triggered redispatching because of a mis-speculation. This number
does not count towards the statistics below.
\item \texttt{\textbf{\footnotesize deadlock\_flushes}} measures how many
times the pipeline must be flushed to resolve a deadlock.
\item \texttt{\textbf{\footnotesize dependent\_uops}} is a histogram of
how many uops depended on each trigger uop, not including the trigger
uop itself.
\end{itemize}
\end{itemize}
\texttt{\textbf{\footnotesize issue:}} issue statistics

\begin{itemize}
\item \texttt{\textbf{\footnotesize result:}} histogram of the final disposition
of issuing each uop

\begin{itemize}
\item \texttt{\textbf{\footnotesize no-fu:}} no functional unit was available
within the uop's assigned cluster even though it was already issued
\item \texttt{\textbf{\footnotesize replay:}} uop attempted to execute but
could not complete, so it must remain in the issue queue to be replayed.
This event generally occurs when a load or store detects a previously
unknown forwarding dependency on a prior store, when the data to actually
store is not yet available, or when insufficient resources are available
to complete the memory operation. Details are given in Sections \ref{sec:IssuingLoads}
and \ref{sec:SplitPhaseStores}.
\item \texttt{\textbf{\footnotesize misspeculation:}} uop mis-speculated
and now all uops after and including the issued uop must be annulled.
This generally occurs with loads (Section \ref{sec:IssuingLoads})
and stores (Section \ref{sub:AliasCheck}) when unaligned accesses
or load-store aliasing occurs. This event is handled in accordance
with Section \ref{sec:SpeculationRecovery}.
\item \texttt{\textbf{\footnotesize refetch:}} uop and all subsequent uops
must be re-fetched to be decoded differently. For example, unaligned
loads and stores take this path so they can be cracked into two parts
after being refetched.
\item \texttt{\textbf{\footnotesize branch\_mispredict:}} uop was a branch
and mispredicted, such that all uops after (but not including) the
branch uop must be annulled. See Section \ref{sec:SpeculationAndRecovery}
for details.
\item \texttt{\textbf{\footnotesize exception:}} uop caused an exception
(though this may not be a user visible error due to speculative execution)
\item \texttt{\textbf{\footnotesize complete:}} uop completed successfully.
Note that this does \emph{not} mean the result is immediately ready;
for loads it simply means the request was issued to the cache.
\end{itemize}
\item \texttt{\textbf{\footnotesize source:}} totals up where each operand
to each uop was read from as it was issued

\begin{itemize}
\item \texttt{\textbf{\footnotesize bypass:}} how many operands came directly
off the bypass network
\item \texttt{\textbf{\footnotesize physreg:}} how many operands were read
from physical registers
\item \texttt{\textbf{\footnotesize archreg:}} how many operands were read
from committed architectural registers
\end{itemize}
\item \texttt{\textbf{\footnotesize width:}} histogram of the issue width
actually used on each cycle in each cluster. This object is further
broken down by cluster, since various clusters have different issue
width and policies.
\item \texttt{\textbf{\footnotesize opclass:}} histogram of how many uops
of various operation classes were issued. The operation classes are
defined in \texttt{\footnotesize ptlhwdef.h} and assigned to various
opcodes in \texttt{\footnotesize ptlhwdef.cpp}.
\end{itemize}
\texttt{\textbf{\footnotesize writeback:}} writeback stage statistics

\begin{itemize}
\item \texttt{\textbf{\footnotesize total\_writebacks:}} total number of
results written back to the physical register file
\item \texttt{\textbf{\footnotesize transient:}} transient versus persistent
values

\begin{itemize}
\item \texttt{\textbf{\footnotesize transient:}} the result technically
does not have to be written back to the physical register file at
all, since all consumers sourced the value off the bypass network
and the result is no longer available since the destination architectural
register pointing to it has since been renamed.
\item \texttt{\textbf{\footnotesize persistent:}} all values which do not
meet the conditions above and hence must still be written back
\end{itemize}
\item \texttt{\textbf{\footnotesize width:}} histogram of the writeback
width actually used on each cycle in each cluster. This object is
further broken down by cluster, since various clusters have different
issue width and policies.
\end{itemize}
\texttt{\textbf{\footnotesize commit:}} commit unit statistics

\begin{itemize}
\item \texttt{\textbf{\footnotesize uops:}} total number of uops committed
\item \texttt{\textbf{\footnotesize insns:}} total number of complete x86
instructions committed
\item \texttt{\textbf{\footnotesize result:}} histogram of the final disposition
of attempting to commit each uop

\begin{itemize}
\item \texttt{\textbf{\footnotesize none:}} one or more uops comprising
the x86 instruction at the head of the ROB were not yet ready to commit,
so commitment is terminated for that cycle
\item \texttt{\textbf{\footnotesize ok:}} result was successfully committed
\item \texttt{\textbf{\footnotesize exception:}} result caused a genuine
user visible exception. In userspace PTLsim, this will terminate the
simulation. In full system PTLsim/X, this is a normal and frequent
event. Floating point state dirty faults are counted under this category.
\item \texttt{\textbf{\footnotesize skipblock:}} This occurs in rare cases
when the processor must skip over the currently executing instruction
(such as in pathological cases of the \texttt{\footnotesize rep} x86
instructions).
\item \texttt{\textbf{\footnotesize barrier:}} the processor encountered
a barrier instruction, such as a system call, assist or pipeline flush.
The frontend has already been stopped and fetching has been redirected
to the code to handle the barrier; this condition simply commits the
barrier instruction itself.
\item \texttt{\textbf{\footnotesize smc:}} self modifying code: the instruction
attempting to commit has been modified since it was last decoded (see
Section \ref{sec:SelfModifyingCode})
\item \texttt{\textbf{\footnotesize stop:}} special case for when the simulation
is to be stopped after committing a certain number of x86 instructions
(e.g. via the \texttt{\footnotesize -stopinsns} option in Section
\ref{sec:ConfigurationOptions}).
\end{itemize}
\item \texttt{\textbf{\footnotesize setflags:}} how many uops updated the
condition code flags as they committed

\begin{itemize}
\item \texttt{\textbf{\footnotesize yes:}} how many uops updated at least
one of the ZAPS, CF, OF flag sets (the \texttt{\small REG\_flags}
internal architectural register)
\item \texttt{\textbf{\footnotesize no:}} how many uops did not update any
flags
\end{itemize}
\item \texttt{\textbf{\footnotesize freereg:}} how many uops were able to
free the old physical register mapped to their architectural destination
register at commit time

\begin{itemize}
\item \texttt{\textbf{\footnotesize pending:}} old physical register was
still referenced within the pipeline or by one or more rename table
entries
\item \texttt{\textbf{\footnotesize free:}} old physical register could
be immediately freed
\end{itemize}
\item \texttt{\textbf{\footnotesize free\_regs\_recycled:}} how many physical
registers were recycled (garbage collected) later than normal because
of one of the conditions above
\item \texttt{\textbf{\footnotesize width:}} histogram of the issue width
actually used on each cycle in each cluster. This object is further
broken down by cluster, since various clusters have different issue
width and policies.
\item \texttt{\textbf{\footnotesize opclass:}} histogram of how many uops
of various operation classes were issued. The operation classes are
defined in \texttt{\footnotesize ptlhwdef.h} and assigned to various
opcodes in \texttt{\small ptlhwdef.cpp}.
\end{itemize}
\texttt{\textbf{\footnotesize branchpred:}} branch predictor statistics

\begin{itemize}
\item \texttt{\textbf{\footnotesize predictions:}} total number of branch
predictions of any type
\item \texttt{\textbf{\footnotesize updates:}} total number of branch predictor
updates of any type
\item \texttt{\textbf{\footnotesize cond:}} conditional branch (\texttt{\footnotesize br.cc}{\footnotesize{}
}uop) prediction outcomes, broken down into correct predictions and
mispredictions
\item \texttt{\textbf{\footnotesize indir:}} indirect branch (\texttt{\footnotesize jmp}{\footnotesize{}
}uop) prediction outcomes, broken down into correct predictions and
mispredictions
\item \texttt{\textbf{\footnotesize return:}} return (\texttt{\footnotesize jmp}{\footnotesize{}
}uop with \texttt{\footnotesize BRANCH\_HINT\_RET}{\footnotesize{}
}flag) prediction outcomes, broken down into correct predictions and
mispredictions
\item \texttt{\textbf{\footnotesize summary:}}{\footnotesize{} }summary of
all prediction outcomes of the three types above, broken down into
correct predictions and mispredictions
\item \texttt{\textbf{\footnotesize ras:}}{\footnotesize{} }return address
stack (RAS) operations

\begin{itemize}
\item \texttt{\textbf{\footnotesize push:}}{\footnotesize{} }RAS pushes on
calls
\item \texttt{\textbf{\footnotesize push\_overflows:}}{\footnotesize{} }RAS
pushes on calls in which the RAS overflowed
\item \texttt{\textbf{\footnotesize pop:}}{\footnotesize{} }RAS pops on returns
\item \texttt{\textbf{\footnotesize pop\_underflows:}}{\footnotesize{} }RAS
pops on returns in which the RAS was empty
\item \texttt{\textbf{\footnotesize annuls:}}{\footnotesize{} }annulment
operations in which speculative updates to the RAS were rolled back
\end{itemize}
\end{itemize}

\section{Cache Subsystem}

The cache subsystem is listed under the \texttt{\footnotesize ooocore/dcache}
branch.

\texttt{\textbf{\footnotesize load:}} load unit statistics

\begin{itemize}
\item \texttt{\textbf{\footnotesize issue:}} histogram of the final disposition
of issuing each load uop

\begin{itemize}
\item \texttt{\textbf{\footnotesize complete:}} cache hit
\item \texttt{\textbf{\footnotesize miss:}} L1 cache miss, and possibly
lower levels as well (Sections \ref{sec:CacheMissHandling} and \ref{sec:InitiatingCacheMiss})
\item \texttt{\textbf{\footnotesize exception:}} load generated an exception
(typically a page fault), although the exception may still be speculative
(Section \ref{sec:IssuingLoads})
\item \texttt{\textbf{\footnotesize ordering:}} load was misordered with
respect to stores (Section \ref{sub:AliasCheck})
\item \texttt{\textbf{\footnotesize unaligned:}} load was unaligned and
will need to be re-executed as a pair of low and high loads (Sections
\ref{sub:UnalignedLoadsAndStores} and \ref{sec:IssuingLoads})
\item \texttt{\textbf{\footnotesize replay:}} histogram of events in which
a load needed to be replayed (Section \ref{sec:IssuingLoads})

\begin{itemize}
\item \texttt{\textbf{\footnotesize sfr-addr-and-data-not-ready:}}{\footnotesize{}
}load was predicted to forward data from a prior store (Section \ref{sub:AliasCheck}),
but neither the address nor the data of that store has resolved yet
\item \texttt{\textbf{\footnotesize sfr-addr-not-ready:}}{\footnotesize{}
}load was predicted to forward data from a prior store, but the address
of that store has not resolved yet
\item \texttt{\textbf{\footnotesize sfr-data-not-ready:}}{\footnotesize{}
}load address matched a prior store in the store queue, but the data
that store should write has not resolved yet
\item \texttt{\textbf{\footnotesize missbuf-full:}}{\footnotesize{} }load
missed the cache but the miss buffer and/or LFRQ (Section \ref{sec:InitiatingCacheMiss})
was full at the time
\end{itemize}
\end{itemize}
\item \texttt{\textbf{\footnotesize hit:}}{\footnotesize{} }histogram of
the cache hierarchy level each load finally hit

\begin{itemize}
\item \texttt{\textbf{\footnotesize L1:}}{\footnotesize{} }L1 cache hit
\item \texttt{\textbf{\footnotesize L2:}}{\footnotesize{} }L1 cache miss,
L2 cache hit
\item \texttt{\textbf{\footnotesize L3:}}{\footnotesize{} }L1 and L2 cache
miss, L3 cache hit
\item \texttt{\textbf{\footnotesize mem:}}{\footnotesize{} }all caches missed;
value read from main memory
\end{itemize}
\item \texttt{\textbf{\footnotesize forward:}}{\footnotesize{} }histogram
of which sources were used to fill each load

\begin{itemize}
\item \texttt{\textbf{\footnotesize cache:}}{\footnotesize{} }how many loads
obtained all their data from the cache
\item \texttt{\textbf{\footnotesize sfr:}}{\footnotesize{} }how many loads
obtained all their data from a prior store in the pipeline (i.e. load
completely overlapped that store)
\item \texttt{\textbf{\footnotesize sfr-and-cache:}}{\footnotesize{} }how
many loads obtained their data from a combination of the cache and
a prior store
\end{itemize}
\item \texttt{\textbf{\footnotesize dependency:}}{\footnotesize{} }histogram
of how loads related to previous stores

\begin{itemize}
\item \texttt{\textbf{\footnotesize independent:}}{\footnotesize{} }load
was independent of any store currently in the pipeline
\item \texttt{\textbf{\footnotesize predicted-alias-unresolved:}}{\footnotesize{}
}load was stalled because the load store alias predictor (LSAP) predicted
that an earlier store would overlap the load's address address even
though that earlier store's address was unresolved (Section \ref{sub:AliasCheck})
\item \texttt{\textbf{\footnotesize stq-address-match:}}{\footnotesize{}
}load depended on an earlier store still found in the store queue
\end{itemize}
\item \texttt{\textbf{\footnotesize type:}}{\footnotesize{} }histogram of
the type of each load uop

\begin{itemize}
\item \texttt{\textbf{\footnotesize aligned:}}{\footnotesize{} }normal aligned
loads
\item \texttt{\textbf{\footnotesize unaligned:}}{\footnotesize{} }special
unaligned load uops \texttt{\small ld.lo} or \texttt{\small ld.hi}
(Section \ref{sub:UnalignedLoadsAndStores})
\item \texttt{\textbf{\footnotesize internal:}}{\footnotesize{} }loads from
PTLsim space by microcode
\end{itemize}
\item \texttt{\textbf{\footnotesize size:}}{\footnotesize{} }histogram of
the size in bytes of each load uop
\item \texttt{\textbf{\footnotesize transfer-L2-to-L1:}}{\footnotesize{}
}histogram of the types of L2 to L1 line transfers that occurred (Section
\ref{sec:CacheHierarchy})

\begin{itemize}
\item \texttt{\textbf{\footnotesize full-L2-to-L1:}}{\footnotesize{} }all
bytes in cache line were transferred from L2 to L1 cache
\item \texttt{\textbf{\footnotesize partial-L2-to-L1:}}{\footnotesize{} }some
bytes in the L1 line were already valid (because of stores to those
bytes), but the remaining bytes still need to be fetched
\item \texttt{\textbf{\footnotesize L2-to-L1I:}}{\footnotesize{} }all bytes
in the L2 line were transferred into the L1 instruction cache
\end{itemize}
\item \texttt{\textbf{\footnotesize dtlb:}}{\footnotesize{} }data cache translation
lookaside buffer hit versus miss rate (Section \ref{sec:TranslationLookasideBuffers})
\end{itemize}
\texttt{\textbf{\footnotesize fetch:}}{\footnotesize{} }instruction
fetch unit statistics (Section \ref{sec:FetchStage})

\begin{itemize}
\item \texttt{\textbf{\footnotesize hit:}}{\footnotesize{} }histogram of
the cache hierarchy level each fetch finally hit

\begin{itemize}
\item \texttt{\textbf{\footnotesize L1:}}{\footnotesize{} }L1 cache hit
\item \texttt{\textbf{\footnotesize L2:}}{\footnotesize{} }L1 cache miss,
L2 cache hit
\item \texttt{\textbf{\footnotesize L3:}}{\footnotesize{} }L1 and L2 cache
miss, L3 cache hit
\item \texttt{\textbf{\footnotesize mem:}}{\footnotesize{} }all caches missed;
value read from main memory
\end{itemize}
\item \texttt{\textbf{\footnotesize itlb:}}{\footnotesize{} }instruction
cache translation lookaside buffer hit versus miss rate (Section \ref{sec:TranslationLookasideBuffers})
\end{itemize}
\texttt{\textbf{\footnotesize prefetches:}}{\footnotesize{} }prefetch
engine statistics

\begin{itemize}
\item \texttt{\textbf{\footnotesize in-L1:}}{\footnotesize{} }requested data
already in L1 cache
\item \texttt{\textbf{\footnotesize in-L2:}}{\footnotesize{} }requested data
already in L2 cache (and possibly also in L1 cache)
\item \texttt{\textbf{\footnotesize required:}}{\footnotesize{} }prefetch
was actually required (data was not cached or was in L3 or lower levels)
\end{itemize}
\texttt{\textbf{\footnotesize missbuf:}}{\footnotesize{} }miss buffer
performance (Sections \ref{sec:InitiatingCacheMiss} and \ref{sec:FillingCacheMiss})

\begin{itemize}
\item \texttt{\textbf{\footnotesize inserts:}}{\footnotesize{} }total number
of lines inserted into the miss buffer
\item \texttt{\textbf{\footnotesize delivers:}}{\footnotesize{} }total number
of lines delivered to various cache hierarchy levels from the miss
buffer

\begin{itemize}
\item \texttt{\textbf{\footnotesize mem-to-L3:}}{\footnotesize{} }deliver
line from main memory to the L3 cache
\item \texttt{\textbf{\footnotesize L3-to-L2:}}{\footnotesize{} }deliver
line to the L3 cache to the L2 cache
\item \texttt{\textbf{\footnotesize L2-to-L1D:}}{\footnotesize{} }deliver
line from the L2 cache to the L1 data cache
\item \texttt{\textbf{\footnotesize L2-to-L1I:}}{\footnotesize{} }deliver
line from the L2 cache to the L1 instruction cache
\end{itemize}
\end{itemize}
\texttt{\textbf{\footnotesize lfrq:}}{\footnotesize{} }load fill request
queue (LFRQ) performance (Sections \ref{sec:InitiatingCacheMiss}
and \ref{sec:FillingCacheMiss})

\begin{itemize}
\item \texttt{\textbf{\footnotesize inserts:}}{\footnotesize{} }total number
of loads inserted into the LFRQ
\item \texttt{\textbf{\footnotesize wakeups:}}{\footnotesize{} }total number
of loads awakened from the LFRQ
\item \texttt{\textbf{\footnotesize annuls:}}{\footnotesize{} }total number
of loads annulled in the LFRQ (after they were annulled in the processor
core)
\item \texttt{\textbf{\footnotesize resets:}}{\footnotesize{} }total number
of LFRQ resets (all entries cleared)
\item \texttt{\textbf{\footnotesize total-latency:}}{\footnotesize{} }total
latency in cycles of all loads passing through the LFRQ
\item \texttt{\textbf{\footnotesize average-miss-latency:}}{\footnotesize{}
}average load latency, weighted by cache level hit and latency to
that level
\item \texttt{\textbf{\footnotesize width:}}{\footnotesize{} }histogram of
how many loads were awakened per cycle by the LFRQ
\end{itemize}
\texttt{\textbf{\footnotesize store:}}{\footnotesize{} }store unit
statistics

\begin{itemize}
\item \texttt{\textbf{\footnotesize issue:}}{\footnotesize{} }histogram of
the final disposition of issuing each store uop

\begin{itemize}
\item \texttt{\textbf{\footnotesize complete:}}{\footnotesize{} }store completed
without problems
\item \texttt{\textbf{\footnotesize exception:}}{\footnotesize{} }store generated
an exception (typically a page fault), although the exception may
still be speculative (Section \ref{sec:StoreMerging})
\item \texttt{\textbf{\footnotesize ordering:}}{\footnotesize{} }store detected
that a later load in program order aliased the store but was issued
earlier than the store (Section \ref{sub:AliasCheck})
\item \texttt{\textbf{\footnotesize unaligned:}}{\footnotesize{} }store was
unaligned and will need to be re-executed as a pair of low and high
stores (Sections \ref{sub:UnalignedLoadsAndStores})
\item \texttt{\textbf{\footnotesize replay:}}{\footnotesize{} }histogram
of events in which a store needed to be replayed (Sections \ref{sec:SplitPhaseStores}
and \ref{sec:StoreMerging})

\begin{itemize}
\item \texttt{\textbf{\footnotesize wait-sfraddr-sfrdata:}}{\footnotesize{}
}neither the address nor the data of a prior store this store inherits
some of its data from was ready
\item \texttt{\textbf{\footnotesize wait-sfraddr:}}{\footnotesize{} }the
data of a prior store was ready but its address was still unavailable
\item \texttt{\textbf{\footnotesize wait-sfrdata:}}{\footnotesize{} }the
address of a prior store was ready but its data was still unavailable
\item \texttt{\textbf{\footnotesize wait-storedata-sfraddr-sfrdata:}}{\footnotesize{}
}the actual data value to store was not ready (Section \ref{sec:SplitPhaseStores}),
in addition to having neither the data nor the address of a prior
store (Section \ref{sec:StoreMerging})
\item \texttt{\textbf{\footnotesize wait-storedata-sfraddr:}}{\footnotesize{}
}the actual data value to store was not ready (Section \ref{sec:SplitPhaseStores}),
in addition to not having the address of the prior store (Section
\ref{sec:StoreMerging})
\item \texttt{\textbf{\footnotesize wait-storedata-sfrdata:}}{\footnotesize{}
}the actual data value to store was not ready (Section \ref{sec:SplitPhaseStores}),
in addition to not having the data from the prior store (Section \ref{sec:StoreMerging})
\end{itemize}
\end{itemize}
\item \texttt{\textbf{\footnotesize forward:}}{\footnotesize{} }histogram
of which sources were used to construct the merged store buffer:

\begin{itemize}
\item \texttt{\textbf{\footnotesize zero:}}{\footnotesize{} }no prior store
overlapping the current store was found in the pipeline
\item \texttt{\textbf{\footnotesize sfr:}}{\footnotesize{} }data from a prior
store in the pipeline was merged with the value to be stored to form
the final store buffer
\end{itemize}
\item \texttt{\textbf{\footnotesize type:}}{\footnotesize{} }histogram of
the type of each store uop

\begin{itemize}
\item \texttt{\textbf{\footnotesize aligned:}}{\footnotesize{} }normal aligned
store
\item \texttt{\textbf{\footnotesize unaligned:}}{\footnotesize{} }special
unaligned store uops \texttt{\small st.lo} or \texttt{\small st.hi}
(Section \ref{sub:UnalignedLoadsAndStores})
\item \texttt{\textbf{\footnotesize internal:}}{\footnotesize{} }stores to
PTLsim space by microcode
\end{itemize}
\item \texttt{\textbf{\footnotesize size:}}{\footnotesize{} }histogram of
the size in bytes of each store uop
\item \texttt{\textbf{\footnotesize commit:}}{\footnotesize{} }histogram
of how stores are committed

\begin{itemize}
\item \texttt{\textbf{\footnotesize direct:}}{\footnotesize{} }store committed
directly to the data cache in the commit stage (Section \ref{sec:CommitStage})
\end{itemize}
\item \texttt{\textbf{\footnotesize commits:}}{\footnotesize{} }total number
of committed uops
\item \texttt{\textbf{\footnotesize usercommits:}}{\footnotesize{} }total
number of committed x86 instructions
\item \texttt{\textbf{\footnotesize issues:}}{\footnotesize{} }total number
of uops issued. This includes uops issued more than once by through
replay (Section \ref{sec:Scheduling}).
\item \texttt{\textbf{\footnotesize ipc:}}{\footnotesize{} }Instructions
Per Cycle (IPC) statistics

\begin{itemize}
\item \texttt{\textbf{\footnotesize commit-in-uops:}}{\footnotesize{} }average
number of uops committed per cycle
\item \texttt{\textbf{\footnotesize issue-in-uops:}}{\footnotesize{} }average
number of uops issued per cycle
\item \texttt{\textbf{\footnotesize commit-in-user-insns:}}{\footnotesize{}
}average number of x86 instructions committed per cycle\\
\textbf{\emph{}}\\
\textbf{\emph{NOTE:}} Because one x86 instruction may be broken up
into numerous uops, it is \textbf{\emph{\underbar{never}}} appropriate
to compare IPC figures for committed x86 instructions per clock with
IPC values from a RISC machine. Furthermore, different x86 implementations
use varying numbers of uops per x86 instruction as a matter of encoding,
so even comparing the uop based IPC between x86 implementations or
RISC-like machines is inaccurate. Users are strongly advised to use
relative performance measures instead (e.g. total cycles taken to
complete a given benchmark).
\end{itemize}
\end{itemize}
\texttt{\textbf{\footnotesize simulator:}}{\footnotesize{} }describes
the performance of PTLsim itself. Useful for tuning the simulator.

\begin{itemize}
\item \texttt{\textbf{\footnotesize total\_time:}}{\footnotesize{} }total
time in seconds \emph{(not simulated cycles!)} spent in various parts
of the simulator. Please refer to the source code (in \texttt{\footnotesize ooocore.cpp})
for the range of code each time value corresponds to.
\item \texttt{\textbf{\footnotesize cputime:}}{\footnotesize{} }PTLsim simulator
performance

\begin{itemize}
\item \texttt{\textbf{\footnotesize fetch:}} seconds spent in fetch stage
\item \texttt{\textbf{\footnotesize decode:}} seconds spent decoding instructions
(in decoder subsystem)
\item \texttt{\textbf{\footnotesize rename:}} seconds spent in allocate
and rename stage
\item \texttt{\textbf{\footnotesize frontend:}} seconds spent in frontend
stages
\item \texttt{\textbf{\footnotesize dispatch:}} seconds spent in dispatch
stage
\item \texttt{\textbf{\footnotesize issue:}} seconds spent in ALU issue
stage, not including loads and stores
\item \texttt{\textbf{\footnotesize issueload:}} seconds spent issuing loads
\item \texttt{\textbf{\footnotesize issuestore:}} seconds spent issuing
stores
\item \texttt{\textbf{\footnotesize complete:}} seconds spent in completion
stage
\item \texttt{\textbf{\footnotesize transfer:}} seconds spent in transfer
stage
\item \texttt{\textbf{\footnotesize writeback:}} seconds spent in writeback
stage
\item \texttt{\textbf{\footnotesize commit:}} seconds spent in commit stage
\end{itemize}
\end{itemize}

\section{External Events}

\begin{itemize}
\item \texttt{\textbf{\footnotesize assists:}}{\footnotesize{} }histogram
of microcode assists invoked from any core
\item \textbf{traps:} histogram of x86 interrupt vectors (traps) invoked
from any core (PTLsim/X only)
\end{itemize}
\newpage{}

\begin{thebibliography}{1}
\bibitem{PTLsimISPASS}M. Yourst. \emph{PTLsim: A Cycle Accurate Full
System x86-64 Microarchitectural Simulator.} ISPASS 2007, April 2007.

\bibitem{XenSource}\href{http://www.xensource.com/xen/}{\emph{XenSource
Community Web Site.}}

\bibitem{XenCambridge}\href{http://www.cl.cam.ac.uk/Research/SRG/netos/xen/}{\emph{Xen
page at Cambridge.}}

\bibitem{Xen2Overview}\href{http://www.cl.cam.ac.uk/netos/papers/2004-xen-ols.pdf}{\emph{Xen
and the Art of Virtualization.} I. Pratt et al. Ottowa Linux Symposium
2004.}

\bibitem{XenCambridge}\href{http://www.cl.cam.ac.uk/Research/SRG/netos/xen/}{\emph{Xen
page at Cambridge.}}

\bibitem{Xen2Overview}\href{http://www.cl.cam.ac.uk/netos/papers/2004-xen-ols.pdf}{\emph{Xen
and the Art of Virtualization.} I. Pratt et al. Ottowa Linux Symposium
2004.}

\bibitem{Xen3}\href{http://www.cl.cam.ac.uk/netos/papers/2006-xen-fosdem.ppt}{\emph{Xen
3.0 Virtualization.} I. Pratt et al. FOSDEM 2006.}

\bibitem{XenIntroWiki}\href{http://wiki.xensource.com/xenwiki/XenIntro}{\emph{Introduction
to Xen 3.0.}}

\bibitem{XenPerformance}\href{http://www.cl.cam.ac.uk/Research/SRG/netos/xen/performance.html}{\emph{Xen
Performance Study.}}

\bibitem{QEMU}\href{http://www.qemu.org/qemu-tech.html}{\emph{QEMU
Internals.} F. Bellard. Tech Report, 2006.}

\bibitem{Bochs}\href{http://bochs.sourceforge.net/}{\emph{Bochs
IA-32 Emulator Project.}}

\bibitem{VMware}\href{http://www.usenix.org/event/usenix01/sugerman/sugerman.pdf}{\emph{Virtualizing
I/O Devices on VMware Workstation's Hosted Virtual Machine Monitor.}
J. Sugerman et al.}

\bibitem{Simics}\href{http://www.simics.com}{\emph{Simics.}}

\bibitem{SimNow}\href{http://www.hotchips.org/archives/hc16/2_Mon/15_HC16_Sess4_Pres1_bw.pdf}{\emph{SimNow:
Fast Platform Simulation Purely in Software.} R. Bedichek (AMD). Hot
Chips 2004.}

\bibitem{Intel-VT}\href{http://download.intel.com/design/Pentium4/manuals/25366820.pdf}{\emph{IA-32
Intel Architecture Software Developer's Manual, Volume 3A: System
Programming Guide, Part 1,} Chapter 19, {}``Introduction to Virtual
Machine Extensions''.}

\bibitem{AMD-SVM}\href{http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf}{\emph{AMD64
Architecture Programmer's Manual, Volume 2: System Programming,} Chapter
15, {}``Secure Virtual Machine''.}

\begin{singlespace}
\bibitem{TransmetaPatent.TBit}E. Kelly et al. \emph{Translated memory
protection apparatus for an advanced microprocessor.} U.S. Patent
6199152, filed 22 Aug 1996. Assn. Transmeta Corp.

\bibitem{TransmetaPatent.SubPageTBit}J. Banning et al. \emph{Fine
grain translation discrimination.} U.S. Patent 6363336, filed 13 Oct
1999. Assn. Transmeta Corp.

\bibitem{TransmetaPatent.SelfRevalTrans}J. Banning et al. \emph{Translation
consistency checking for modified target instructions by comparing
to original copy.} U.S. Patent 6594821, filed 30 Mar 2000. Assn. Transmeta
Corp.

\bibitem{IBMDaisyIEEE}K. Ebcioglu et al. \emph{Dynamic Binary Translation
and Optimization}. IEEE Trans. Computers, June 2001.

\bibitem{IBMDaisyTechReport}K. Ebcioglu, E. Altman. \emph{DAISY:
Dynamic Compilation for 100\% Architectural Compatibility.} IBM Research
Report RC 20538, 5 Aug 1996.

\bibitem{IBMDaisyManual}E. Altman, K. Ebcioglu. \emph{DAISY Dynamic
Binary Translation Software.} Software Manual for DAISY Open Source
Release, 2000.\end{singlespace}

\end{thebibliography}

\end{document}

