Performance Counters
PTLsim maintains hundreds of performance and statistical counters
and data points as it simulates user code. Section 8 described
the basic mechanisms and data structures through which PTLsim collects
these data, and presented a guide to extending the existing set
of collection points.
This section is a reference listing of all the current performance
counters present in PTLsim by default. The sections below are arranged
in a hierarchical tree format, just as the data are represented in
PTLsim's data store. The types of data collected closely match the
performance counters available on modern Intel and AMD x86 processors,
as described in their respective reference manuals.
As described in Section 8, PTLsim
maintains a hierarchical tree of statistical data, defined in stats.h.
The data store contains a potentially large number of snapshots of
this tree, numbered starting at 0. The final snapshot, taken just
before simulation completes, is labeled "final". Each snapshot
branch contains all of the data structures described in the next few
sections. Snapshots are enabled with the -snapshot-cycles
configuration option (Section 10.3); if
they are disabled, only the "0" and "final" snapshots
are provided.
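As a rough illustration of working with such a tree of snapshots, the sketch below models two snapshots as nested Python dictionaries and subtracts one from the other to obtain the counts for the interval between them. The dictionary layout, the sample values and the helper names `lookup` and `subtract` are assumptions made for this example; they are not PTLsim's data store format or API. The sketch also assumes that counters accumulate from the start of simulation.

```python
def lookup(tree, path):
    """Follow a slash-separated path such as 'summary/cycles' through nested dicts."""
    node = tree
    for key in path.split("/"):
        node = node[key]
    return node


def subtract(later, earlier):
    """Return later - earlier for every numeric leaf; other leaves are copied from `later`."""
    result = {}
    for key, value in later.items():
        if isinstance(value, dict):
            result[key] = subtract(value, earlier.get(key, {}))
        elif isinstance(value, (int, float)):
            result[key] = value - earlier.get(key, 0)
        else:
            result[key] = value
    return result


# Hypothetical snapshots "0" and "final"; the numbers are made up.
snapshots = {
    "0": {"summary": {"cycles": 1000, "insns": 800, "uops": 1200}},
    "final": {"summary": {"cycles": 5000, "insns": 4400, "uops": 6000}},
}

# Counts accumulated between snapshot "0" and the final snapshot.
interval = subtract(snapshots["final"], snapshots["0"])
print(lookup(interval, "summary/cycles"))
```

The same subtraction applies to any pair of numbered snapshots, which is one way to isolate a region of interest from warm-up activity.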
The summary toplevel branch
summarizes information about the simulation run across all cores:
summary: general information
- cycles: total number
of simulated cycles completed
- insns: total number
of complete x86 instructions committed
- uops: total number
of uops committed
- basic_blocks: total
number of basic blocks executed
snapshot_uuid: the
universally unique ID (UUID) of this snapshot. This number starts
from 0 and increases without bound.
snapshot_name: name
of this snapshot, if any. Named snapshots can be taken by the ptlcall_snapshot()
call within the virtual machine, or by the -snapshot-now
name command.
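The summary counters lend themselves to a few simple derived ratios. The following sketch, using made-up values and a hypothetical `ratio` helper, computes the average number of uops per x86 instruction and an approximate average basic block length; neither figure is stored by PTLsim under these names.

```python
def ratio(numerator, denominator):
    """Divide two counters, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


# Made-up values for the summary branch of one snapshot.
summary = {"cycles": 5000, "insns": 4000, "uops": 6000, "basic_blocks": 800}

# Average number of uops each committed x86 instruction expanded into.
uops_per_insn = ratio(summary["uops"], summary["insns"])

# Approximate average basic block length in x86 instructions.
insns_per_block = ratio(summary["insns"], summary["basic_blocks"])

print(uops_per_insn, insns_per_block)
```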
The simulator toplevel branch
represents information about PTLsim itself:
version: PTLsim version
information
- build_timestamp: the
date and time PTLsim (specifically, ptlsim.o)
was last built
- svn_revision: Subversion
revision number for this PTLsim version
- svn_timestamp: Date
and time of Subversion commit for this version
- build_hostname: machine
name on which PTLsim was compiled
- build_compiler: gcc
compiler version used to build PTLsim
run: runtime environment
information
- timestamp: time (in
POSIX seconds-since-epoch format) this instance of PTLsim was started
- hostname: machine
name on which PTLsim is running
- kernel_version: Linux
kernel version PTLsim is running under. For PTLsim/X, this is the
domain 0 kernel version
- hypervisor_version: PTLsim/X Xen hypervisor version
- executable: the executable
file being run under simulation (userspace PTLsim only)
- args: the arguments
to the executable file (userspace PTLsim only)
- native_cpuid: CPUID
(brand/model/revision) of the host machine running PTLsim
- native_hz: core frequency
(cycles per second) of the host machine
config: the configuration
options last passed to PTLsim for this run
performance: PTLsim
internal performance data
- rate: operations per
wall-clock second (i.e. in outside world, not inside the virtual machine),
averaged over entire run. These are the status lines PTLsim prints
on the console and in the log file as it runs.
- cycles_per_second: simulated cycles completed per second
- issues_per_second: uops issued per second
- user_commits_per_second: x86 instructions committed per second
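One use of these rate figures is to estimate how much slower the simulation runs than the host. The sketch below divides native_hz by cycles_per_second; the function name `slowdown` and the sample numbers are assumptions for illustration, and the result is only a rough guide because a simulated cycle and a host cycle do not perform the same amount of work.

```python
def slowdown(native_hz, cycles_per_second):
    """Host cycles that elapse for each simulated cycle."""
    if cycles_per_second <= 0:
        raise ValueError("cycles_per_second must be positive")
    return native_hz / cycles_per_second


# Made-up values: a 2 GHz host simulating 400,000 cycles per wall-clock second.
native_hz = 2_000_000_000
cycles_per_second = 400_000

print(slowdown(native_hz, cycles_per_second))
```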
The decoder toplevel branch represents the
x86-to-uop decoder, basic block cache, code page cache and other common
structures:
throughput: total decoded entities
- basic_blocks: total basic blocks
(uop sequence terminated by a branch) decoded
- x86_insns: total
x86 instructions decoded
- uops: total uops produced
from all decoded x86 instructions
- bytes: total bytes
in all decoded x86 instructions
bb_decode_type: predominant
decoder type used for each basic block
- all_insns_fast: number
of basic blocks in which all instructions were in the simple
regular subset of x86 and could be decoded entirely by the fast decoder
(decode-fast.cpp)
- some_insns_complex: number of basic blocks in which one or more instructions required
complex decoding
page_crossings: alignment
of instructions within page
- within_page: number
of basic blocks in which all bytes in the basic block fell within
a single page
- crosses_page: number
of basic blocks in which some bytes crossed a page boundary (i.e.
required two MFN invalidate locators)
bbcache: basic block
cache accesses
- count: basic blocks
currently in the cache (i.e. at the time the stats snapshot was made)
- inserts: total insert
operations
- invalidates: invalidation
operations by type
- smc: self modifying
code required page to be invalidated
- dma: DMA into page
with existing translations required page to be invalidated
- spurious: exec_page_fault assist determined the page has now been made executable
- reclaim: garbage collector
discarded unused LRU basic blocks
- dirty: page was already
dirty when new translation was to be made
- empty: page was empty
(has no basic blocks)
pagecache: physical
code page cache
- count: physical pages
currently in the cache (i.e. at the time the stats snapshot was made)
- inserts: total physical
page insert operations
- invalidates: invalidation
operations by type
- smc: self modifying
code required page to be invalidated
- dma: DMA into page
with existing translations required page to be invalidated
- spurious: exec_page_fault assist determined the page has now been made executable
- reclaim: garbage collector
discarded unused LRU basic blocks
- dirty: page was already
dirty when new translation was to be made
- empty: page was empty
(has no basic blocks)
reclaim_rounds: number
of times the memory manager attempted to reclaim unused basic blocks
(possibly with several attempts until enough memory was available)
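Several of the decoder nodes above are small histograms whose entries are easiest to read as fractions of their total. A minimal sketch of that normalization follows; the `fractions` helper and the sample counts are hypothetical.

```python
def fractions(histogram):
    """Convert raw counts into fractions of the histogram total."""
    total = sum(histogram.values())
    if total == 0:
        return {key: 0.0 for key in histogram}
    return {key: count / total for key, count in histogram.items()}


# Made-up counts for two decoder histograms.
bb_decode_type = {"all_insns_fast": 900, "some_insns_complex": 100}
page_crossings = {"within_page": 990, "crosses_page": 10}

decode_mix = fractions(bb_decode_type)
crossing_mix = fractions(page_crossings)
print(decode_mix, crossing_mix)
```

A high some_insns_complex fraction would suggest that the workload leans on instructions outside the fast decoder's subset.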
The out of order core is represented by the ooocore
toplevel branch of the statistics data store tree:
cycles: total number of processor
cycles simulated
fetch: fetch stage statistics
- stop: totals up the reasons why fetching
finally stopped in each cycle
- stalled: fetch unit
was already stalled in the previous cycle
- icache_miss: an instruction
cache miss prevented further fetches
- fetchq_full: the
uop fetch queue is full
- bogus_rip: speculative
execution redirected the fetch unit to an inaccessible (or non-executable)
page. The fetch unit remains stalled in this state until the mis-speculation
is resolved.
- microcode_assist: microcode assist must wait for pipeline to empty
- branch_taken: taken branches to non-sequential
addresses always stop fetching
- full_width: the maximum fetch width
was utilized without encountering any of the events above
- opclass: histogram
of how many uops of various operation classes passed through the fetch
unit. The operation classes are defined in ptlhwdef.h
and assigned to various opcodes in ptlhwdef.cpp.
- width: histogram of the fetch width
actually used on each cycle
- blocks: blocks of x86 instructions
fetched (typically the processor can read at most e.g. 16 bytes out
of a 64 byte instruction cache line per cycle)
- uops: total number of uops fetched
- user_insns: total number of x86 instructions
fetched
frontend: frontend pipeline (decode,
allocate, rename) statistics
- status: totals up the reasons why
frontend processing finally stopped in each cycle
- complete: all uops were successfully
allocated and renamed
- fetchq_empty: no more uops were available
for allocation
- rob_full: reorder buffer (ROB) was
full
- physregs_full: physical register
file was full even though an ROB slot was free
- ldq_full: load queue was full (too
many loads in the pipeline) even though physical registers were available
- stq_full: store queue was full (too
many stores in the pipeline)
- width: histogram of the frontend width
actually used on each cycle
- renamed: summarizes the type of renaming
that occurred for each uop (of the destination, not the operands)
- none: uop did not rename its destination
(primarily for stores and branches)
- reg: uop renamed destination architectural
register
- flags: uop renamed one or more of
the ZAPS, CF, OF flag sets but had no destination architectural register
- reg_and_flags: uop renamed one or
more of the ZAPS, CF, OF flag sets as well as a destination architectural
register
- alloc: summarizes the type of resource
allocation that occurred for each uop (in addition to its ROB slot):
- reg: uop was allocated a physical
register
- ldreg: uop was a load and was allocated
both a physical register and a load queue entry
- sfr: uop was a store and was allocated
a store forwarding register (SFR), a.k.a. store queue entry
- br: uop was a branch and was allocated
branch-related resources (possibly including a destination physical
register)
dispatch: dispatch unit statistics
- source: totals up where each operand
to each uop currently resided at the time the uop was dispatched.
These statistics are broken out by cluster.
- waiting: how many operands were waiting
(i.e. not yet ready)
- bypass: how many operands would come
from the bypass network if the uop were immediately issued
- physreg: how many operands were already
written back to physical registers
- archreg: how many operands would be
obtained from architectural registers
- cluster: tracks the number of uops
issued to each cluster (or issue queue) in the processor. This list
will vary depending on the processor configuration. The value none
means that no cluster could accept the uop because all issue queues
were full.
- redispatch: statistics on the redispatch
speculation recovery mechanism (Section 20.3.2)
- trigger_uops: how many uops
triggered redispatching because of a mis-speculation. This number
does not count towards the statistics below.
- deadlock_flushes: how many
times the pipeline had to be flushed to resolve a deadlock
- dependent_uops: histogram of
how many uops depended on each trigger uop, not including the trigger
uop itself
issue: issue statistics
- result: histogram of the final disposition
of issuing each uop
- no-fu: no functional unit was available
within the uop's assigned cluster even though it was already issued
- replay: uop attempted to execute but
could not complete, so it must remain in the issue queue to be replayed.
This event generally occurs when a load or store detects a previously
unknown forwarding dependency on a prior store, when the data to actually
store is not yet available, or when insufficient resources are available
to complete the memory operation. Details are given in Sections 21
and 22.2.
- misspeculation: uop mis-speculated
and now all uops after and including the issued uop must be annulled.
This generally occurs with loads (Section 21)
and stores (Section 22.2.1) when unaligned accesses
or load-store aliasing occurs. This event is handled in accordance
with Section 20.3.2.
- refetch: uop and all subsequent uops
must be re-fetched to be decoded differently. For example, unaligned
loads and stores take this path so they can be cracked into two parts
after being refetched.
- branch_mispredict: uop was a branch
and mispredicted, such that all uops after (but not including) the
branch uop must be annulled. See Section 20
for details.
- exception: uop caused an exception
(though this may not be a user visible error due to speculative execution)
- complete: uop completed successfully.
Note that this does not mean the result is immediately ready;
for loads it simply means the request was issued to the cache.
- source: totals up where each operand
to each uop was read from as it was issued
- bypass: how many operands came directly
off the bypass network
- physreg: how many operands were read
from physical registers
- archreg: how many operands were read
from committed architectural registers
- width: histogram of the issue width
actually used on each cycle in each cluster. This object is further
broken down by cluster, since various clusters have different issue
width and policies.
- opclass: histogram of how many uops
of various operation classes were issued. The operation classes are
defined in ptlhwdef.h and assigned to various
opcodes in ptlhwdef.cpp.
writeback: writeback stage statistics
- total_writebacks: total number of
results written back to the physical register file
- transient: transient versus persistent
values
- transient: the result technically
does not have to be written back to the physical register file at
all, since all consumers sourced the value off the bypass network
and the result is no longer available since the destination architectural
register pointing to it has since been renamed.
- persistent: all values which do not
meet the conditions above and hence must still be written back
- width: histogram of the writeback
width actually used on each cycle in each cluster. This object is
further broken down by cluster, since various clusters have different
issue width and policies.
commit: commit unit statistics
- uops: total number of uops committed
- insns: total number of complete x86
instructions committed
- result: histogram of the final disposition
of attempting to commit each uop
- none: one or more uops comprising
the x86 instruction at the head of the ROB were not yet ready to commit,
so commitment is terminated for that cycle
- ok: result was successfully committed
- exception: result caused a genuine
user visible exception. In userspace PTLsim, this will terminate the
simulation. In full system PTLsim/X, this is a normal and frequent
event. Floating point state dirty faults are counted under this category.
- skipblock: This occurs in rare cases
when the processor must skip over the currently executing instruction
(such as in pathological cases of the rep x86
instructions).
- barrier: the processor encountered
a barrier instruction, such as a system call, assist or pipeline flush.
The frontend has already been stopped and fetching has been redirected
to the code to handle the barrier; this condition simply commits the
barrier instruction itself.
- smc: self modifying code: the instruction
attempting to commit has been modified since it was last decoded (see
Section 6.4)
- stop: special case for when the simulation
is to be stopped after committing a certain number of x86 instructions
(e.g. via the -stopinsns option in Section
10.3).
- setflags: how many uops updated the
condition code flags as they committed
- yes: how many uops updated at least
one of the ZAPS, CF, OF flag sets (the REG_flags
internal architectural register)
- no: how many uops did not update any
flags
- freereg: how many uops were able to
free the old physical register mapped to their architectural destination
register at commit time
- pending: old physical register was
still referenced within the pipeline or by one or more rename table
entries
- free: old physical register could
be immediately freed
- free_regs_recycled: how many physical
registers were recycled (garbage collected) later than normal because
of one of the conditions above
- width: histogram of the commit width
actually used on each cycle in each cluster. This object is further
broken down by cluster, since various clusters have different issue
width and policies.
- opclass: histogram of how many uops
of various operation classes were committed. The operation classes are
defined in ptlhwdef.h and assigned to various
opcodes in ptlhwdef.cpp.
branchpred: branch predictor statistics
- predictions: total number of branch
predictions of any type
- updates: total number of branch predictor
updates of any type
- cond: conditional branch (br.cc uop) prediction outcomes, broken down into correct predictions and
mispredictions
- indir: indirect branch (jmp uop) prediction outcomes, broken down into correct predictions and
mispredictions
- return: return (jmp uop with BRANCH_HINT_RET flag) prediction outcomes, broken down into correct predictions and
mispredictions
- summary: summary of
all prediction outcomes of the three types above, broken down into
correct predictions and mispredictions
- ras: return address
stack (RAS) operations
- push: RAS pushes on
calls
- push_overflows: RAS
pushes on calls in which the RAS overflowed
- pop: RAS pops on returns
- pop_underflows: RAS
pops on returns in which the RAS was empty
- annuls: annulment
operations in which speculative updates to the RAS were rolled back
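The branchpred outcome counters can be turned into misprediction rates with a few lines of arithmetic. In the sketch below, the leaf names `correct` and `mispred` are assumptions standing in for whatever the two outcome counters are called in the data store, and all counts are made up. It also rebuilds the summary node by adding up the three branch types.

```python
def mispredict_rate(outcome):
    """Fraction of predictions that were wrong for one branch type."""
    total = outcome["correct"] + outcome["mispred"]
    return outcome["mispred"] / total if total else 0.0


def summarize(outcomes_by_type):
    """Add up the outcome counters of all branch types."""
    summary = {"correct": 0, "mispred": 0}
    for outcome in outcomes_by_type.values():
        summary["correct"] += outcome["correct"]
        summary["mispred"] += outcome["mispred"]
    return summary


# Made-up counts; "correct" and "mispred" are hypothetical leaf names.
branchpred = {
    "cond": {"correct": 9000, "mispred": 1000},
    "indir": {"correct": 450, "mispred": 50},
    "return": {"correct": 495, "mispred": 5},
}

overall = summarize(branchpred)
for name, outcome in branchpred.items():
    print(name, mispredict_rate(outcome))
print("overall", mispredict_rate(overall))
```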
The cache subsystem is listed under the ooocore/dcache
branch.
load: load unit statistics
- issue: histogram of the final disposition
of issuing each load uop
- complete: cache hit
- miss: L1 cache miss, and possibly
lower levels as well (Sections 21.4 and 25.2)
- exception: load generated an exception
(typically a page fault), although the exception may still be speculative
(Section 21)
- ordering: load was misordered with
respect to stores (Section 22.2.1)
- unaligned: load was unaligned and
will need to be re-executed as a pair of low and high loads (Sections
5.6 and 21)
- replay: histogram of events in which
a load needed to be replayed (Section 21)
- sfr-addr-and-data-not-ready: load was predicted to forward data from a prior store (Section 22.2.1),
but neither the address nor the data of that store has resolved yet
- sfr-addr-not-ready: load was predicted to forward data from a prior store, but the address
of that store has not resolved yet
- sfr-data-not-ready: load address matched a prior store in the store queue, but the data
that store should write has not resolved yet
- missbuf-full: load
missed the cache but the miss buffer and/or LFRQ (Section 25.2)
was full at the time
- hit: histogram of
the cache hierarchy level each load finally hit
- L1: L1 cache hit
- L2: L1 cache miss,
L2 cache hit
- L3: L1 and L2 cache
miss, L3 cache hit
- mem: all caches missed;
value read from main memory
- forward: histogram
of which sources were used to fill each load
- cache: how many loads
obtained all their data from the cache
- sfr: how many loads
obtained all their data from a prior store in the pipeline (i.e. load
completely overlapped that store)
- sfr-and-cache: how
many loads obtained their data from a combination of the cache and
a prior store
- dependency: histogram
of how loads related to previous stores
- independent: load
was independent of any store currently in the pipeline
- predicted-alias-unresolved: load was stalled because the load store alias predictor (LSAP) predicted
that an earlier store would overlap the load's address even
though that earlier store's address was unresolved (Section 22.2.1)
- stq-address-match: load depended on an earlier store still found in the store queue
- type: histogram of
the type of each load uop
- aligned: normal aligned
loads
- unaligned: special
unaligned load uops ld.lo or ld.hi
(Section 5.6)
- internal: loads from
PTLsim space by microcode
- size: histogram of
the size in bytes of each load uop
- transfer-L2-to-L1: histogram of the types of L2 to L1 line transfers that occurred (Section
25)
- full-L2-to-L1: all
bytes in cache line were transferred from L2 to L1 cache
- partial-L2-to-L1: some
bytes in the L1 line were already valid (because of stores to those
bytes), but the remaining bytes still need to be fetched
- L2-to-L1I: all bytes
in the L2 line were transferred into the L1 instruction cache
- dtlb: data cache translation
lookaside buffer hit versus miss rate (Section 25.4)
fetch: instruction
fetch unit statistics (Section 17.1)
- hit: histogram of
the cache hierarchy level each fetch finally hit
- L1: L1 cache hit
- L2: L1 cache miss,
L2 cache hit
- L3: L1 and L2 cache
miss, L3 cache hit
- mem: all caches missed;
value read from main memory
- itlb: instruction
cache translation lookaside buffer hit versus miss rate (Section 25.4)
prefetches: prefetch
engine statistics
- in-L1: requested data
already in L1 cache
- in-L2: requested data
already in L2 cache (and possibly also in L1 cache)
- required: prefetch
was actually required (data was not cached or was in L3 or lower levels)
missbuf: miss buffer
performance (Sections 25.2 and 25.3)
- inserts: total number
of lines inserted into the miss buffer
- delivers: total number
of lines delivered to various cache hierarchy levels from the miss
buffer
- mem-to-L3: deliver
line from main memory to the L3 cache
- L3-to-L2: deliver
line from the L3 cache to the L2 cache
- L2-to-L1D: deliver
line from the L2 cache to the L1 data cache
- L2-to-L1I: deliver
line from the L2 cache to the L1 instruction cache
lfrq: load fill request
queue (LFRQ) performance (Sections 25.2
and 25.3)
- inserts: total number
of loads inserted into the LFRQ
- wakeups: total number
of loads awakened from the LFRQ
- annuls: total number
of loads annulled in the LFRQ (after they were annulled in the processor
core)
- resets: total number
of LFRQ resets (all entries cleared)
- total-latency: total
latency in cycles of all loads passing through the LFRQ
- average-miss-latency: average load latency, weighted by cache level hit and latency to
that level
- width: histogram of
how many loads were awakened per cycle by the LFRQ
store: store unit
statistics
- issue: histogram of
the final disposition of issuing each store uop
- complete: store completed
without problems
- exception: store generated
an exception (typically a page fault), although the exception may
still be speculative (Section 22.1)
- ordering: store detected
that a later load in program order aliased the store but was issued
earlier than the store (Section 22.2.1)
- unaligned: store was
unaligned and will need to be re-executed as a pair of low and high
stores (Section 5.6)
- replay: histogram
of events in which a store needed to be replayed (Sections 22.2
and 22.1)
- wait-sfraddr-sfrdata: neither the address nor the data of a prior store, from which this store
inherits some of its data, was ready
- wait-sfraddr: the
data of a prior store was ready but its address was still unavailable
- wait-sfrdata: the
address of a prior store was ready but its data was still unavailable
- wait-storedata-sfraddr-sfrdata: the actual data value to store was not ready (Section 22.2),
in addition to having neither the data nor the address of a prior
store (Section 22.1)
- wait-storedata-sfraddr: the actual data value to store was not ready (Section 22.2),
in addition to not having the address of the prior store (Section
22.1)
- wait-storedata-sfrdata: the actual data value to store was not ready (Section 22.2),
in addition to not having the data from the prior store (Section 22.1)
- forward: histogram
of which sources were used to construct the merged store buffer:
- zero: no prior store
overlapping the current store was found in the pipeline
- sfr: data from a prior
store in the pipeline was merged with the value to be stored to form
the final store buffer
- type: histogram of
the type of each store uop
- aligned: normal aligned
store
- unaligned: special
unaligned store uops st.lo or st.hi
(Section 5.6)
- internal: stores to
PTLsim space by microcode
- size: histogram of
the size in bytes of each store uop
- commit: histogram
of how stores are committed
- direct: store committed
directly to the data cache in the commit stage (Section 24)
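The load hit histogram described above under ooocore/dcache can be converted into per-level miss rates. The sketch below computes a local miss rate for each cache level, that is, the fraction of accesses reaching a level that had to go further down. The function name `local_miss_rates` and the counts are made up for illustration; the level names follow the hit histogram entries listed earlier.

```python
LEVELS = ["L1", "L2", "L3", "mem"]


def local_miss_rates(hit):
    """For each cache level, the fraction of accesses reaching it that missed there."""
    rates = {}
    # Every load reaches L1; only the misses of one level reach the next.
    remaining = sum(hit.get(level, 0) for level in LEVELS)
    for level in LEVELS[:-1]:
        hits_here = hit.get(level, 0)
        rates[level] = (remaining - hits_here) / remaining if remaining else 0.0
        remaining -= hits_here
    return rates


# Made-up load hit histogram.
load_hit = {"L1": 9000, "L2": 700, "L3": 200, "mem": 100}

rates = local_miss_rates(load_hit)
print(rates)
```

The same function applies unchanged to the hit histogram of the instruction fetch unit.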
- commits: total number
of committed uops
- usercommits: total
number of committed x86 instructions
- issues: total number
of uops issued. This includes uops issued more than once through
replay (Section 19.3).
- ipc: Instructions
Per Cycle (IPC) statistics
- commit-in-uops: average
number of uops committed per cycle
- issue-in-uops: average
number of uops issued per cycle
- commit-in-user-insns: average number of x86 instructions committed per cycle
NOTE: Because one x86 instruction may be broken up
into numerous uops, it is never appropriate
to compare IPC figures for committed x86 instructions per clock with
IPC values from a RISC machine. Furthermore, different x86 implementations
use varying numbers of uops per x86 instruction as a matter of encoding,
so even comparing the uop based IPC between x86 implementations or
RISC-like machines is inaccurate. Users are strongly advised to use
relative performance measures instead (e.g. total cycles taken to
complete a given benchmark).
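To make the advice above concrete, the following sketch first derives the three IPC figures from the raw counters, on the assumption that each is simply the corresponding count divided by the cycle count, and then compares two runs of the same benchmark by total cycles rather than by IPC. All numbers and the helper names `per_cycle` and `relative_speedup` are hypothetical.

```python
def per_cycle(count, cycles):
    """Average number of events per simulated cycle."""
    return count / cycles if cycles else 0.0


def relative_speedup(baseline_cycles, candidate_cycles):
    """How many times faster the candidate finishes the same benchmark."""
    if candidate_cycles <= 0:
        raise ValueError("candidate_cycles must be positive")
    return baseline_cycles / candidate_cycles


# Made-up counters from one run.
counters = {"cycles": 5000, "commits": 6000, "usercommits": 4000, "issues": 7000}

ipc = {
    "commit-in-uops": per_cycle(counters["commits"], counters["cycles"]),
    "issue-in-uops": per_cycle(counters["issues"], counters["cycles"]),
    "commit-in-user-insns": per_cycle(counters["usercommits"], counters["cycles"]),
}
print(ipc)

# Relative measure: the same benchmark took 5000 cycles on the baseline
# configuration and 4000 cycles on the configuration under study.
print(relative_speedup(5000, 4000))
```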
simulator: describes
the performance of PTLsim itself. Useful for tuning the simulator.
- total_time: total
time in seconds (not simulated cycles!) spent in various parts
of the simulator. Please refer to the source code (in ooocore.cpp)
for the range of code each time value corresponds to.
- cputime: PTLsim simulator
performance
- fetch: seconds spent in fetch stage
- decode: seconds spent decoding instructions
(in decoder subsystem)
- rename: seconds spent in allocate
and rename stage
- frontend: seconds spent in frontend
stages
- dispatch: seconds spent in dispatch
stage
- issue: seconds spent in ALU issue
stage, not including loads and stores
- issueload: seconds spent issuing loads
- issuestore: seconds spent issuing
stores
- complete: seconds spent in completion
stage
- transfer: seconds spent in transfer
stage
- writeback: seconds spent in writeback
stage
- commit: seconds spent in commit stage
- assists: histogram
of microcode assists invoked from any core
- traps: histogram of x86 interrupt vectors (traps) invoked
from any core (PTLsim/X only)
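Histograms such as assists and traps are often easiest to read when sorted by frequency. The sketch below ranks a made-up trap histogram keyed by x86 vector number; keying by integer vector and the helper name `top_entries` are assumptions for this example, not a description of how PTLsim stores the histogram.

```python
def top_entries(histogram, n):
    """Return the n (key, count) pairs with the highest counts, ties broken by key."""
    return sorted(histogram.items(), key=lambda item: (-item[1], item[0]))[:n]


# Made-up counts keyed by x86 vector: 3 = breakpoint, 6 = invalid opcode,
# 13 = general protection fault, 14 = page fault.
traps = {14: 1200, 13: 3, 6: 1, 3: 3}

for vector, count in top_entries(traps, 2):
    print(vector, count)
```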
Matt T Yourst
2007-09-26