Performance Counters

PTLsim maintains hundreds of performance and statistical counters and data points as it simulates user code. Section 8 described the basic mechanisms and data structures through which PTLsim collects these data and presented a guide to extending the existing set of collection points.

This section is a reference listing of all the current performance counters present in PTLsim by default. The sections below are arranged in a hierarchical tree format, just as the data are represented in PTLsim's data store. The types of data collected closely match the performance counters available on modern Intel and AMD x86 processors, as described in their respective reference manuals.

General

As described in Section 8, PTLsim maintains a hierarchical tree of statistical data, defined in stats.h. The data store contains a potentially large number of snapshots of this tree, numbered starting at 0. The final snapshot, taken just before simulation completes, is labeled as "final". Each snapshot branch contains all of the data structures described in the next few sections. Snapshots are enabled with the -snapshot-cycles configuration option (Section 10.3); if they are disabled, only the "0" and "final" snapshots are provided.
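
A common use of numbered snapshots is to isolate the activity between two of them. The following Python sketch assumes, hypothetically, that the snapshot trees have already been exported as nested dictionaries and that the counters are cumulative. Neither the dictionary layout nor the subtract_trees helper is part of PTLsim; the sample values are invented.

```python
def subtract_trees(later, earlier):
    """Return later - earlier for every numeric leaf of two nested dicts."""
    delta = {}
    for key, value in later.items():
        if isinstance(value, dict):
            # Recurse into sub-branches; a missing branch counts as empty.
            delta[key] = subtract_trees(value, earlier.get(key, {}))
        elif isinstance(value, (int, float)):
            delta[key] = value - earlier.get(key, 0)
        else:
            # Non-numeric leaves (e.g. names) are copied unchanged.
            delta[key] = value
    return delta


# Hypothetical exported snapshots, keyed by snapshot label.
snapshots = {
    "0": {"summary": {"cycles": 0, "insns": 0}},
    "1": {"summary": {"cycles": 1000, "insns": 800}},
    "final": {"summary": {"cycles": 2500, "insns": 2100}},
}

interval = subtract_trees(snapshots["final"], snapshots["1"])
```

Here interval holds the counts accumulated between snapshot 1 and the final snapshot (1500 cycles and 1300 instructions for the invented data).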

Summary

The summary toplevel branch summarizes information about the simulation run across all cores:

summary: general information

cycles: total number of simulated cycles completed
insns: total number of complete x86 instructions committed
uops: total number of uops committed
basic_blocks: total number of basic blocks executed

snapshot_uuid: the universally unique ID (UUID) of this snapshot. This number starts at 0 and increases without bound.

snapshot_name: name of this snapshot, if any. Named snapshots can be taken by the ptlcall_snapshot() call within the virtual machine, or by the -snapshot-now name command.

Simulator

The simulator toplevel branch represents information about PTLsim itself:

version: PTLsim version information

build_timestamp: the date and time PTLsim (specifically, ptlsim.o) was last built
svn_revision: Subversion revision number for this PTLsim version
svn_timestamp: Date and time of Subversion commit for this version
build_hostname: machine name on which PTLsim was compiled
build_compiler: gcc compiler version used to build PTLsim

run: runtime environment information

timestamp: time (in POSIX seconds-since-epoch format) this instance of PTLsim was started
hostname: machine name on which PTLsim is running
kernel_version: Linux kernel version PTLsim is running under. For PTLsim/X, this is the domain 0 kernel version
hypervisor_version: PTLsim/X Xen hypervisor version
executable: the executable file being run under simulation (userspace PTLsim only)
args: the arguments to the executable file (userspace PTLsim only)
native_cpuid: CPUID (brand/model/revision) of the host machine running PTLsim
native_hz: core frequency (cycles per second) of the host machine

config: the configuration options last passed to PTLsim for this run

performance: PTLsim internal performance data

rate: operations per wall-clock second (i.e. in the outside world, not inside the virtual machine), averaged over the entire run. These are the status lines PTLsim prints on the console and in the log file as it runs.
- cycles_per_second: simulated cycles completed per second
- issues_per_second: uops issued per second
- user_commits_per_second: x86 instructions committed per second
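
Since the rate values are averages over the entire run, they can be reproduced from the totals and the elapsed wall-clock time. This is a minimal Python sketch of that arithmetic; the function name and the sample numbers are illustrative assumptions, not PTLsim code.

```python
def run_rates(cycles, issues, user_commits, wall_seconds):
    """Average per-second rates over a run lasting wall_seconds of host time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return {
        "cycles_per_second": cycles / wall_seconds,
        "issues_per_second": issues / wall_seconds,
        "user_commits_per_second": user_commits / wall_seconds,
    }


# Invented totals for a run that took 4 seconds of host time.
example_rates = run_rates(
    cycles=2_000_000, issues=3_000_000, user_commits=1_000_000, wall_seconds=4.0
)
```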

Decoder

The decoder toplevel branch represents the x86-to-uop decoder, basic block cache, code page cache and other common structures:

throughput: total decoded entities

basic_blocks: total basic blocks (uop sequence terminated by a branch) decoded
x86_insns: total x86 instructions decoded
uops: total uops produced from all decoded x86 instructions
bytes: total bytes in all decoded x86 instructions
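
The four throughput totals are most useful as ratios. The sketch below derives a few averages from them; it assumes the counters have been read into a plain dictionary, and the helper name and sample values are hypothetical.

```python
def decoder_averages(throughput):
    """Derive per-instruction and per-block averages from decoder totals."""
    insns = throughput["x86_insns"]
    blocks = throughput["basic_blocks"]
    if insns == 0 or blocks == 0:
        raise ValueError("no decoded instructions or basic blocks")
    return {
        "uops_per_insn": throughput["uops"] / insns,
        "bytes_per_insn": throughput["bytes"] / insns,
        "insns_per_basic_block": insns / blocks,
    }


# Invented decoder totals.
sample_throughput = {"basic_blocks": 200, "x86_insns": 1000, "uops": 1500, "bytes": 3500}
decoder_avgs = decoder_averages(sample_throughput)
```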

bb_decode_type: predominant decoder type used for each basic block

all_insns_fast: number of basic blocks in which all instructions were in the simple regular subset of x86 and could be decoded entirely by the fast decoder (decode-fast.cpp)
some_insns_complex: number of basic blocks in which one or more instructions required complex decoding

page_crossings: alignment of instructions within page

within_page: number of basic blocks in which all bytes in the basic block fell within a single page
crosses_page: number of basic blocks in which some bytes crossed a page boundary (i.e. required two MFN invalidate locators)

bbcache: basic block cache accesses

count: basic blocks currently in the cache (i.e. at the time the stats snapshot was made)
inserts: total insert operations
invalidates: invalidation operations by type
- smc: self modifying code required page to be invalidated
- dma: DMA into page with existing translations required page to be invalidated
- spurious: exec_page_fault assist determined the page has now been made executable
- reclaim: garbage collector discarded unused LRU basic blocks
- dirty: page was already dirty when new translation was to be made
- empty: page was empty (has no basic blocks)

pagecache: physical code page cache

count: physical pages currently in the cache (i.e. at the time the stats snapshot was made)
inserts: total physical page insert operations
invalidates: invalidation operations by type
- smc: self modifying code required page to be invalidated
- dma: DMA into page with existing translations required page to be invalidated
- spurious: exec_page_fault assist determined the page has now been made executable
- reclaim: garbage collector discarded unused LRU basic blocks
- dirty: page was already dirty when new translation was to be made
- empty: page was empty (has no basic blocks)

reclaim_rounds: number of times the memory manager attempted to reclaim unused basic blocks (possibly with several attempts until enough memory was available)

Out of Order Core

The out of order core is represented by the ooocore toplevel branch of the statistics data store tree:

cycles: total number of processor cycles simulated

fetch: fetch stage statistics

stop: totals up the reasons why fetching finally stopped in each cycle
- stalled: fetch unit was already stalled in the previous cycle
- icache_miss: an instruction cache miss prevented further fetches
- fetchq_full: the uop fetch queue is full
- bogus_rip: speculative execution redirected the fetch unit to an inaccessible (or non-executable) page. The fetch unit remains stalled in this state until the mis-speculation is resolved.
- microcode_assist: microcode assist must wait for pipeline to empty
- branch_taken: taken branches to non-sequential addresses always stop fetching
- full_width: the maximum fetch width was utilized without encountering any of the events above
opclass: histogram of how many uops of various operation classes passed through the fetch unit. The operation classes are defined in ptlhwdef.h and assigned to various opcodes in ptlhwdef.cpp.
width: histogram of the fetch width actually used on each cycle
blocks: blocks of x86 instructions fetched (typically the processor can read at most e.g. 16 bytes out of a 64 byte instruction cache line per cycle)
uops: total number of uops fetched
user_insns: total number of x86 instructions fetched
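
Histograms such as stop are easier to compare across runs when expressed as fractions of all cycles. A minimal sketch, assuming the histogram has been loaded into a dictionary keyed by the reason names listed above (the counts are invented):

```python
def normalize(histogram):
    """Convert raw histogram counts into fractions of the total."""
    total = sum(histogram.values())
    if total == 0:
        return {key: 0.0 for key in histogram}
    return {key: count / total for key, count in histogram.items()}


# Invented per-cycle fetch stop counts.
fetch_stop = {
    "stalled": 10,
    "icache_miss": 5,
    "fetchq_full": 25,
    "bogus_rip": 0,
    "microcode_assist": 0,
    "branch_taken": 40,
    "full_width": 20,
}
fetch_stop_share = normalize(fetch_stop)
```

The same helper applies to any of the other per-cycle histograms in this section, such as frontend status or issue result.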

frontend: frontend pipeline (decode, allocate, rename) statistics

status: totals up the reasons why frontend processing finally stopped in each cycle
- complete: all uops were successfully allocated and renamed
- fetchq_empty: no more uops were available for allocation
- rob_full: reorder buffer (ROB) was full
- physregs_full: physical register file was full even though an ROB slot was free
- ldq_full: load queue was full (too many loads in the pipeline) even though physical registers were available
- stq_full: store queue was full (too many stores in the pipeline)
width: histogram of the frontend width actually used on each cycle
renamed: summarizes the type of renaming that occurred for each uop (of the destination, not the operands)
- none: uop did not rename its destination (primarily for stores and branches)
- reg: uop renamed destination architectural register
- flags: uop renamed one or more of the ZAPS, CF, OF flag sets but had no destination architectural register
- reg_and_flags: uop renamed one or more of the ZAPS, CF, OF flag sets as well as a destination architectural register
alloc: summarizes the type of resource allocation that occurred for each uop (in addition to its ROB slot):
- reg: uop was allocated a physical register
- ldreg: uop was a load and was allocated both a physical register and a load queue entry
- sfr: uop was a store and was allocated a store forwarding register (SFR), a.k.a. store queue entry
- br: uop was a branch and was allocated branch-related resources (possibly including a destination physical register)

dispatch: dispatch unit statistics

source: totals up where each operand to each uop currently resided at the time the uop was dispatched. These statistics are broken out by cluster.
- waiting: how many operands were waiting (i.e. not yet ready)
- bypass: how many operands would come from the bypass network if the uop were immediately issued
- physreg: how many operands were already written back to physical registers
- archreg: how many operands would be obtained from architectural registers
cluster: tracks the number of uops issued to each cluster (or issue queue) in the processor. This list will vary depending on the processor configuration. The value none means that no cluster could accept the uop because all issue queues were full.
redispatch: statistics on the redispatch speculation recovery mechanism (Section 20.3.2)
- trigger_uops: measures how many uops triggered redispatching because of a mis-speculation. This number does not count towards the statistics below.
- deadlock_flushes: measures how many times the pipeline must be flushed to resolve a deadlock.
- dependent_uops: a histogram of how many uops depended on each trigger uop, not including the trigger uop itself.

issue: issue statistics

result: histogram of the final disposition of issuing each uop
- no-fu: no functional unit was available within the uop's assigned cluster even though it was already issued
- replay: uop attempted to execute but could not complete, so it must remain in the issue queue to be replayed. This event generally occurs when a load or store detects a previously unknown forwarding dependency on a prior store, when the data to actually store is not yet available, or when insufficient resources are available to complete the memory operation. Details are given in Sections 21 and 22.2.
- misspeculation: uop mis-speculated and now all uops after and including the issued uop must be annulled. This generally occurs with loads (Section 21) and stores (Section 22.2.1) when unaligned accesses or load-store aliasing occur. This event is handled in accordance with Section 20.3.2.
- refetch: uop and all subsequent uops must be re-fetched to be decoded differently. For example, unaligned loads and stores take this path so they can be cracked into two parts after being refetched.
- branch_mispredict: uop was a branch and mispredicted, such that all uops after (but not including) the branch uop must be annulled. See Section 20 for details.
- exception: uop caused an exception (though this may not be a user visible error due to speculative execution)
- complete: uop completed successfully. Note that this does not mean the result is immediately ready; for loads it simply means the request was issued to the cache.
source: totals up where each operand to each uop was read from as it was issued
- bypass: how many operands came directly off the bypass network
- physreg: how many operands were read from physical registers
- archreg: how many operands were read from committed architectural registers
width: histogram of the issue width actually used on each cycle in each cluster. This object is further broken down by cluster, since various clusters have different issue width and policies.
opclass: histogram of how many uops of various operation classes were issued. The operation classes are defined in ptlhwdef.h and assigned to various opcodes in ptlhwdef.cpp.

writeback: writeback stage statistics

total_writebacks: total number of results written back to the physical register file
transient: transient versus persistent values
- transient: the result technically does not have to be written back to the physical register file at all, since all consumers sourced the value off the bypass network and the result is no longer available since the destination architectural register pointing to it has since been renamed.
- persistent: all values which do not meet the conditions above and hence must still be written back
width: histogram of the writeback width actually used on each cycle in each cluster. This object is further broken down by cluster, since various clusters have different issue width and policies.
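
A width histogram like the ones above can be reduced to a single average width per cycle. The sketch assumes the histogram maps each width (an integer number of uops) to the number of cycles in which that width was used; this representation and the sample counts are assumptions for illustration.

```python
def mean_width(width_histogram):
    """Average width per cycle from a {width: cycle_count} histogram."""
    cycles = sum(width_histogram.values())
    if cycles == 0:
        return 0.0
    return sum(width * count for width, count in width_histogram.items()) / cycles


# Invented histogram: 50 cycles with width 0, 30 with width 1, and so on.
sample_width = {0: 50, 1: 30, 2: 15, 3: 5}
average_width = mean_width(sample_width)
```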

commit: commit unit statistics

uops: total number of uops committed
insns: total number of complete x86 instructions committed
result: histogram of the final disposition of attempting to commit each uop
- none: one or more uops comprising the x86 instruction at the head of the ROB were not yet ready to commit, so commitment is terminated for that cycle
- ok: result was successfully committed
- exception: result caused a genuine user visible exception. In userspace PTLsim, this will terminate the simulation. In full system PTLsim/X, this is a normal and frequent event. Floating point state dirty faults are counted under this category.
- skipblock: This occurs in rare cases when the processor must skip over the currently executing instruction (such as in pathological cases of the rep x86 instructions).
- barrier: the processor encountered a barrier instruction, such as a system call, assist or pipeline flush. The frontend has already been stopped and fetching has been redirected to the code to handle the barrier; this condition simply commits the barrier instruction itself.
- smc: self modifying code: the instruction attempting to commit has been modified since it was last decoded (see Section 6.4)
- stop: special case for when the simulation is to be stopped after committing a certain number of x86 instructions (e.g. via the -stopinsns option in Section 10.3).
setflags: how many uops updated the condition code flags as they committed
- yes: how many uops updated at least one of the ZAPS, CF, OF flag sets (the REG_flags internal architectural register)
- no: how many uops did not update any flags
freereg: how many uops were able to free the old physical register mapped to their architectural destination register at commit time
- pending: old physical register was still referenced within the pipeline or by one or more rename table entries
- free: old physical register could be immediately freed
free_regs_recycled: how many physical registers were recycled (garbage collected) later than normal because of one of the conditions above
width: histogram of the issue width actually used on each cycle in each cluster. This object is further broken down by cluster, since various clusters have different issue width and policies.
opclass: histogram of how many uops of various operation classes were issued. The operation classes are defined in ptlhwdef.h and assigned to various opcodes in ptlhwdef.cpp.

branchpred: branch predictor statistics

predictions: total number of branch predictions of any type
updates: total number of branch predictor updates of any type
cond: conditional branch (br.cc uop) prediction outcomes, broken down into correct predictions and mispredictions
indir: indirect branch (jmp uop) prediction outcomes, broken down into correct predictions and mispredictions
return: return (jmp uop with BRANCH_HINT_RET flag) prediction outcomes, broken down into correct predictions and mispredictions
summary: summary of all prediction outcomes of the three types above, broken down into correct predictions and mispredictions
ras: return address stack (RAS) operations
- push: RAS pushes on calls
- push_overflows: RAS pushes on calls in which the RAS overflowed
- pop: RAS pops on returns
- pop_underflows: RAS pops on returns in which the RAS was empty
- annuls: annulment operations in which speculative updates to the RAS were rolled back
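
The prediction outcomes translate directly into misprediction rates. In this sketch the leaf names correct and mispred are hypothetical stand-ins for the two outcome counters, and the counts are invented; the summary is rebuilt by adding the three branch types, as described above.

```python
def mispredict_rate(outcomes):
    """Fraction of predictions that were wrong."""
    total = outcomes["correct"] + outcomes["mispred"]
    return outcomes["mispred"] / total if total else 0.0


# Invented outcome counts; "correct" and "mispred" are assumed key names.
branchpred = {
    "cond": {"correct": 900, "mispred": 100},
    "indir": {"correct": 45, "mispred": 5},
    "return": {"correct": 50, "mispred": 0},
}

# The summary is the element-wise sum of the three branch types.
branch_summary = {
    key: sum(branchpred[kind][key] for kind in ("cond", "indir", "return"))
    for key in ("correct", "mispred")
}

mispredict_by_kind = {kind: mispredict_rate(o) for kind, o in branchpred.items()}
overall_mispredict = mispredict_rate(branch_summary)
```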

Cache Subsystem

The cache subsystem is listed under the ooocore/dcache branch.

load: load unit statistics

issue: histogram of the final disposition of issuing each load uop
- complete: cache hit
- miss: L1 cache miss, and possibly lower levels as well (Sections 21.4 and 25.2)
- exception: load generated an exception (typically a page fault), although the exception may still be speculative (Section 21)
- ordering: load was misordered with respect to stores (Section 22.2.1)
- unaligned: load was unaligned and will need to be re-executed as a pair of low and high loads (Sections 5.6 and 21)
- replay: histogram of events in which a load needed to be replayed (Section 21)
  - sfr-addr-and-data-not-ready: load was predicted to forward data from a prior store (Section 22.2.1), but neither the address nor the data of that store has resolved yet
  - sfr-addr-not-ready: load was predicted to forward data from a prior store, but the address of that store has not resolved yet
  - sfr-data-not-ready: load address matched a prior store in the store queue, but the data that store should write has not resolved yet
  - missbuf-full: load missed the cache but the miss buffer and/or LFRQ (Section 25.2) was full at the time
hit: histogram of the cache hierarchy level each load finally hit
- L1: L1 cache hit
- L2: L1 cache miss, L2 cache hit
- L3: L1 and L2 cache miss, L3 cache hit
- mem: all caches missed; value read from main memory
forward: histogram of which sources were used to fill each load
- cache: how many loads obtained all their data from the cache
- sfr: how many loads obtained all their data from a prior store in the pipeline (i.e. load completely overlapped that store)
- sfr-and-cache: how many loads obtained their data from a combination of the cache and a prior store
dependency: histogram of how loads related to previous stores
- independent: load was independent of any store currently in the pipeline
- predicted-alias-unresolved: load was stalled because the load store alias predictor (LSAP) predicted that an earlier store would overlap the load's address even though that earlier store's address was unresolved (Section 22.2.1)
- stq-address-match:load depended on an earlier store still found in the store queue
type: histogram of the type of each load uop
- aligned: normal aligned loads
- unaligned: special unaligned load uops ld.lo or ld.hi (Section 5.6)
- internal: loads from PTLsim space by microcode
size: histogram of the size in bytes of each load uop
transfer-L2-to-L1: histogram of the types of L2 to L1 line transfers that occurred (Section 25)
- full-L2-to-L1: all bytes in cache line were transferred from L2 to L1 cache
- partial-L2-to-L1: some bytes in the L1 line were already valid (because of stores to those bytes), but the remaining bytes still need to be fetched
- L2-to-L1I: all bytes in the L2 line were transferred into the L1 instruction cache
dtlb: data cache translation lookaside buffer hit versus miss rate (Section 25.4)
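
Because the hit histogram records the level at which each load finally hit, per-level (local) miss rates can be derived from it. The following sketch assumes the histogram is available as a dictionary with the four keys listed above; the counts are invented.

```python
LEVELS = ("L1", "L2", "L3", "mem")


def local_miss_rates(hit):
    """For each cache level, the fraction of accesses reaching it that miss."""
    reaching = sum(hit[level] for level in LEVELS)
    result = {}
    for level in LEVELS[:-1]:  # main memory is the last resort and always hits
        misses = reaching - hit[level]
        result[level] = misses / reaching if reaching else 0.0
        reaching = misses  # only the misses proceed to the next level
    return result


# Invented load hit histogram.
load_hit = {"L1": 900, "L2": 70, "L3": 20, "mem": 10}
load_miss_rates = local_miss_rates(load_hit)
```

With the invented counts, 100 of 1000 loads miss L1, and 30 of those 100 also miss L2.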

fetch: instruction fetch unit statistics (Section 17.1)

hit: histogram of the cache hierarchy level each fetch finally hit
- L1: L1 cache hit
- L2: L1 cache miss, L2 cache hit
- L3: L1 and L2 cache miss, L3 cache hit
- mem: all caches missed; value read from main memory
itlb: instruction cache translation lookaside buffer hit versus miss rate (Section 25.4)

prefetches: prefetch engine statistics

in-L1: requested data already in L1 cache
in-L2: requested data already in L2 cache (and possibly also in L1 cache)
required: prefetch was actually required (data was not cached or was in L3 or lower levels)

missbuf: miss buffer performance (Sections 25.2 and 25.3)

inserts: total number of lines inserted into the miss buffer
delivers: total number of lines delivered to various cache hierarchy levels from the miss buffer
- mem-to-L3: deliver line from main memory to the L3 cache
- L3-to-L2: deliver line from the L3 cache to the L2 cache
- L2-to-L1D: deliver line from the L2 cache to the L1 data cache
- L2-to-L1I: deliver line from the L2 cache to the L1 instruction cache

lfrq: load fill request queue (LFRQ) performance (Sections 25.2 and 25.3)

inserts: total number of loads inserted into the LFRQ
wakeups: total number of loads awakened from the LFRQ
annuls: total number of loads annulled in the LFRQ (after they were annulled in the processor core)
resets: total number of LFRQ resets (all entries cleared)
total-latency: total latency in cycles of all loads passing through the LFRQ
average-miss-latency: average load latency, weighted by cache level hit and latency to that level
width: histogram of how many loads were awakened per cycle by the LFRQ
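
One plausible reading of a latency "weighted by cache level hit and latency to that level" is a hit-count-weighted average. The sketch below illustrates only that reading and is not taken from the PTLsim source; the per-level latencies and hit counts are invented assumptions.

```python
def weighted_latency(hits, latency_cycles):
    """Average latency, weighting each level's latency by its hit count."""
    total = sum(hits.values())
    if total == 0:
        return 0.0
    return sum(hits[level] * latency_cycles[level] for level in hits) / total


# Invented hit counts and assumed latencies (in cycles) for each level.
sample_hits = {"L1": 900, "L2": 70, "L3": 20, "mem": 10}
assumed_latency = {"L1": 2, "L2": 6, "L3": 16, "mem": 140}
avg_latency = weighted_latency(sample_hits, assumed_latency)
```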

store: store unit statistics

issue: histogram of the final disposition of issuing each store uop
- complete: store completed without problems
- exception: store generated an exception (typically a page fault), although the exception may still be speculative (Section 22.1)
- ordering: store detected that a later load in program order aliased the store but was issued earlier than the store (Section 22.2.1)
- unaligned: store was unaligned and will need to be re-executed as a pair of low and high stores (Section 5.6)
- replay: histogram of events in which a store needed to be replayed (Sections 22.2 and 22.1)
  - wait-sfraddr-sfrdata: neither the address nor the data of a prior store this store inherits some of its data from was ready
  - wait-sfraddr: the data of a prior store was ready but its address was still unavailable
  - wait-sfrdata: the address of a prior store was ready but its data was still unavailable
  - wait-storedata-sfraddr-sfrdata: the actual data value to store was not ready (Section 22.2), in addition to having neither the data nor the address of a prior store (Section 22.1)
  - wait-storedata-sfraddr: the actual data value to store was not ready (Section 22.2), in addition to not having the address of the prior store (Section 22.1)
  - wait-storedata-sfrdata: the actual data value to store was not ready (Section 22.2), in addition to not having the data from the prior store (Section 22.1)
forward: histogram of which sources were used to construct the merged store buffer:
- zero: no prior store overlapping the current store was found in the pipeline
- sfr: data from a prior store in the pipeline was merged with the value to be stored to form the final store buffer
type: histogram of the type of each store uop
- aligned: normal aligned store
- unaligned: special unaligned store uops st.lo or st.hi (Section 5.6)
- internal: stores to PTLsim space by microcode
size: histogram of the size in bytes of each store uop
commit: histogram of how stores are committed
- direct: store committed directly to the data cache in the commit stage (Section 24)
commits: total number of committed uops
usercommits: total number of committed x86 instructions
issues: total number of uops issued. This includes uops issued more than once through replay (Section 19.3).
ipc: Instructions Per Cycle (IPC) statistics
- commit-in-uops: average number of uops committed per cycle
- issue-in-uops: average number of uops issued per cycle
- commit-in-user-insns: average number of x86 instructions committed per cycle
  
  NOTE: Because one x86 instruction may be broken up into numerous uops, it is never appropriate to compare IPC figures for committed x86 instructions per clock with IPC values from a RISC machine. Furthermore, different x86 implementations use varying numbers of uops per x86 instruction as a matter of encoding, so even comparing the uop based IPC between x86 implementations or RISC-like machines is inaccurate. Users are strongly advised to use relative performance measures instead (e.g. total cycles taken to complete a given benchmark).
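
The three IPC values are averages per cycle, so they can be cross-checked against the raw totals. A minimal sketch, assuming each value is simply the corresponding total divided by the number of simulated cycles; the function name and sample numbers are illustrative.

```python
def ipc_stats(cycles, commits, usercommits, issues):
    """Per-cycle averages computed from the raw commit and issue totals."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return {
        "commit-in-uops": commits / cycles,
        "issue-in-uops": issues / cycles,
        "commit-in-user-insns": usercommits / cycles,
    }


# Invented totals: 1500 uops and 1000 x86 instructions committed in 1000 cycles.
sample_ipc = ipc_stats(cycles=1000, commits=1500, usercommits=1000, issues=1800)
```

In keeping with the note above, such figures are best compared between runs of the same benchmark on differently configured PTLsim cores, not against other architectures.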

simulator: describes the performance of PTLsim itself. Useful for tuning the simulator.

total_time: total time in seconds (not simulated cycles!) spent in various parts of the simulator. Please refer to the source code (in ooocore.cpp) for the range of code each time value corresponds to.
cputime: PTLsim simulator performance
- fetch: seconds spent in fetch stage
- decode: seconds spent decoding instructions (in decoder subsystem)
- rename: seconds spent in allocate and rename stage
- frontend: seconds spent in frontend stages
- dispatch: seconds spent in dispatch stage
- issue: seconds spent in ALU issue stage, not including loads and stores
- issueload: seconds spent issuing loads
- issuestore: seconds spent issuing stores
- complete: seconds spent in completion stage
- transfer: seconds spent in transfer stage
- writeback: seconds spent in writeback stage
- commit: seconds spent in commit stage

External Events

assists: histogram of microcode assists invoked from any core
traps: histogram of x86 interrupt vectors (traps) invoked from any core (PTLsim/X only)


Matt T Yourst 2007-09-26