A Comparative Analysis of Branch Prediction Schemes

Zhendong Su and Min Zhou

Computer Science Division
University of California at Berkeley
Berkeley, CA 94720


Design Methodology

First, we briefly describe the experimental system and the benchmark programs we use. Second, we describe the schemes that we implemented and tested. Third, we discuss how we chose the buffer organization and related parameters. Finally, we discuss issues regarding the number of instructions traced.

1. Experimental System

Our experiments were conducted on a SPARC system 10 with two SuperSparc/60 V8 microprocessors. We compiled the C benchmark programs with SUNSoft "cc" version 3.0.1 and the Fortran programs with SUNSoft "f77" version 3.0.1.

Our data were obtained with analyzer programs built on "Shade" [SHADE] version 5.15. Shade is a dynamic code tracer that links instruction set simulation and trace generation with custom trace analysis. The first advantage of a Shade-based simulator over static trace-based simulators, such as pixie-based approaches, is its speed: Shade runs fast mainly because the executable being traced, the Shade trace generator, and the Shade analyzer all run in a single process. The second advantage is that it combines trace generation with the trace-based analyzer/simulator, thereby avoiding awkward trace file manipulation.

The following figure illustrates the code structure of our shade-based branch prediction simulators.

Figure 1. Shade-Based Branch Prediction Scheme Simulator

The main function is analyze(), which is invoked for each traced instruction specified in shade_main(). Using the program counter (pc), the effective address (ea), and the taken/untaken information provided for each instruction, we implemented the different branch prediction scheme simulators. In analyze(), we also implemented a simple profiler and an execution controller. The profiler gathers information about the program's branch behavior. The execution controller controls the dynamic execution size and also determines when to flush the branch history tables when simulating context switches.
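The structure described above can be sketched as follows. This is a minimal illustration in Python, not the Shade analyzer used in this study: the record fields (pc, ea, taken), the class name BranchAnalyzer, the helper run(), and the parameters max_branches and flush_interval are all hypothetical, and the predictor is a stand-in table of 2-bit saturating counters.

```python
from collections import namedtuple

# Hypothetical trace record: the fields mirror the pc / ea / taken
# information described in the text, not Shade's actual trace structure.
BranchRecord = namedtuple("BranchRecord", ["pc", "ea", "taken"])


class BranchAnalyzer:
    """Per-branch analysis: predictor simulation, profiling, execution control."""

    def __init__(self, table_bits=10, max_branches=None, flush_interval=None):
        self.mask = (1 << table_bits) - 1
        self.table = [1] * (1 << table_bits)   # 2-bit counters, start weakly not-taken
        self.max_branches = max_branches       # execution controller: stop after N branches
        self.flush_interval = flush_interval   # simulated context-switch period (in branches)
        self.branches = 0
        self.correct = 0
        self.taken = 0
        self.static_pcs = set()                # profiler: distinct static branches seen

    def analyze(self, rec):
        """Process one conditional branch; return False once tracing should stop."""
        if self.max_branches is not None and self.branches >= self.max_branches:
            return False
        # Execution controller: flush the table to mimic a context switch.
        if self.flush_interval and self.branches and self.branches % self.flush_interval == 0:
            self.table = [1] * len(self.table)
        idx = (rec.pc >> 2) & self.mask        # drop the 2 low bits of a word-aligned pc
        predicted_taken = self.table[idx] >= 2
        self.correct += predicted_taken == bool(rec.taken)
        # Move the saturating counter toward the actual outcome.
        if rec.taken:
            self.table[idx] = min(3, self.table[idx] + 1)
        else:
            self.table[idx] = max(0, self.table[idx] - 1)
        # Profiler bookkeeping.
        self.branches += 1
        self.taken += bool(rec.taken)
        self.static_pcs.add(rec.pc)
        return True


def run(trace, **kwargs):
    """Feed every record of `trace` to a fresh analyzer and return it."""
    analyzer = BranchAnalyzer(**kwargs)
    for rec in trace:
        if not analyzer.analyze(rec):
            break
    return analyzer
```

Keeping the predictor, the profiler, and the execution controller inside one per-record callback means a single pass over the trace yields accuracy, taken percentage, and static branch counts together.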

2. Benchmark Programs

Conditional branches behave differently across applications and programming languages. It is therefore important to have a set of benchmark programs that gives a good approximation of the average workload and complexity of the programs that users run. Previous work in this area has been done by tracing the execution of benchmark programs. In this project, we also use instruction tracing data to measure the performance of different branch prediction schemes. Eight benchmark programs from the beta version of the SPEC95 integer program suite and thirteen benchmarks from the SPEC92 floating point program suite are used in this study. Tables 1 and 2 list the benchmark programs, the abbreviations we use for them, and the input data sets used in our experiments.

SPEC95 Integer Program Beta
Benchmark / Input             Dynamic Inst.  Dynamic Cond. Branch  Program / Input size  Static Cond. Branch
gcc / 1amptjp.i               1297M          221M                  1697K / 222K          19598
gcc / 1c-decl-s.i             1297M          221M                  1697K / 222K          19603
gcc / 1dbxout.i               1664M          28M                   1697K / 42K           15455
gcc / 1reload1.i              992M           173M                  1697K / 148K          19673
gcc / cccp.i                  1298M          223M                  1697K / 162K          19514
gcc / insn-emit.i             147M           23M                   1697K / 48K           10815
gcc / stmt-protoize.i         986M           165M                  1697K / 185K          19746
ghost / convolution.ps-color  1400M          238M                  584K / 218K           4262
ghost / convolution.ps-mono   1342M          229M                  584K / 218K           4312
ghost / convolution.ps-tiff   1315M          222M                  584K / 218K           4330
go / restart.in*              4535M          500M                  390K /                5761
go / neardone.in              733M           78M                   390K /                4874
go / null.in*                 4531M          500M                  390K /                5742
m88ksim / dcrand.in*          7007M          1000M                 389K / 66K            824
numi / numi.in*               7507M          1000M                 31K /                 1064
li / li-input.lsp*            12355M         1000M                 299K /                1412
perl / jumble.perl*           3471M          500M                  400K /                2523
perl / primes.perl*           4762M          500M                  400K /                2218
vortex / vortex.in*           7195M          1000M                 867K /                7602

Table 1. SPEC95 Integer Program Beta and Input Data Description.

(All programs are listed in alphabetical order. Entries with * denote programs that were interrupted in the middle of tracing. The number of static conditional branches is the number of different static conditional branches traced. Only numi is a Fortran program; all the other programs are written in C. The input data sizes of gcc, ghost and m88ksim are related to the tracing sizes.)

SPEC92 Floating Point Program
Benchmark  Dynamic Inst.  Dynamic Cond. Branch  Program size  Static Cond. Branch
alvinn     6792M          480M                  9612          1032
doduc      1644M          87M                   247K          2330
ear        14506M         705M                  59K           1238
fpppp      8463M          106M                  138K          1332
hydro2d    6627M          680M                  111K          2356
mdljdp2    4206M          309M                  79K           1458
mdljsp2    3011M          338M                  98K           1520
nasa7      11104M         217M                  91K           1889
ora        2029M          158M                  24K           1153
su2cor     8055M          165M                  150K          1863
swm256     9862M          66M                   62K           1335
tomcatv    1261M          31M                   21K           1036
wave5      4331M          286M                  401K          1956

Table 2. SPEC92 Floating Point Program Description.

(All programs are listed in alphabetical order. The number of static conditional branches is the number of different static conditional branches traced. Alvinn and ear are C programs; all the other programs are written in Fortran.)

3. Branch Prediction Scheme Design

Figure 2, 3. Bimodal Predictor | Correlation Based Predictor

Figure 4, 5. Index Sharing Predictor | Selective Predictor

4. Branch Prediction Scheme Tuning

Before gathering data for all the benchmark programs, we ran a small set of programs to find the best parameters for each scheme. We tested different 2-bit counter designs, different correlation depths for the gselect and local schemes, and different numbers of global history bits for gshare.
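As an illustration of this kind of parameter sweep, the sketch below simulates a gshare-style predictor in Python with a configurable number of global history bits. It follows the commonly described gshare organization, in which the branch address is XORed with the global history to index a table of 2-bit saturating counters. The function names, the table size, the candidate values, and the synthetic trace are assumptions made for this example, not the configuration used in this study.

```python
def gshare_accuracy(trace, index_bits=12, history_bits=8):
    """Return the prediction accuracy of a gshare-style predictor on `trace`.

    `trace` is an iterable of (pc, taken) pairs. The last `history_bits`
    branch outcomes are XORed into the low bits of the table index.
    """
    mask = (1 << index_bits) - 1
    hist_mask = (1 << history_bits) - 1
    counters = [1] * (1 << index_bits)   # 2-bit saturating counters, weakly not-taken
    history = 0
    correct = total = 0
    for pc, taken in trace:
        idx = ((pc >> 2) ^ history) & mask
        correct += (counters[idx] >= 2) == bool(taken)
        if taken:
            counters[idx] = min(3, counters[idx] + 1)
        else:
            counters[idx] = max(0, counters[idx] - 1)
        # Shift the newest outcome into the global history register.
        history = ((history << 1) | int(bool(taken))) & hist_mask
        total += 1
    return correct / total if total else 0.0


def sweep_history_bits(trace, index_bits=12, candidates=(0, 2, 4, 8)):
    """Map each candidate history length to the accuracy it achieves."""
    trace = list(trace)
    return {h: gshare_accuracy(trace, index_bits, h) for h in candidates}


# Synthetic example: one branch that strictly alternates taken / not taken.
alternating = [(0x4000, i % 2 == 0) for i in range(1000)]
results = sweep_history_bits(alternating)
```

On this synthetic trace, zero history bits reduce the predictor to a plain 2-bit counter, which the alternating pattern defeats on every branch, while a few history bits make the same branch almost perfectly predictable. Real benchmark traces show far less extreme differences.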

Figure 6, 7. 2-bit Counter State Diagram Design | Correlation Depth vs. Prediction Accuracy

Figure 8, 9. Local Scheme Correlation Depth vs. Prediction Accuracy | Gshare Scheme Global Branch History Bit Adoption vs. Prediction Accuracy

5. The Number of Branch Instructions Being Traced

Given the fast running speed of Shade, we can explore more schemes and run more programs. More importantly, we can trace many more instructions for each program. In our study, we traced most benchmark programs to the end. A typical benchmark program executes more than several hundred million instructions, and this number is growing. Tracing so many instructions may be beyond the tracing and simulation capability of trace-file-based approaches, most of which can feasibly trace and simulate at most several tens of millions of instructions. Since the branch behavior of some programs does not show until the program has run to its middle, tracing several hundred million to several billion instructions per program is expected to provide more reliable results and more complete information about these benchmark programs.

Figure 10, 11. Prediction Accuracies vs. Branch Instructions Traced | Branch Taken Percentage vs. Branch Instructions Traced

The above two figures show how the branch behavior and the performance of the different schemes vary over the course of tracing the SPEC program gcc/cccp.i. The difference between the lowest point in the range of 1M~10M branches and the highest point in the range of 10M~100M branches is 10% for the taken percentage and 5% for the accuracy.
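Curves of this kind can be produced by sampling running statistics at chosen points in the trace. The sketch below, in Python, records the cumulative taken percentage at a list of branch-count checkpoints. The function name, the (pc, taken) trace format, and the checkpoint values are assumptions for illustration, and the choice of cumulative rather than per-interval values is also an assumption, since the text does not specify which the figures plot.

```python
def taken_percentage_at(trace, checkpoints):
    """Cumulative taken percentage after each checkpoint's number of branches.

    `trace` is an iterable of (pc, taken) pairs; `checkpoints` is a list of
    positive branch counts at which a sample is recorded.
    """
    pending = sorted(c for c in set(checkpoints) if c > 0)
    samples = {}
    taken = total = 0
    for _pc, was_taken in trace:
        if not pending:
            break                       # all requested checkpoints are recorded
        total += 1
        taken += bool(was_taken)
        if total == pending[0]:
            samples[total] = 100.0 * taken / total
            pending.pop(0)
    return samples


# Synthetic trace: the first 10 branches are all taken, the next 90 never are.
trace = [(0x4000, i < 10) for i in range(100)]
samples = taken_percentage_at(trace, [10, 100])
```

In this synthetic trace the taken percentage falls from 100% after 10 branches to 10% after 100, an exaggerated version of the drift that makes short traces unreliable.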

