1. Experimental System
Our experiments were conducted on a SPARCstation 10 with two SuperSPARC/60 V8 microprocessors. We compiled the C benchmark programs using SunSoft "cc" version 3.0.1, and the Fortran programs using SunSoft "f77" version 3.0.1.
Our data is obtained by analyzing programs with "Shade" [SHADE] version 5.15. Shade is a dynamic code tracer: it links instruction-set simulation and trace generation with custom trace analysis. The first advantage of a Shade-based simulator over static trace-based simulators, such as the pixie-based approach, is its fast running speed. Shade tends to run fast mainly because the executable being traced, the Shade trace generator, and the Shade analyzer all run in a single process. The second advantage is that it combines trace generation with the trace-based analyzer/simulator, thereby avoiding awkward trace file manipulations.
The following figure illustrates the code structure of our Shade-based branch prediction simulators.
Figure 1. Shade-Based Branch Prediction Scheme Simulator
2. Benchmark Programs
Conditional branches behave differently across applications and programming languages. It is therefore important to have a set of benchmark programs that gives a good approximation of the average workload and complexity of the programs users actually run. Previous work in this area traced the execution of benchmark programs, and in this project we likewise use instruction-trace data to measure the performance of different branch prediction schemes. Eight benchmark programs from the beta version of the SPEC95 integer suite and thirteen benchmarks from the SPEC92 floating point suite are used in this study. Tables 1 and 2 list the benchmark programs, the abbreviations we use for them, and the test input data sets used in our experiments.
| Benchmark / Input | Dynamic Inst. | Dynamic Cond. Branch | Program / Input Size | Static Cond. Branch |
|---|---|---|---|---|
| gcc / 1amptjp.i | 1297M | 221M | 1697K / 222K | 19598 |
| gcc / 1c-decl-s.i | 1297M | 221M | 1697K / 222K | 19603 |
| gcc / 1dbxout.i | 1664M | 28M | 1697K / 42K | 15455 |
| gcc / 1reload1.i | 992M | 173M | 1697K / 148K | 19673 |
| gcc / cccp.i | 1298M | 223M | 1697K / 162K | 19514 |
| gcc / insn-emit.i | 147M | 23M | 1697K / 48K | 10815 |
| gcc / stmt-protoize.i | 986M | 165M | 1697K / 185K | 19746 |
| ghost / convolution.ps-color | 1400M | 238M | 584K / 218K | 4262 |
| ghost / convolution.ps-mono | 1342M | 229M | 584K / 218K | 4312 |
| ghost / convolution.ps-tiff | 1315M | 222M | 584K / 218K | 4330 |
| go / restart.in* | 4535M | 500M | 390K / | 5761 |
| go / neardone.in | 733M | 78M | 390K / | 4874 |
| go / null.in* | 4531M | 500M | 390K / | 5742 |
| m88ksim / dcrand.in* | 7007M | 1000M | 389K / 66K | 824 |
| numi / numi.in* | 7507M | 1000M | 31K / | 1064 |
| li / li-input.lsp* | 12355M | 1000M | 299K / | 1412 |
| perl / jumble.perl* | 3471M | 500M | 400K / | 2523 |
| perl / primes.perl* | 4762M | 500M | 400K / | 2218 |
| vortex / vortex.in* | 7195M | 1000M | 867K / | 7602 |
Table 1. SPEC95 Integer Program Beta and Input Data Description.
| Benchmark | Dynamic Inst. | Dynamic Cond. Branch | Program Size | Static Cond. Branch |
|---|---|---|---|---|
| alvinn | 6792M | 480M | 9612 | 1032 |
| doduc | 1644M | 87M | 247K | 2330 |
| ear | 14506M | 705M | 59K | 1238 |
| fpppp | 8463M | 106M | 138K | 1332 |
| hydro2d | 6627M | 680M | 111K | 2356 |
| mdljdp2 | 4206M | 309M | 79K | 1458 |
| mdljsp2 | 3011M | 338M | 98K | 1520 |
| nasa7 | 11104M | 217M | 91K | 1889 |
| ora | 2029M | 158M | 24K | 1153 |
| su2cor | 8055M | 165M | 150K | 1863 |
| swm256 | 9862M | 66M | 62K | 1335 |
| tomcatv | 1261M | 31M | 21K | 1036 |
| wave5 | 4331M | 286M | 401K | 1956 |
Table 2. SPEC92 Floating Point Program Description.
3. Branch Prediction Scheme Design
Branch prediction schemes using small buffers of branch history take advantage of the repetitive taken/not-taken behavior of branch execution, thereby achieving better prediction accuracy than simple static prediction schemes. For each conditional branch, an appropriate counter is incremented or decremented, and the most significant bit of the counter determines the prediction. J. Smith [S81] observed that a 2-bit counter empirically provides an appropriate amount of damping to changes in branch direction: a 1-bit counter simply records the direction of the last executed branch, while 3-bit or wider counters do not appear to offer large cost/benefit advantages over 2-bit counters. We further discuss the design of the 2-bit counter in a later subsection on predictor tuning.
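As a concrete illustration, the saturating up-down behavior of such a counter can be sketched in a few lines (a minimal Python sketch; the function names are ours, not from the simulator):

```python
# 2-bit saturating up-down counter (the "automaton2" style discussed later).
# States 0-1 predict not-taken, states 2-3 predict taken: the most
# significant bit of the counter is the prediction.

def predict(counter):
    """Prediction is the most significant bit of the 2-bit counter."""
    return counter >= 2  # True means predict taken

def update(counter, taken):
    """Saturating increment on a taken branch, decrement on not-taken."""
    if taken:
        return min(counter + 1, 3)
    return max(counter - 1, 0)
```

Starting from the strongly-taken state (3), a single not-taken outcome still leaves the prediction at taken, which is the damping effect Smith observed.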
Bimodal branch prediction is the simplest 2-bit-counter-based dynamic prediction scheme. The branch history table is indexed by the low-order address bits of the program counter. The following figure illustrates the design of the bimodal prediction scheme.
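A minimal sketch of the bimodal scheme, assuming a table of 2-bit counters indexed by the low-order pc bits (class and parameter names are illustrative, not from our simulator):

```python
class Bimodal:
    """Table of 2-bit counters indexed by low-order pc bits.

    Real hardware would typically drop the instruction-alignment bits of
    the pc before indexing; that detail is omitted here for simplicity.
    """

    def __init__(self, index_bits):
        self.mask = (1 << index_bits) - 1
        # Initial state is a design choice; weakly taken (2) is used here.
        self.table = [2] * (1 << index_bits)

    def predict(self, pc):
        return self.table[pc & self.mask] >= 2

    def update(self, pc, taken):
        i = pc & self.mask
        c = self.table[i]
        self.table[i] = min(c + 1, 3) if taken else max(c - 1, 0)
```

Because the table is indexed only by the pc, each counter tracks one branch's own bias, which is where the "bimodal" name comes from.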
Correlated branch prediction schemes include common-correlation, gselect, global, and local. Since the bimodal scheme takes advantage of the bimodal distribution of branch behavior, it does not perform well when branches have strongly dynamic behavior. Correlated prediction schemes are designed to exploit relationships between different branch instructions -- repetitive branch patterns across several consecutive branches. A correlation-based predictor uses two branch history tables. The first table records the history of recent branches -- the global history -- with each entry implemented as a shift register. The second table records the branch history of each individual branch; it is organized as a matrix of rows and columns whose entries are 2-bit counters. The pc determines which shift register in the first table and which row of 2-bit counters in the second table should be used; the chosen shift register then indexes the appropriate counter within the selected row. The prediction is made from the selected counter, and the selected shift register and 2-bit counter are updated accordingly afterwards. Figure 3 above illustrates the design of correlated schemes.
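The two-table organization above can be sketched as follows (a simplified Python rendering; the parameters i, j, and k follow the notation of Table 3, while the class and method names are ours):

```python
class Correlated:
    """Two-level correlated predictor sketch.

    i: pc bits indexing the rows of the counter table (second table)
    j: pc bits indexing the shift-register table (first table)
    k: width in bits of each history shift register
    """

    def __init__(self, i, j, k):
        self.i, self.j, self.k = i, j, k
        self.hist = [0] * (1 << j)  # first table: k-bit shift registers
        # second table: 2^i rows, each with 2^k two-bit counters
        self.counters = [[2] * (1 << k) for _ in range(1 << i)]

    def _select(self, pc):
        row = self.counters[pc & ((1 << self.i) - 1)]
        reg = pc & ((1 << self.j) - 1)
        return row, reg

    def predict(self, pc):
        row, reg = self._select(pc)
        return row[self.hist[reg]] >= 2

    def update(self, pc, taken):
        row, reg = self._select(pc)
        h = self.hist[reg]
        row[h] = min(row[h] + 1, 3) if taken else max(row[h] - 1, 0)
        # Shift the branch outcome into the selected history register.
        self.hist[reg] = ((h << 1) | int(taken)) & ((1 << self.k) - 1)
```

With many shift registers (large j) this becomes the local scheme; with a single row of counters it degenerates to the global scheme. An alternating taken/not-taken branch, hopeless for bimodal, becomes predictable once the history registers warm up.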
There are many ways of using the pc to index the first and second tables. Yeh and Patt [YP93] classified these methods into per_address, which uses the low-order bits of the pc, and per_set, which uses high or middle-range bits of the pc. They claimed that the per_address and per_set methods have similar performance, with the latter having higher implementation cost. We use per_address in our study.
The well-known common-correlation scheme is a correlated scheme that uses a single 2-bit shift register as the global branch history table and four 2-bit counters for each row of the second table. The 2-bit shift register only exploits the correlation between two consecutive branches. A similar correlated design uses j > 2 bits for the global branch history register and 2^j 2-bit counters for each row of the second table; we adopt McFarling's name and refer to it as the gselect scheme. When the second table has just a single row, the scheme is referred to as the global scheme. The global scheme spends its entire buffer on recording correlation information while ignoring the individual behavior of each branch, so in most cases it does not perform as well as other correlated schemes.
A more complicated design uses multiple shift registers in the first table, each recording the branch history of a different branch. McFarling referred to this as the local scheme.
The shared-index scheme, proposed by McFarling, is referred to as gshare. It is similar to the bimodal scheme, except that it XORs the global history shift register with the low-order bits of the pc before indexing the counter table. Figure 4 illustrates the design of the gshare scheme.
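A minimal gshare sketch, following the description above (the Python rendering and names are illustrative):

```python
class Gshare:
    """Gshare sketch: global history XORed with pc bits indexes one
    table of 2-bit counters."""

    def __init__(self, index_bits, hist_bits):
        self.mask = (1 << index_bits) - 1
        self.hmask = (1 << hist_bits) - 1
        self.hist = 0                         # single global shift register
        self.table = [2] * (1 << index_bits)  # 2-bit counters

    def _index(self, pc):
        # The XOR spreads (pc, history) pairs over the whole table.
        return (pc ^ self.hist) & self.mask

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        c = self.table[i]
        self.table[i] = min(c + 1, 3) if taken else max(c - 1, 0)
        self.hist = ((self.hist << 1) | int(taken)) & self.hmask
```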
Different dynamic schemes use different branch history information, and many schemes work well on one type of program but poorly on another. The selective scheme uses two different predictors, each of which makes its prediction independently. A third table tracks the performance of the two sub-predictors and arbitrates which prediction is used as the final prediction. The selective scheme can therefore perform well on different types of programs. Figure 5 illustrates the design of the selective prediction scheme.
The implementation cost of the selective prediction scheme is roughly three times that of the other prediction schemes, because two predictors and one selector are used.
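The arbitration logic can be sketched as follows. The two trivial sub-predictors here (always-taken and last-outcome) are placeholders chosen to keep the sketch short; our study actually pairs a bimodal and a gselect predictor, as discussed in the tuning section.

```python
class AlwaysTaken:
    """Placeholder sub-predictor: static predict-taken."""
    def predict(self, pc): return True
    def update(self, pc, taken): pass

class LastOutcome:
    """Placeholder sub-predictor: repeat each branch's last outcome."""
    def __init__(self): self.last = {}
    def predict(self, pc): return self.last.get(pc, True)
    def update(self, pc, taken): self.last[pc] = taken

class Selective:
    """Selector table of 2-bit counters arbitrates between p0 and p1."""

    def __init__(self, p0, p1, index_bits):
        self.p0, self.p1 = p0, p1
        self.mask = (1 << index_bits) - 1
        self.sel = [1] * (1 << index_bits)  # < 2 favors p0, >= 2 favors p1

    def predict(self, pc):
        use_p1 = self.sel[pc & self.mask] >= 2
        return self.p1.predict(pc) if use_p1 else self.p0.predict(pc)

    def update(self, pc, taken):
        good0 = self.p0.predict(pc) == taken
        good1 = self.p1.predict(pc) == taken
        i = pc & self.mask
        if good0 != good1:  # nudge selector toward the correct sub-predictor
            c = self.sel[i]
            self.sel[i] = min(c + 1, 3) if good1 else max(c - 1, 0)
        self.p0.update(pc, taken)
        self.p1.update(pc, taken)
```

Note that the selector only moves when the sub-predictors disagree, so a temporarily wrong streak shared by both does not disturb the arbitration.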
The implementation costs of the different schemes are shown in the following table. In the table, i is the number of pc bits used to index the counter table rows, j is the number of pc bits used to index the shift register table, and k is the width in bits of each shift register.
| Scheme Name | i | j | k | Buffer Size |
|---|---|---|---|---|
| bimodal | variable | n/a | n/a | 2*2^i |
| correlation | variable | 1 | 2 | 2 + 2*4*2^i |
| gselect | variable | 1 | variable | k + 2*2^(i+k) |
| global | 1 | 1 | variable | k + 2*2^k |
| local | variable | variable | variable | k*2^j + 2*2^(i+k) |
| gshare | variable | n/a | variable | k + 2*2^i |
Table 3. Dynamic Predictor Implementation Cost
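The buffer-size formulas in Table 3 are easy to check numerically; for example, a bimodal predictor with i = 12 index bits uses 2·2^12 = 8192 bits, matching the 8K-bit budget used in our comparison. A small helper (the function names and sample parameter values are ours):

```python
# Predictor buffer sizes in bits, per the formulas in Table 3.

def bimodal_bits(i):
    return 2 * 2**i            # 2^i two-bit counters

def gselect_bits(i, k):
    return k + 2 * 2**(i + k)  # one k-bit register + counter matrix

def local_bits(i, j, k):
    return k * 2**j + 2 * 2**(i + k)  # 2^j registers + counter matrix

def gshare_bits(i, k):
    return k + 2 * 2**i        # one k-bit register + 2^i counters
```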
Before gathering data for all the benchmark programs, we used a small set of programs to determine the best parameters for each scheme. We tested different 2-bit counter designs, different correlation depths for gselect and local, and different numbers of global history bits for gshare.
For a fair comparison, we chose an 8K-bit branch prediction buffer, i.e. 4K 2-bit counter entries, for all of the prediction schemes. We discuss the effect of buffer size on prediction accuracy further in the result analysis section.
There are many variations in the design of the 2-bit counter state transition automaton. Figure 6 shows four common automaton designs; the two best known are automaton1 and automaton2. Assuming that automaton1, discussed in Patterson and Hennessy's book, has better performance, we used automaton1 first. However, according to our experimental results, automaton2-based schemes produced about 0.5% better prediction accuracy than those based on automaton1 when an 8K-bit buffer was used. Used by many prediction schemes, automaton2 is also referred to as a saturating up-down counter, and we chose automaton2-based schemes for our final comparison analysis. Automaton3 and automaton4 are similar to automaton2, but their state transitions favor the branch-taken direction more strongly.
Figure 6. 2-Bit Counter State Diagram Designs.
Figure 7. Correlation Depth vs. Prediction Accuracy.
From Figure 7 above, 5-6 global bits is the best choice when the branch prediction buffer is 8K bits. We use 5 bits in our comparison analysis: the tuning here used gcc, and we expect that fewer global bits should be used for floating point programs.
From Figure 8, we observe that the best choice is to use 3 global bits when the buffer is 8K bits.
Figure 8. Local Scheme Correlation Depth vs. Prediction Accuracy.
Figure 9. Gshare Scheme Global Branch History Bits vs. Prediction Accuracy.
From Figure 9 above, 2 global bits is the best choice for the 8K-bit buffer case.
There are many variations in the design of selective prediction schemes: the two sub-predictors can be chosen from different schemes, and, given a fixed total prediction buffer size, the three history tables can take different shares of the buffer space. McFarling used a gshare predictor and a bimodal predictor as the two sub-predictors; we use gselect and bimodal. Considering that a larger buffer benefits the gselect predictor more, we use a 2K-bit buffer for the bimodal predictor, a 4K-bit buffer for the gselect predictor, and a 2K-bit buffer for the selector, with 3 global history bits for the gselect sub-predictor.
Given the fast running speed of Shade, we could explore more schemes and run more programs; more importantly, we could trace many more instructions for each program. In our study, we traced most benchmark programs to completion. A typical benchmark program now executes more than several hundred million instructions, and this number keeps growing; tracing that many instructions may be beyond the capability of trace-file-based approaches, for which tracing and simulating at most several tens of millions of instructions is feasible. Since the branch behavior of some programs does not appear until the program runs to its middle stage, our tracing of several hundred million to several billion instructions per program is expected to provide more reliable results and more complete information about these benchmark programs.
Figure 10. Prediction Accuracies vs. Branch Instructions Traced.
Figure 11. Branch Taken Percentage vs. Branch Instructions Traced.