A Comparative Analysis of Branch Prediction Schemes

Zhendong Su and Min Zhou

Computer Science Division
University of California at Berkeley
Berkeley, CA 94720


Result Analysis and Discussion

All simulation results presented in this section were obtained with an 8K-bit prediction buffer, i.e. 4K entries of 2-bit counters, for every scheme except the static predictor. The static scheme studied here is an idealized, profile-based scheme that assumes perfect profiling and no code expansion. For the gshare, gselect, local, and selective predictors, we varied the number of address bits and the history depth and report the best-performing configuration from each group.
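All of the dynamic schemes here are built from the same basic element, the 2-bit saturating counter (4K of them in an 8K-bit buffer). The following sketch shows the counter's predict and update behavior; the function names are illustrative rather than taken from our simulator.

    /* A 2-bit saturating counter: values 0-1 predict not-taken, 2-3 predict
     * taken.  Each outcome moves the counter one step, so a single anomalous
     * outcome does not flip the prediction of a strongly biased branch. */

    typedef unsigned char counter2_t;            /* holds values 0..3 */

    static int predict_taken(counter2_t c)
    {
        return c >= 2;                           /* 2 or 3 => predict taken */
    }

    static counter2_t update(counter2_t c, int taken)
    {
        if (taken)
            return c < 3 ? c + 1 : 3;            /* saturate at strongly taken */
        return c > 0 ? c - 1 : 0;                /* saturate at strongly not-taken */
    }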

To evaluate the performance of the seven schemes, we first studied the branch behavior of the 21 benchmark programs used in this project; from this information we can compute an upper bound on the accuracy of a static predictor. Second, we give a comprehensive comparison of the seven schemes. Third, we examine the effect of varying the buffer size on prediction accuracy for several dynamic schemes. Fourth, we discuss the effect of context switching on branch prediction. Finally, we look at an interesting benchmark program from SPECint95-beta.

1. Benchmark Program Branch Behavior

We first looked at the branch behavior of the 21 benchmark programs used in this study. Table 4 summarizes the information for the 8 integer programs from SPECint95-beta, and Table 5 does the same for the 13 floating-point (FP) programs from SPECfp92. The second and third columns of these tables list the number of dynamic instructions traced and the number of dynamic conditional branches traced, respectively. The last two columns list the number of taken conditional branches and their percentage of the total conditional branches in the third column. We traced 16 of the 21 benchmark programs to completion.

Program      Traced_Inst#   Dynamic_Branch#   Taken_Branch#   Taken_Branch%
gcc 6186539354 1057556947 566093312 53.53
ghost 4057743776 690986805 339710533 49.16
go 9800796393 1078696639 504611777 46.78
li+ 7007645073 1000000000 551847194 55.18
m88ksim+ 7507989139 1000000000 479341651 47.93
numi+ 12355029197 1000000000 639221685 63.92
perl+ 8234202623 1000000000 474993437 47.50
vortex+ 7195515374 1000000000 425973672 42.60
(+ indicates that the trace analyzer was stopped before the program ran to completion.)

Table 4. Branch behavior of benchmark programs from SPECint95-beta

Program      Traced_Inst#   Dynamic_Branch#   Taken_Branch#   Taken_Branch%
alvinn 6792027933 480143481 469154917 97.71
doduc 1644410091 87092576 48388199 55.56
ear 14506557248 705032704 466983346 66.24
fpppp 8463667939 106307360 67686215 63.67
hydro2d 6627612755 680214394 515386882 75.77
mdljdp2 4206065214 309377515 215543933 69.67
mdljsp2 3011635408 338499291 195972810 57.89
nasa7 11104431137 217326271 186014771 85.59
ora 2029511987 158386472 82174861 51.88
su2cor 8055850151 165611544 140751098 84.99
swm256 9862718926 66039940 61056566 92.45
tomcatv 1261279753 31605162 30958962 97.96
wave5 4331716191 286632343 194798779 67.96

Table 5. Branch behavior of benchmark programs from SPECfp92
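The counts in Tables 4 and 5 amount to a single pass over each instruction trace. The sketch below shows the idea; the trace record layout and the function name are assumptions for illustration, not our analyzer's actual interface.

    #include <stdio.h>

    struct trace_rec {              /* assumed layout: one record per instruction */
        int is_cond_branch;         /* nonzero for conditional branches */
        int taken;                  /* nonzero if the branch was taken  */
    };

    /* Count dynamic instructions, conditional branches, and taken branches,
     * then print the taken percentage reported in the tables. */
    void branch_stats(const struct trace_rec *trace, long n_insts)
    {
        long branches = 0, taken = 0;

        for (long i = 0; i < n_insts; i++) {
            if (trace[i].is_cond_branch) {
                branches++;
                if (trace[i].taken)
                    taken++;
            }
        }
        printf("insts=%ld  cond_branches=%ld  taken=%ld  taken%%=%.2f\n",
               n_insts, branches, taken,
               branches ? 100.0 * taken / branches : 0.0);
    }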

Figure 12. Frequencies of conditional branches

Figure 13. Percentage of conditional branches that are taken

Figure 14. Branch bias, weighted by execution frequency

2. Scheme Comparison

Figure 15 shows the prediction accuracies of the seven prediction schemes on each of the 21 benchmarks. All schemes except the static one use an 8K-bit branch prediction buffer. For most of the 21 benchmarks, the static predictor has the lowest prediction accuracy of the seven schemes measured. In contrast, the selective predictor achieves the highest prediction accuracy on most of the benchmarks. Of the six dynamic predictors, the bimodal scheme, which is the simplest, performs the worst; gshare and common-correlation are slightly better, and the prediction accuracies of local and gselect are slightly lower than that of selective.
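The two global-history schemes differ mainly in how they form the table index from the branch address and the global history register. A minimal sketch follows, assuming the 8K-bit configuration (4K counters, hence a 12-bit index) and a 6/6 address/history split for gselect; as noted above, the splits used in our runs were chosen empirically.

    /* Index computation for the global-history schemes, assuming 4K counters
     * (12 index bits).  ghr is the global history register, holding one bit
     * per recent conditional-branch outcome. */

    #define INDEX_BITS 12
    #define MASK(n)    ((1u << (n)) - 1u)

    /* gshare: XOR the low branch-address bits with the global history. */
    unsigned gshare_index(unsigned pc, unsigned ghr)
    {
        return (pc ^ ghr) & MASK(INDEX_BITS);
    }

    /* gselect: concatenate address bits with history bits
     * (a 6/6 split is assumed here for illustration). */
    unsigned gselect_index(unsigned pc, unsigned ghr)
    {
        return ((pc & MASK(6)) << 6) | (ghr & MASK(6));
    }

    /* After each conditional branch resolves, its outcome is shifted into the
     * history register:  ghr = ((ghr << 1) | taken) & MASK(history_bits);   */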

Figure 15. Performance of the seven schemes on the 21 benchmark programs
(Click the following links for detailed views of the comparison.)
comparison of static, bimodal and common-correlation schemes
comparison of gshare, common-correlation, local and gselect schemes
comparison of gselect, local and selective schemes
comparison of gselect, local and common-correlation schemes
comparison of static, bimodal and selective schemes

Figure 16. Average prediction accuracy of the seven schemes over the 21 benchmark programs

Benchmarks  Schemes ordered by performance (from worst to best)
INT         static       bimodal      gshare       correlation  local        gselect      selective
            89.8%        89.8%        90.3%        90.8%        91.3%        91.8%        92.7%
FP          static       bimodal      correlation  gshare       gselect      selective    local
            93.3%        94.4%        94.7%        94.7%        95.3%        95.5%        95.6%
ALL         static       bimodal      gshare       correlation  local        gselect      selective
            92.0%        92.6%        93.0%        93.2%        93.9%        93.9%        94.4%

Table 6. Performance summary of the seven schemes

Figure 16 shows the seven prediction schemes' average prediction accuracies for the integer programs, the FP programs, and all 21 programs together. Table 6 summarizes the relative performance of the seven schemes.

We observe the following. The static scheme has the lowest average accuracy in every group, and the selective scheme the highest for the integer programs and overall. The FP programs are far more predictable than the integer programs: even the static scheme reaches 93.3%, all schemes fall within a 2.3-percentage-point band, and the local scheme narrowly beats selective. The gap between the simplest dynamic scheme (bimodal) and the best dynamic scheme is about 3 percentage points for the integer programs but little more than 1 point for the FP programs.
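We take the selective scheme to be a McFarling-style combining predictor: two component predictors run in parallel, and a separate table of 2-bit chooser counters, indexed by the branch address, learns which component to trust for each branch. The sketch below shows only the choosing logic under that assumption; the entry count and names are illustrative, and p1 and p2 stand for the two component predictions.

    /* Chooser for a combining ("selective") predictor: chooser[i] >= 2 means
     * trust predictor 1, otherwise trust predictor 2.  The chooser is nudged
     * toward whichever component was correct when the two disagree. */

    #define CHOOSER_ENTRIES 1024                     /* illustrative size */
    static unsigned char chooser[CHOOSER_ENTRIES];   /* 2-bit counters, 0..3 */

    int select_prediction(unsigned pc, int p1, int p2)
    {
        return chooser[pc % CHOOSER_ENTRIES] >= 2 ? p1 : p2;
    }

    void update_chooser(unsigned pc, int p1, int p2, int taken)
    {
        unsigned char *c = &chooser[pc % CHOOSER_ENTRIES];

        if (p1 == taken && p2 != taken && *c < 3)
            (*c)++;                                  /* component 1 was right */
        else if (p2 == taken && p1 != taken && *c > 0)
            (*c)--;                                  /* component 2 was right */
    }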

3. Effects of Changing Buffer Size

Advances in VLSI technology will make larger branch prediction tables possible in the near future. We therefore examined the effect of varying the buffer size for three dynamic schemes: bimodal, common-correlation, and gselect. The benchmark program tested is gcc with cccp.i as the input file.

The results are shown in Figure 17, with one curve for each of the three dynamic schemes. The curve for the bimodal scheme flattens out once the number of 2-bit counters exceeds about 5000, which agrees with previous research. Thus, for the bimodal scheme, buffer size is not a limiting factor beyond 8K bits (4K 2-bit counters). The bimodal scheme behaves this way because a typical program run rarely touches more than about 5000 distinct conditional branches: as shown in Table 1 and Table 2, all of the benchmarks studied except gcc and vortex have only between 1000 and 5000 static conditional branches. The common-correlation scheme still improves noticeably as the buffer grows, up to about 20K bits (10K 2-bit counters). Even at 200K bits, the gselect scheme continues to show significant improvement, provided the correlation depth is increased along with the buffer size.
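This scaling behavior follows from how buffer size translates into index bits: with 2 bits per counter, an N-bit buffer supplies log2(N/2) index bits, which gselect splits between address bits and history bits. The helper below is a small illustration (the function name and example splits are ours, not from the simulator).

    #include <stdio.h>

    /* For a power-of-two buffer of 2-bit counters, report how many index bits
     * are available and how a given history depth splits them for gselect. */
    void describe_buffer(long buffer_bits, int history_bits)
    {
        long counters   = buffer_bits / 2;       /* 2 bits per counter */
        int  index_bits = 0;

        while ((1L << index_bits) < counters)    /* log2 for power-of-two sizes */
            index_bits++;

        printf("%ldK bits -> %ld counters, %d index bits (%d address + %d history)\n",
               buffer_bits / 1024, counters, index_bits,
               index_bits - history_bits, history_bits);
    }

    /* describe_buffer(8192, 6)   ->   8K bits: 4096 counters,  12 index bits
     * describe_buffer(131072, 9) -> 128K bits: 65536 counters, 16 index bits
     * A bigger table adds index bits, and the gselect curve in Figure 17 keeps
     * improving when the correlation depth grows along with the table.        */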

Figure 17. Effects of varying the branch prediction buffer size

4. Effects of Context Switching

Context switches occur frequently in multitasking computer systems. After a context switch, the branch prediction tables are normally invalidated or flushed, so context switches degrade branch prediction accuracy. To observe this effect, we flushed the prediction table after a fixed number of instructions, with intervals of 5K, 10K, ..., up to 2.6M instructions between context switches. We used three dynamic predictors, each with an 8K-bit buffer: bimodal, common-correlation, and gselect. The benchmark program tested is gcc from SPECint95-beta with cccp.i as the input file.
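A minimal sketch of how the flush was modelled is shown below; the function and variable names are illustrative, not from our simulator. The counter table is simply cleared every switch_interval instructions while the trace is replayed.

    #include <string.h>

    #define NUM_COUNTERS 4096                    /* 8K-bit buffer */

    static unsigned char table[NUM_COUNTERS];    /* 2-bit counters, values 0..3 */

    /* Replay n_insts trace instructions, clearing the prediction table every
     * switch_interval instructions to imitate the flush after a context switch. */
    void simulate_with_switches(long n_insts, long switch_interval)
    {
        long since_switch = 0;

        for (long i = 0; i < n_insts; i++) {
            if (++since_switch >= switch_interval) {
                memset(table, 0, sizeof table);  /* flush on context switch */
                since_switch = 0;
            }
            /* ... read the next trace record, predict, and update `table`
             *     with whichever scheme is being measured ... */
        }
    }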

The results are shown in Figure 18, with curves for bimodal, correlation, and gselect. The three horizontal lines in the figure mark each scheme's prediction accuracy with no context switches. For all three schemes, the effect of context switches on prediction accuracy shrinks as the interval between switches grows, and it becomes negligible once the interval exceeds about 1 million instructions.

Because CPU speeds keep increasing, the number of instructions executed between context switches keeps growing. For a 50-MIPS machine and the commonly assumed 16 context switches per second on a UNIX system, the interval is about 3 million instructions (50 × 10^6 / 16 ≈ 3.1 × 10^6). At that scale, context switching has little effect on branch prediction accuracy. However, threads and multiprocessing are growing in importance, and the number of instructions between context switches for these lightweight processes can be much smaller than for conventional processes. For this reason, the effect of context switches should not be overlooked.

We also notice that the effect of context switches is less pronounced for simple schemes such as bimodal than for more complex schemes such as gselect. This indicates that the more complex schemes need a longer warm-up time than the simple ones.

Figure 18. Effects of context switching on prediction accuracy

5. A Special Case Study: go

In analyzing the simulation results, we observed that every scheme performed poorly on go, a new benchmark program from SPECint95-beta. To find out why, we studied this benchmark in more detail.

