A Comparative Analysis of Branch Prediction Schemes

Zhendong Su and Min Zhou

Computer Science Division
University of California at Berkeley
Berkeley, CA 94720


Abstract

Conditional branches are a major obstacle to achieving higher performance in a high performance CPU. Accurate branch prediction is required to overcome this limitation and is the key to many techniques for enhancing and exploiting instruction level parallelism (ILP). Many branch prediction schemes have been proposed, and most of this work has been based on benchmark programs from SPEC89 and SPEC92. In this report, we present a comparative analysis of several well-known branch prediction schemes on the SPARC architecture, based on a partially new collection of benchmark programs including SPECint95-beta and SPECfp92. Compared to previous work, we have several interesting findings. We first show the performance of several well-known dynamic branch prediction schemes. From the results obtained, we conclude that the selective predictor achieves the lowest misprediction rate for a given branch prediction buffer size. We observe that static predictors without code expansion cannot compete with dynamic predictors, and that one SPECint95 program experiences very low prediction accuracy under all of these common schemes. We also observe that misleading data and conclusions may result from tracing only a few test programs or from tracing just a small portion of a program. Finally, we notice that context switching has little impact on branch prediction with today's fast CPUs, and that complex schemes need a longer time to warm up than simple schemes.



Introduction

Today's fast CPUs allow very deep pipelines and wide issue rates, two of the most effective ways of improving processor performance. Branches impede machine performance: a conditional branch is not resolved until its condition is evaluated and its target address is calculated, and an unconditional branch is not resolved until its target address is calculated. As pipelines get deeper or issue rates get higher, the penalty imposed by branches grows. One way to reduce this penalty is to predict the direction of a conditional branch and to pre-fetch, decode, and execute the instructions at the branch target. A large amount of speculative work must be thrown away after a branch misprediction, and this penalty grows as the memory hierarchy becomes more complex. Highly accurate branch prediction is thus the key to reducing this penalty, and many schemes have been proposed to reduce the misprediction rate.

Branch prediction schemes can be classified into static schemes and dynamic schemes by the way the prediction is made. Static prediction schemes can be simple. The most straightforward one predicts that branches are always taken, based on the observation that the majority of branches are taken. As reported by Lee and Smith [LS92], this simple strategy predicts correctly 68% of the time. In our study, 65% of the conditional branches in the dynamic instructions traced are taken. Our traces also indicate that this simple approach may yield less than 50% correct predictions for some integer programs. Static schemes can also be based on branch opcodes. Another simple method uses the direction of the branch to make a prediction: if the branch is backward, i.e., the target address is smaller than the PC of the branch instruction, it is predicted taken; if the branch is forward, it is predicted not taken. This strategy tries to take advantage of loops in the program. It works well for programs with many looping structures, but not when there are many irregular branches. Profiling is another static strategy: previous runs of a program are used to collect information on the tendency of a given branch to be taken or not taken, and a static prediction bit is preset in the opcode of that branch. Later runs of the program use this information to make predictions. This strategy suffers from the fact that runs of a program with different input data sets usually exhibit different branch behaviors. Recently, C. Young and M. Smith proposed static correlated branch prediction (SCBP), which trades increased code size for increased prediction accuracy. At this time, we do not know whether this approach will yield any performance improvement. For more information, refer to [YS94]. In this project, we studied the limit of the static approach without code expansion. Our results indicate that static schemes without code expansion are not comparable to dynamic approaches.
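
For illustration, the backward-taken/forward-not-taken heuristic just described amounts to a one-line address comparison. The following C sketch is illustrative only, with hypothetical names; it is not the code of any particular compiler or simulator.

    /* Backward-taken / forward-not-taken (BTFNT) static heuristic.
     * A branch whose target precedes it (a likely loop back-edge) is
     * predicted taken; a forward branch is predicted not taken.
     * Illustrative sketch only. */
    static int predict_btfnt(unsigned long pc, unsigned long target)
    {
        return target < pc;   /* 1 = predict taken, 0 = predict not taken */
    }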

Dynamic schemes differ from static ones in that they use the run-time behavior of branches to make predictions. J. Smith [S81] gave a survey of early simple static and dynamic schemes. The best scheme in his paper uses 2-bit saturating up/down counters to collect history information, which is then used to make predictions. This is perhaps the most well-known technique; McFarling [M93] referred to it as bimodal branch prediction. There are several variations in the design of the 2-bit counter, which Yeh and Patt [YP91] discussed. In many programs with intensive control flow, the direction of a branch is often affected by the behavior of other branches. Observing this fact, Pan, So, and Rahmeh [PSR92] and Yeh and Patt [YP91] independently proposed correlated branch prediction schemes, also called two-level adaptive branch prediction schemes. This new approach improved prediction accuracy by a large factor. Yeh and Patt [YP93] classified the variations of dynamic schemes that use two levels of branch history. McFarling [M93] explored the possibility of combining branch predictors to achieve even higher prediction accuracy.
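
As a concrete illustration of the 2-bit counter technique, the C sketch below shows one common design (the counter automata studied by Yeh and Patt differ mainly in their transition rules). It is an illustrative sketch, not taken from a specific implementation.

    /* One 2-bit saturating up/down counter as used by the bimodal scheme.
     * States 0 and 1 predict not taken; states 2 and 3 predict taken. */
    typedef unsigned char counter2_t;          /* holds values 0..3 */

    static int counter_predict(counter2_t c)
    {
        return c >= 2;                         /* 1 = predict taken */
    }

    static void counter_update(counter2_t *c, int taken)
    {
        if (taken)  { if (*c < 3) (*c)++; }    /* saturate at 3 */
        else        { if (*c > 0) (*c)--; }    /* saturate at 0 */
    }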

Computer technology is advancing at a rapid pace. Advanced VLSI technology makes it possible to have larger branch prediction tables and more complicated schemes. Advances in programming languages also make programs larger and more complicated, and allow more cross-references between branches because of more complicated procedure calls. Multiprocessing and threading are becoming important with the rise of Multiple Instruction stream, Multiple Data stream (MIMD) machines. It is therefore important to look at the effect of these advances on branch prediction.

In this project, we look at some of the following issues. Much of the branch prediction research in the literature is based on the MIPS, Alpha, HP-PA, and Power architectures. We are interested in the impact of a different architecture, SPARC, on branch prediction. Our results do not show any significant impact of the architecture on branch prediction; we obtain results similar to those of previous research. Taking advantage of the fast simulation speed of Shade, we are able to trace much larger programs, try out many different schemes, and experiment with different parameters of the schemes in a reasonable amount of time. We notice that the number of instructions traced clearly affects the observed branch behavior and prediction accuracy, and that the selection and size of the test program set also affect the comparison between schemes. Since conditional branches behave differently in different applications and programming languages, it is important to have a set of benchmark programs that faithfully represents the average workload and complexity of the programs people run. We use a partially new collection of programs, which includes 8 SPECint95-beta benchmarks and 13 SPECfp92 benchmarks, to see how well the well-known schemes work on these new programs. We observe that one new SPEC95 program, go, has branch behavior very different from previous SPEC programs. In this paper, we first show the performance of several well-known dynamic branch prediction schemes. From the results, we conclude that the selective predictor achieves the highest prediction accuracy for a given branch prediction buffer size. We also observe that conventional static predictors cannot compete with dynamic predictors, and that context switching has little impact on branch prediction with today's fast CPUs. In addition, complex schemes require a longer time to warm up than simple schemes do.

The rest of the report is organized as follows: the related work section gives references to previous work on branch prediction. The design methodology section discusses the methodology used in this study: how the simulated prediction models and test programs were selected, and how the simulated prediction schemes were designed and implemented. The result analysis section discusses our findings from traces of the benchmark programs. The future work section presents some directions that may be interesting to explore. The last section summarizes the report.



Related Work

Branch prediction performance issues have been studied extensively. J. Smith [S81] gave a survey of early simple static and dynamic schemes. The best scheme in his paper uses 2-bit saturating up/down counters to collect history information, which is then used to make predictions. This is perhaps the most well-known technique. McFarling [M93] referred to it as bimodal branch prediction; it was also referred to as one-level branch prediction in Yeh and Patt's paper [YP91]. We will discuss this scheme in more detail in later sections. Lee and Smith [LS92] evaluated several branch prediction schemes. In addition, they addressed how to use branch target buffers to reduce the delay due to target address calculation. McFarling and Hennessy [MH86] compared various hardware and software approaches to reducing branch cost, including the use of profiling information. Fisher and Freudenberger [FF92] studied the stability of profile information across separate runs of a program. In many programs with intensive control flow, the direction of a branch is often affected by the behavior of other branches. Observing this fact, Pan, So, and Rahmeh [PSR92] and Yeh and Patt [YP91] independently proposed correlated branch prediction schemes, also called two-level adaptive branch prediction schemes in Yeh and Patt's paper. Correlation schemes use both per-branch history and global branch history. Pan, So, and Rahmeh [PSR92] described how both global history and branch address information can be used in one predictor. This new approach improved prediction accuracy by a large factor. There are several variations of this kind of dynamic scheme, using different indexing methods and buffer organizations; Yeh and Patt [YP93] gave a comparison of these approaches. Several variations also exist in the design of the 2-bit counter used in many of the dynamic schemes; Yeh and Patt [YP91] discussed these variations. McFarling [M93] explored the possibility of combining branch predictors to achieve even higher prediction accuracy. He also presented an index-sharing scheme, referred to as gshare, and a new scheme using combined predictors. Ball and Larus [BL93] described several techniques for guessing the most common branch directions at compile time using static information. Young and M. Smith [YS94] [YS95] introduced the notion of static correlated branch prediction (SCBP). In a recent paper [GSM95], Gloy, M. Smith, and Young addressed performance issues of this approach and claimed better performance in comparison to some dynamic approaches. Several studies [JW89] [W91] have looked at the implications of branches on available instruction level parallelism (ILP). These studies show that branch misprediction is a crucial parameter in determining the amount of parallelism that can be exploited.



Design Methodology

First, we briefly describe the experimental system and test benchmark programs we use. Second, we describe the schemes that we implemented and tested. Third, we discuss how we chose the buffer organization and related parameters. Finally, we discuss the issues regarding the number of instructions being traced.

1. Experimental System

Our experiments were conducted on a SPARCstation 10 with two SuperSPARC/60 V8 microprocessors. We compiled the C benchmark programs with SunSoft "cc" version 3.0.1 and the Fortran benchmark programs with SunSoft "f77" version 3.0.1.

Our data were obtained using Shade [SHADE] version 5.15 analyzer programs. Shade is a dynamic code tracer that combines instruction set simulation and trace generation with custom trace analysis. The first advantage of a Shade-based simulator over static trace-based simulators, such as the pixie-based approach, is speed: Shade tends to run fast mainly because the executable being traced, the Shade trace generator, and the Shade analyzer all run in a single process. The second advantage is that it combines trace generation with the trace-based analyzer/simulator, thereby avoiding awkward trace file manipulation.

The following figure illustrates the code structure of our shade-based branch prediction simulators.

Figure 1. Shade-Based Branch Prediction Scheme Simulator

The main function is analyze(), which is invoked for each traced instruction specified in shade_main(). Provided with the program counter (pc), the effective address (ea), and the taken/not-taken outcome, we implemented the different branch prediction scheme simulators. In analyze(), we also implemented a simple profiler and an execution controller. The profiler gathers information about the program's branch behavior. The execution controller limits the dynamic execution size and, when simulating context switching, determines when to flush the branch history tables.
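
To make this structure concrete, the following simplified C sketch shows how such an analyze() routine can be organized around a single scheme (bimodal, for brevity). The trace-record layout, table size, and flush logic shown here are illustrative assumptions; the actual Shade interfaces differ in detail.

    /* Simplified sketch of the per-instruction analysis routine.  The
     * trace record fields mirror the information Shade supplies (pc, ea,
     * taken); the struct itself is a hypothetical stand-in. */
    struct trace_rec {
        unsigned long pc;     /* program counter of the traced instruction */
        unsigned long ea;     /* effective (target) address                */
        int is_cond_branch;   /* nonzero for a conditional branch          */
        int taken;            /* nonzero if the branch was taken           */
    };

    #define TABLE_ENTRIES 4096                  /* 8K bits = 4K 2-bit counters */
    static unsigned char table[TABLE_ENTRIES];  /* bimodal counters, as an example */
    static unsigned long long insts, branches, taken_branches, hits;
    static unsigned long long switch_interval;  /* 0 = no simulated context switch */

    static void analyze(const struct trace_rec *tr)
    {
        int i;
        insts++;

        /* Execution controller: flush prediction state every
         * switch_interval instructions to model a context switch. */
        if (switch_interval && insts % switch_interval == 0)
            for (i = 0; i < TABLE_ENTRIES; i++)
                table[i] = 0;

        if (!tr->is_cond_branch)
            return;

        /* Profiler: collect branch behavior statistics. */
        branches++;
        if (tr->taken)
            taken_branches++;

        /* Bimodal prediction and update (the other schemes plug in here). */
        {
            unsigned idx = (unsigned)((tr->pc >> 2) % TABLE_ENTRIES);
            int predict_taken = (table[idx] >= 2);
            if (predict_taken == (tr->taken != 0))
                hits++;
            if (tr->taken) { if (table[idx] < 3) table[idx]++; }
            else           { if (table[idx] > 0) table[idx]--; }
        }
    }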

2. Benchmark Programs

In different applications and programming languages, conditional branches behave differently. It is important to have a set of benchmark programs that gives a good approximation of the average workload and complexity of the programs that users run. Previous work in this area has been done by tracing the execution of benchmark programs, and in this project we also use instruction tracing to measure the performance of different branch prediction schemes. Eight benchmark programs from the beta version of the SPEC95 integer suite and thirteen benchmarks from the SPEC92 floating point suite are used in this study. Tables 1 and 2 list the benchmark programs, the abbreviations we use for them, and the test input data sets used in our experiments.

SPEC95 Integer Program Beta
Benchmark / Input | Dynamic Inst. | Dynamic Cond. Branches | Program Size / Input Size | Static Cond. Branches
gcc / 1amptjp.i | 1297M | 221M | 1697K / 222K | 19598
gcc / 1c-decl-s.i | 1297M | 221M | 1697K / 222K | 19603
gcc / 1dbxout.i | 1664M | 28M | 1697K / 42K | 15455
gcc / 1reload1.i | 992M | 173M | 1697K / 148K | 19673
gcc / cccp.i | 1298M | 223M | 1697K / 162K | 19514
gcc / insn-emit.i | 147M | 23M | 1697K / 48K | 10815
gcc / stmt-protoize.i | 986M | 165M | 1697K / 185K | 19746
ghost / convolution.ps-color | 1400M | 238M | 584K / 218K | 4262
ghost / convolution.ps-mono | 1342M | 229M | 584K / 218K | 4312
ghost / convolution.ps-tiff | 1315M | 222M | 584K / 218K | 4330
go / restart.in* | 4535M | 500M | 390K | 5761
go / neardone.in | 733M | 78M | 390K | 4874
go / null.in* | 4531M | 500M | 390K | 5742
m88ksim / dcrand.in* | 7007M | 1000M | 389K / 66K | 824
numi / numi.in* | 7507M | 1000M | 31K | 1064
li / li-input.lsp* | 12355M | 1000M | 299K | 1412
perl / jumble.perl* | 3471M | 500M | 400K | 2523
perl / primes.perl* | 4762M | 500M | 400K | 2218
vortex / vortex.in* | 7195M | 1000M | 867K | 7602

Table 1. SPEC95 Integer Program Beta and Input Data Description.

(All programs are listed in alphabetical order. Entries with * denote programs whose tracing was interrupted before completion. The number of static conditional branches is the number of distinct static conditional branches traced. Only numi is a Fortran program; all the other programs are written in C. The input data sizes of gcc, ghost, and m88ksim are related to the tracing sizes.)

SPEC92 Floating Point Program
Benchmark | Dynamic Inst. | Dynamic Cond. Branches | Program Size | Static Cond. Branches
alvinn | 6792M | 480M | 9612 | 1032
doduc | 1644M | 87M | 247K | 2330
ear | 14506M | 705M | 59K | 1238
fpppp | 8463M | 106M | 138K | 1332
hydro2d | 6627M | 680M | 111K | 2356
mdljdp2 | 4206M | 309M | 79K | 1458
mdljsp2 | 3011M | 338M | 98K | 1520
nasa7 | 11104M | 217M | 91K | 1889
ora | 2029M | 158M | 24K | 1153
su2cor | 8055M | 165M | 150K | 1863
swm256 | 9862M | 66M | 62K | 1335
tomcatv | 1261M | 31M | 21K | 1036
wave5 | 4331M | 286M | 401K | 1956

Table 2. SPEC92 Floating Point Program Description.

(All programs are listed in alphabetical order. The number of static conditional branches is the number of distinct static conditional branches traced. Alvinn and ear are C programs; all the other programs are written in Fortran.)

3. Branch Prediction Scheme Design

Figure 2, 3. Bimodal Predictor | Correlation Based Predictor

Figure 4, 5. Index Sharing Predictor | Selective Predictor
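
Since the figures above summarize the predictor designs only graphically, the following C sketch outlines how these predictors index their tables of 2-bit counters and how a selective (combining) predictor chooses between two component predictors. The table size, history depth, and names are illustrative assumptions rather than the exact configurations used in our simulator.

    /* Index computation for the dynamic schemes, assuming a table of
     * 2^INDEX_BITS 2-bit counters and a global history shift register. */
    #define INDEX_BITS 12                      /* 4K counters = 8K bits   */
    #define TABLE_MASK ((1u << INDEX_BITS) - 1)
    #define HIST_BITS  6                       /* global history depth    */
    #define HIST_MASK  ((1u << HIST_BITS) - 1)

    static unsigned global_hist;               /* last HIST_BITS outcomes */

    /* bimodal: low-order bits of the branch address only. */
    static unsigned bimodal_index(unsigned long pc)
    {
        return (unsigned)((pc >> 2) & TABLE_MASK);
    }

    /* correlation / gselect: concatenate global history with address bits. */
    static unsigned gselect_index(unsigned long pc)
    {
        unsigned addr_bits = INDEX_BITS - HIST_BITS;
        return ((global_hist & HIST_MASK) << addr_bits) |
               (unsigned)((pc >> 2) & ((1u << addr_bits) - 1));
    }

    /* gshare: XOR the global history into the address bits so both
     * sources can use the full index width. */
    static unsigned gshare_index(unsigned long pc)
    {
        return (unsigned)(((pc >> 2) ^ global_hist) & TABLE_MASK);
    }

    /* selective (combining): a table of 2-bit choosers picks, per branch,
     * which of two component predictors to trust; the chooser is trained
     * toward whichever component was correct when they disagree. */
    static unsigned char chooser[1u << INDEX_BITS];

    static void chooser_update(unsigned idx, int p1_correct, int p2_correct)
    {
        if (p1_correct && !p2_correct && chooser[idx] < 3) chooser[idx]++;
        if (p2_correct && !p1_correct && chooser[idx] > 0) chooser[idx]--;
    }

    /* After each conditional branch, shift its outcome into the history. */
    static void history_update(int taken)
    {
        global_hist = ((global_hist << 1) | (taken != 0)) & HIST_MASK;
    }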

4. Branch Prediction Scheme Tuning

Before gathering data for all the benchmark programs, we used a small set of programs to determine the best parameters for each scheme. We tested different 2-bit counter designs, different correlation depths for the gselect and local schemes, and different numbers of global history bits for gshare.

Figure 6, 7. 2bit Counter State Diagram Design | Correlation Depth vs. Prediction Accuracy

Figure 8-9. Local Scheme Correlation Depth vs. Prediction Accuracy.
| Gshare Scheme Global Branch History Bit Adoption vs Prediction Accuracy

5. The Number of Branch Instruction Being Traced

Given the fast running speed of Shade, we can explore more schemes and run more programs. More importantly, we can trace many more instructions for each program. In our study, we traced most benchmark programs to completion. A benchmark program typically executes more than several hundred million instructions, and this number keeps growing; tracing that many instructions may be beyond the capability of trace-file-based approaches, which can feasibly trace and simulate at most several tens of millions of instructions. Since the branch behavior of some programs does not emerge until the program is well into its run, tracing several hundred million to several billion instructions per program is expected to provide more reliable results and more complete information about these benchmark programs.

Figure 10, 11. Prediction Accuracies vs. Branch Instructions Traced
| Branch Taken Percentage vs. Branch Instruction Traced

The two figures above show how the branch behavior and the performance of the different schemes vary over the course of tracing the SPEC program gcc with input cccp.i. The difference between the lowest point in the range of 1M-10M branches and the highest point in the range of 10M-100M branches is 10% for the taken percentage and 5% for the prediction accuracy.



Result Analysis and Discussion

The simulation results presented in this section were obtained with all schemes except the static predictor using 8K bits of branch prediction buffer, i.e., 4K entries of 2-bit counters. The static scheme studied here is an idealized perfect-profiling static scheme without code expansion. For the gshare, gselect, local, and selective predictors, we varied the number of address bits and the history depth and empirically selected the best-performing configuration from each group.

To evaluate the performance of the seven schemes, we first studied the branch behavior of the 21 benchmark programs used in this project. From this information, we are able to calculate a limit on the accuracy of a static predictor. Second, we give a comprehensive comparison of the seven schemes. Third, we look at the effect of varying the buffer size on prediction accuracy for some dynamic schemes. Fourth, the effect of context switching on branch prediction is discussed. Finally, we look at an interesting benchmark program from SPECint95-beta.

1. Benchmark Program Branch Behavior

We first looked at the branch behavior of the 21 benchmark programs used in this study. Table 4 summarizes the information for the 8 integer programs from SPECint95-beta, and Table 5 for the 13 floating point (FP) programs from SPECfp92. The second and third columns of these two tables list the number of dynamic instructions traced and the number of dynamic conditional branches traced, respectively. The last two columns list the number of taken conditional branches and their percentage of the total number of conditional branches traced. We traced 16 of the 21 benchmark programs to completion.

Program | Traced Inst. | Dynamic Cond. Branches | Taken Branches | Taken %
gcc | 6186539354 | 1057556947 | 566093312 | 53.53
ghost | 4057743776 | 690986805 | 339710533 | 49.16
go | 9800796393 | 1078696639 | 504611777 | 46.78
li+ | 7007645073 | 1000000000 | 551847194 | 55.18
m88ksim+ | 7507989139 | 1000000000 | 479341651 | 47.93
numi+ | 12355029197 | 1000000000 | 639221685 | 63.92
perl+ | 8234202623 | 1000000000 | 474993437 | 47.50
vortex+ | 7195515374 | 1000000000 | 425973672 | 42.60
(+ indicates that the analyzer was interrupted in the middle of tracing.)

Table 4. Branch behavior of benchmark programs from SPECint95-beta

Program | Traced Inst. | Dynamic Cond. Branches | Taken Branches | Taken %
alvinn | 6792027933 | 480143481 | 469154917 | 97.71
doduc | 1644410091 | 87092576 | 48388199 | 55.56
ear | 14506557248 | 705032704 | 466983346 | 66.24
fpppp | 8463667939 | 106307360 | 67686215 | 63.67
hydro2d | 6627612755 | 680214394 | 515386882 | 75.77
mdljdp2 | 4206065214 | 309377515 | 215543933 | 69.67
mdljsp2 | 3011635408 | 338499291 | 195972810 | 57.89
nasa7 | 11104431137 | 217326271 | 186014771 | 85.59
ora | 2029511987 | 158386472 | 82174861 | 51.88
su2cor | 8055850151 | 165611544 | 140751098 | 84.99
swm256 | 9862718926 | 66039940 | 61056566 | 92.45
tomcatv | 1261279753 | 31605162 | 30958962 | 97.96
wave5 | 4331716191 | 286632343 | 194798779 | 67.96

Table 5. Branch behavior of benchmark programs from SPECfp92

Figure 12 The Frequencies of Conditional Branches

Figure 13 Percentage of conditional branches that are taken

Figure 14 Branch bias, weighted by execution frequency.

2. Scheme Comparison

Figure 15 shows the prediction accuracies of the seven prediction schemes for each of the 21 benchmarks. All the schemes except the static scheme use an 8K-bit branch prediction buffer. For most of the 21 benchmarks, the static predictor's prediction accuracy is the lowest of the seven schemes measured. In contrast, the selective predictor achieves the highest prediction accuracy on most of the benchmarks. Of the six dynamic predictors, the bimodal scheme, which is the simplest, performs the worst; gshare and common correlation are slightly better, and the prediction accuracies of local and gselect are slightly lower than that of selective.

Figure 15 Seven schemes performance on the 21 benchmark programs
Detailed pairwise comparisons:
comparison of static, bimodal and common-correlation schemes
comparison of gshare, common-correlation, local and gselect schemes
comparison of gselect, local and selective schemes
comparison of gselect, local and common-correlation schemes
comparison of static, bimodal and selective schemes

Figure 16 Seven schemes average prediction accuracy over the 21 benchmark programs

Benchmarks | Schemes ordered by average prediction accuracy (worst to best)
INT | static 89.8% | bimodal 89.8% | gshare 90.3% | correlation 90.8% | local 91.3% | gselect 91.8% | selective 92.7%
FP | static 93.3% | bimodal 94.4% | correlation 94.7% | gshare 94.7% | gselect 95.3% | selective 95.5% | local 95.6%
ALL | static 92.0% | bimodal 92.6% | gshare 93.0% | correlation 93.2% | local 93.9% | gselect 93.9% | selective 94.4%

Table 6. Performance summary of the seven schemes

Figure 16 shows the seven prediction schemes' average prediction accuracies for the integer programs, the FP programs, and all 21 programs. Table 6 summarizes the relative performance of the seven schemes.

We observe the following:

3. Effects of Changing Buffer Size

Advances in VLSI technology will make larger branch prediction tables possible in the near future. We look at the effect of varying the buffer size on three dynamic schemes: bimodal, common-correlation, and gselect. The benchmark program we tested is gcc with cccp.i as the input file.

The results are shown in Figure 17, with curves for the three dynamic schemes. The curve for the bimodal scheme flattens when the number of 2-bit counters grows beyond about 5000, which agrees with results from previous research. Therefore, for the bimodal scheme, buffer size is not a limiting factor once the buffer exceeds 8K bits (4K 2-bit counters). The bimodal scheme behaves this way because 5000 distinct branches is already a large number for a typical program run: as shown in Tables 1 and 2, excluding gcc and vortex, the benchmarks studied have only between about 1000 and 5000 static conditional branches. The common correlation scheme still shows a noticeable improvement with increased buffer size up to 20K bits (10K 2-bit counters). Even when the buffer size reaches 200K bits, the gselect scheme still shows significant improvement, provided the correlation depth is also increased accordingly.
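
As a small worked example of these sizes, the sketch below relates buffer size in bits to the number of counters and index bits; it is illustrative arithmetic only, and the particular gselect history/address split is an assumption.

    #include <stdio.h>

    /* Relate buffer size in bits to 2-bit counters and index bits. */
    int main(void)
    {
        unsigned bits = 8 * 1024;        /* an 8K-bit prediction buffer    */
        unsigned counters = bits / 2;    /* 2 bits per counter -> 4096     */
        unsigned index_bits = 0;
        while ((1u << index_bits) < counters)
            index_bits++;                /* 4096 counters -> 12 index bits */

        /* An example gselect split of those index bits (an assumption). */
        unsigned history_bits = 6;
        printf("%u counters, %u index bits (%u history + %u address)\n",
               counters, index_bits, history_bits, index_bits - history_bits);
        return 0;
    }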

Figure 17 Effects of varying branch prediction buffer size

4. Effects of Context Switching

Context switches occur frequently in multitasking computer systems. After a context switch, the branch prediction tables are normally invalidated or flushed, so context switches have a detrimental effect on branch prediction accuracy. To observe this effect, we flushed the prediction tables after a fixed number of instructions, varying the interval between simulated context switches from 5K and 10K instructions up to 2.6M instructions. We used three dynamic predictors, each with 8K bits: bimodal, common-correlation, and gselect. The benchmark program we tested is gcc from SPECint95-beta with cccp.i as the input file.
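
A minimal sketch of how this experiment can be organized is shown below. Here run_trace() is a hypothetical stand-in for replaying the benchmark through the analyzer with a given flush interval, and the accuracy it returns is a placeholder so the sketch is self-contained.

    #include <stdio.h>

    /* Stub standing in for a full trace replay with the given flush
     * interval (0 = never flush); returns a placeholder accuracy. */
    static double run_trace(unsigned long flush_interval)
    {
        (void)flush_interval;
        return 0.90;
    }

    int main(void)
    {
        static const unsigned long intervals[] =
            { 5000, 10000, 50000, 100000, 500000, 1000000, 2600000 };
        double baseline = run_trace(0);   /* accuracy with no context switches */
        unsigned i;

        for (i = 0; i < sizeof intervals / sizeof intervals[0]; i++) {
            double acc = run_trace(intervals[i]);
            printf("interval %8lu: accuracy %5.2f%% (loss %5.2f%%)\n",
                   intervals[i], 100.0 * acc, 100.0 * (baseline - acc));
        }
        return 0;
    }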

The results are shown in Figure 18, with curves for bimodal, correlation, and gselect. The three horizontal lines in the figure show the prediction accuracies for each scheme in the case of no context switch. For all three schemes, we observe that the effect of context switches on prediction rate decreases as the number of instructions between context switches gets larger. We see very little effect after the number of instructions is over 1 million.

Because of the increased speed of CPUs, the number of instructions between context switches is getting larger. For a 50-MIPS machine, it is about 3 million instructions, assuming the commonly cited 16 context switches per second on a UNIX system. Therefore, context switching has little effect on branch prediction accuracy. However, threads and multiprocessing are growing in importance, and the number of instructions between context switches for these lightweight processes is sometimes much shorter than for conventional context switches. For this reason, the effect of context switches should not be overlooked.

We also notice that for less complicated schemes such as bimodal, the effect of context switches is not as pronounced as for more complicated schemes such as gselect. This indicates that complicated schemes need a longer time to warm up than simple schemes.

Figure 18 Effects of context switching on prediction accuracy

5. A Special Case Study: go

In analyzing the simulation results, we observed that every scheme performed poorly on go, a new benchmark program from SPECint95-beta. To find out why, we carried out some additional study of this benchmark.



Future Work

There are a number of ways that this study can be extended.



Conclusion

In this report, we have analyzed several of the most well-known and effective branch prediction schemes with respect to their prediction accuracy and cost effectiveness. The schemes we examined are static, bimodal, common correlation, gshare, local, gselect, and selective. Thanks to the speed of the Shade analyzers, we were able to trace hundreds of times more instructions than previous research and obtain more accurate results.

The following findings may be of interest for future research in branch prediction and new architecture design.



References

[BL93] T. Ball and J. Larus, "Branch Prediction for Free," Proc. ACM SIGPLAN '93 Conf. on Programming Language Design and Implementation, 1993.

[CG94] B. Calder and D. Grunwald, "Fast & Accurate Instruction Fetch and Branch Prediction," Intl. Symp. on Computer Architecture, Apr. 1994.

[FF92] J. Fisher and S. Freudenberger, "Predicting Conditional Branch Directions From Previous Runs of a Program," Proc. 5th Annual Intl. Conf. on Architectural Support for Prog. Lang. and Operating Systems, Oct. 1992.

[GSM95] N. Gloy, M. Smith, and C. Young, "Performance Issues in Correlated Branch Prediction Schemes," to appear in Proc. 28th Annual IEEE/ACM Intl. Symp. on Microarchitecture, Nov. 1995.

[JW89] N. Jouppi and D. Wall, "Available Instruction-Level Parallelism for Superscalar and Superpipelined Machines," Proc. ASPLOS III, Apr. 1989.

[LS92] J. Lee and A. Smith, "Branch Prediction Strategies and Branch Target Buffer Design," IEEE Computer, 17(1), Jan. 1984.

[M93] S. McFarling, "Combining Branch Predictors," Technical Report, Digital Western Research Laboratory, Jun. 1993.

[MH86] S. McFarling and J. Hennessy, "Reducing the Cost of Branches," Proc. 13th Annual Intl. Symp. on Computer Architecture, Jun. 1986.

[PH95] D. Patterson and J. Hennessy, "Computer Architecture: A Quantitative Approach," 2nd Edition, Morgan Kaufmann Publishers, Inc., 1995.

[PSR92] S. Pan, K. So, and J. Rahmeh, "Improving the Accuracy of Dynamic Branch Prediction Using Branch Correlation," Proc. 5th Annual Intl. Conf. on Architectural Support for Prog. Lang. and Operating Systems, Oct. 1992.

[S81] J. Smith, "A Study of Branch Prediction Strategies," Proc. 8th Annual Intl. Symp. on Computer Architecture, May 1981.

[SHADE] Sun Microsystems, "Shade Manual."

[W91] D. Wall, "Limits of Instruction-Level Parallelism," Proc. ASPLOS IV, Apr. 1991.

[YP91] T. Yeh and Y. Patt, "Two-Level Adaptive Training Branch Prediction," Proc. 24th Annual ACM/IEEE Intl. Symp. and Workshop on Microarchitecture, Nov. 1991.

[YP93] T. Yeh and Y. Patt, "A Comparison of Dynamic Branch Predictors that use Two Levels of Branch History," Proc. 20th Annual Intl. Symp. on Computer Architecture, May 1993.

[YS94] C. Young and M. Smith, "Improving the Accuracy of Static Branch Prediction Using Branch Correlation," Proc. 6th Intl. Conf. on Architectural Support for Prog. Lang. and Operating Systems, Oct. 1994.

[YS95] C. Young and M. Smith, "A Comparative Analysis of Schemes for Correlated Branch Prediction," Proc. 22nd Annual Intl. Symp. on Computer Architecture, Jun. 1995.

