## WPE Exam Question for Computer Architecture - Fall 2012 (a) [0.8 pts] An IEEE single-precision floating point number has 32 bits. For simplicity, consider a 5-bit floating point representation that maintains all the other characteristics of the IEEE floating point standard (denormalization, etc.). Our 5-bit floating point representation uses 1 sign bit (S), a 2-bit mantissa (M), and a 2-bit exponent (E) in excess-1 format (bias of 1), as depicted below. | 1 | | | | |---|-----------|------------|------------| | | S (1 bit) | E (2 bits) | M (2 bits) | Fill in the table below showing the decimal value and excess-1 notation for each of the possible 2-bit exponents. Then, using the table, find the decimal values of the numbers A=00001 and B=00100. | 2s complement | Decimal | Excess-1 | |---------------|---------|----------| | 00 | | 7 | | 01 | | | | 10 | | | | 11 | | | A = 00001 = ?B = 00100 = ? (b) [0.8 pts] Considering the 5-bit floating point representation and the values of A and B from part (a), find the value of the register RZ in each of the programs below, when the programs are executed on an in-order processor. The instruction ADDFP RA RB RZ adds the contents of register RA to the contents of register RB using a 5-bit floating point adder and stores the result in register RZ. In each program, assume that the initial contents of RA and RB are the values of A and B from part (a), respectively. | Program 1: | | | | Program 2: | |------------|----|----|----|----------------| | ADDFP | RA | RA | RX | ADDFP RB RB RX | | ADDFP | RX | RB | RY | ADDFP RX RA RY | | ADDFP | RY | RB | RZ | ADDFP RY RA RZ | (c) [0.8 pts] The performance of a processor pipeline depend on the pipeline depth of the processor. You are tasked with finding the optimal pipeline depth of a processor using the following model. Let $T_B$ represent the time that the processor is busy executing instructions, and $T_{NB}$ represent the time that the processor is not busy executing instructions. Non-busy time is due to pipeline stalls. The total time to execute a program (T) equals the busy time plus the non-busy time. $$T = T_B + T_{NB}$$ Busy time for a program execution on a scalar processor can be expressed as the number of instruction in a program $(N_I)$ multiplied by the time to pass through each pipeline stage $(t_S)$ . $$T_B = N_I t_S$$ The pipeline stage delay can be expressed in terms of the total logic delay of the processor $(t_P)$ , the latch delay $(t_0)$ , and the number of pipeline stages (p). $$t_S = t_O + t_P/p$$ A superscalar processor can execute multiple instructions at the same time. Let us define the average degree of superscalar processing (S), such that the busy time on a superscalar processor is expressed as follows. $$T_B = \frac{N_I}{S} t_S$$ The non-busy time can be expressed as the number of hazards ( $N_H$ ) multiplied by the average fraction ( $f_H$ ) of total pipeline delay ( $t_{PIPE}$ ) that the pipeline spends stalled when there is a hazard. $$T_{NB} = N_H f_H T_{PIPE}$$ Write an expression for the total pipeline delay ( $T_{PIPE}$ ) in terms of the total logic delay of the processor ( $t_P$ ), the latch delay ( $t_0$ ), and the number of pipeline stages (p). The total pipeline delay is the end-to-end pipeline delay including all logic stages and latches. $$T_{PIPE} = ?$$ Based on the formulations above, write an expression for the performance of the processor (TPI), where performance is defined as the total execution time of a program (T) divided by the number of instructions in the program. $$TPI = ?$$ As mentioned above, the performance of a processor pipeline depends on the pipeline depth. Find an expression for the performance-optimal pipeline depth ( $p_{opt}$ ) of the superscalar processor described above. In other words, find a simplified expression for the optimal pipeline depth that minimizes TPI. $$p_{opt} = ?$$ (d) [0.8 pts] Based on your expression for the optimal pipeline depth in part (c): Does the optimal pipeline depth increase or decrease for a program with more hazards? Explain the intuition behind your answer based on your knowledge of pipelined processors. Does the optimal pipeline depth increase or decrease as the latch delay decreases? Explain the intuition behind your answer. Does the optimal pipeline depth increase or decrease as the average degree of superscalar processing increases? Explain the intuition behind your answer. Does the optimal pipeline depth increase or decrease as the average fraction of the pipeline that stalls when there is a hazard decreases? Explain the intuition behind your answer. (e) [0.8 pts] The peak performance of a processor is often dictated by how fast the processor can bring data onto the chip and process the data. Consider a many-core processor with 256 scalar, in-order cores that operate at 1.4 GHz. Each core can perform up to two single-precision floating point operations per clock period (a floating point multiply-add operation). The peak off-chip memory (DRAM) bandwidth of the processor is 128 GB/s. What is peak single-precision floating point throughput of the processor in Floating Point Operations Per Second (FLOPS)? The following matrix multiplication kernel is used to calculate P=M\*N on the processor for 16x16 square matrices M and N. Each core runs the same kernel code for a different row and column of the output matrix. In other words, each core computes one element of the product matrix (P). Given the performance constraints listed above, what is the peak performance in FLOPS attainable by the kernel? Assume that only loads from off-chip memory (DRAM) count toward off-chip memory bandwidth usage (don't count the storing of the results). M and N are stored in off-chip memory (DRAM). ``` void MatrixMulKernel(float* M, float* N, float* P, int Row, int Col) { // Row is the row index of the product (P) element // computed by this core // Col is the column index of the product (P) element // computed by this core float Pvalue = 0; // each thread performs a scalar product to compute // one element of the product for(int k = 0; k < 16; ++k) Pvalue += M[Row*16+k] * N[k*16+Col]; P[Row*16+Col] = Pvalue; } ``` Peak performance for the kernel =? The many-core processor contains a fast on-chip memory. You decide to re-write the kernel as shown below to reduce the off-chip memory bandwidth usage in hopes of improving performance. What is the peak performance in FLOPS attainable by the new kernel? Again, assume that only loads from off-chip memory (DRAM) count toward off-chip memory bandwidth usage (don't count the storing of the results). M and N are stored in off-chip memory (DRAM), but M\_on and N\_on reside in on-chip memory. (The qualifier \_onchip\_ declares memory in the on-chip memory structure.) In the new kernel, the cores cooperate to load the input data from offchip memory into on-chip memory, then perform the matrix multiplication. ``` void MatrixMulKernel(float* M, float* N, float* P, int Row, int Col) { // declare memory in the on-chip storage __onchip__ float M_on[16][16]; onchip float N on[16][16]; float Pvalue = 0; // Collaborative loading of M and N into on-chip memory M_{on}[Row][Col] = M[Row*16 + Col]; N on[Row][Col] = N[Row*16 + Col]; // wait for all cores to finish loading // into the on-chip memory Synchronize cores(); // compute the product element for this core for(int k = 0; k < 16; ++k) Pvalue += M_on[Row*16+k] * N_on[k*16+Col]; P[Row*16+Col] = Pvalue; } ``` Peak performance for the new kernel = ?