# MIPS assembly code assignment

Q1 (3 points) : For this problem, we will use the following loop:

for (k=n; k>=0; k--) x[k] = y[k+1] – 2.0 * y[k];

If we assume:

• R1 contains the address of the nth element of y
• R2 contains the address of the nth element of x
• F0 contains 2.0

The above code could be written as the following MIPS assembly code:

 Loop: LD F2, 8(R1) LD F4, 0(R1) MULTD F6, F4, F0 SUBD F8, F2, F6 SD F8, 0(R2) SUBI R1, R1, #8 #same as DADDUI R1, R1, #-8 SUBI R2, R2, #8 BNEZ R2, Loop

Using the following table for instruction latencies

 Instruction/Operation Type Latency in Clock Cycles Double Load 1 Double Store 0 FP Multiply 5
• Show the (cycle) schedule, including stalls of the unmodified loop on a fully pipelined machine.

Cycle Instruction/stall

• Unroll the loop 3 times but do not reschedule the instructions. Ignore the delay slot. Do not delete any instructions other than loop overhead instructions.

Cycle Instruction/stall

• Unroll the loop 3 times and reschedule the instructions to reduce the number of stalls. Ignore the delay slot. Do not delete any instructions other than loop overhead instructions.

Cycle Instruction/stall

• What is the speedup of the unrolled loop in (2) from the unmodified case in (1)? What is the speedup of the unrolled and scheduled loop in (3) from the unmodified case in (1). Please show your calculation

Q2 : (7 points) Show scheduling of the following code: L.D F2, 0(R2)

L.D F4, 100(R3)

MUL.D F6, F4, F8

SUB.D F6, F2, F4

1. (3 points) Using scoreboard. Assume one integer ALU, two FP multipliers, one FP adder and one FP divider. Integer ALU takes one execution cycle, FP multipliers take 7 cycles, FP adder takes 4 cycles and FP divider

takes 25 cycles.

1. (3 points) Using Tomasulo’s algorithm. Assume two

LOAD units, two FP multipliers and three FP adders. Load unit takes one execution cycle for address calculation and a second one for memory access, FP

multipliers take 7 cycles and FP adders take 4 cycles.

1. (1 point) Comment what structure (Scoreboard or Tomasulo’s) provides shorter execution time and why

(how are the sources of slowdown in one

structure avoided by the “better” structure).