MIPS assembly code assignment

Q1 (3 points): For this problem, we will use the following loop:

for (k=n; k>=0; k--) x[k] = y[k+1] - 2.0 * y[k];

If we assume:

R1 contains the address of the nth element of y
R2 contains the address of the nth element of x
F0 contains 2.0

The above code could be written as the following MIPS assembly code:

Loop:   LD      F2, 8(R1)
        LD      F4, 0(R1)
        MULTD   F6, F4, F0
        SUBD    F8, F2, F6
        SD      F8, 0(R2)
        SUBI    R1, R1, #8      #same as DADDUI R1, R1, #-8
        SUBI    R2, R2, #8
        BNEZ    R2, Loop
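
As a cross-check on what the loop body computes, here is a minimal Python model of the C loop. The function name `reference_loop` and the sample values are illustrative only and not part of the assignment. The comments map each array access to the memory operand used in the assembly.

```python
def reference_loop(y, n):
    """Model of: for (k=n; k>=0; k--) x[k] = y[k+1] - 2.0 * y[k];"""
    x = [0.0] * (n + 1)
    for k in range(n, -1, -1):
        # y[k+1] corresponds to 8(R1), y[k] to 0(R1), x[k] to 0(R2)
        x[k] = y[k + 1] - 2.0 * y[k]
    return x

# y needs n+2 elements because the loop reads y[n+1] on its first pass.
result = reference_loop([1.0, 3.0, 10.0], 1)
print(result)
```

A model like this can be used to confirm that a hand-unrolled version still produces the same values of x.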

Use the following table of instruction latencies:

Instruction/Operation Type	Latency in Clock Cycles
Double Load	1
Double Store	0
FP Multiply	5
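
The sketch below shows one way to read the table when building a schedule by hand. It assumes an in-order, single-issue pipeline and interprets each latency as the number of stall cycles needed between a producing instruction and a dependent one; that interpretation, the function name `schedule`, and the tuple layout are assumptions made for illustration. Only the first three instructions of the loop are used, because they depend solely on latencies the table lists.

```python
def schedule(instrs, latency):
    """In-order, single-issue timeline.

    instrs: list of (text, kind, dest, sources) in program order.
    latency: kind -> stall cycles before a dependent instruction may issue
             (an assumed reading of the latency table).
    """
    ready = {}      # register -> earliest cycle a dependent instruction may issue
    timeline = []   # (cycle, text) entries, with "stall" marking bubbles
    cycle = 0
    for text, kind, dest, sources in instrs:
        earliest = cycle + 1
        issue = max([earliest] + [ready.get(r, 0) for r in sources])
        for c in range(earliest, issue):
            timeline.append((c, "stall"))
        timeline.append((issue, text))
        if dest is not None:
            ready[dest] = issue + latency[kind] + 1
        cycle = issue
    return timeline

latency = {"load": 1, "fmul": 5}
loop_head = [
    ("LD F2, 8(R1)", "load", "F2", ["R1"]),
    ("LD F4, 0(R1)", "load", "F4", ["R1"]),
    ("MULTD F6, F4, F0", "fmul", "F6", ["F4", "F0"]),
]
timeline = schedule(loop_head, latency)
for cycle, text in timeline:
    print(cycle, text)
```

Under these assumptions the load of F4 in cycle 2 forces one stall before MULTD can issue in cycle 4. Extending the list to the rest of the loop requires a latency for every remaining operation type.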

(1) Show the cycle-by-cycle schedule, including stalls, of the unmodified loop on a fully pipelined machine.

Cycle Instruction/stall

(2) Unroll the loop 3 times but do not reschedule the instructions. Ignore the delay slot. Do not delete any instructions other than loop overhead instructions.

Cycle Instruction/stall

(3) Unroll the loop 3 times and reschedule the instructions to reduce the number of stalls. Ignore the delay slot. Do not delete any instructions other than loop overhead instructions.

Cycle Instruction/stall

(4) What is the speedup of the unrolled loop in (2) over the unmodified case in (1)? What is the speedup of the unrolled and scheduled loop in (3) over the unmodified case in (1)? Please show your calculation.
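
One common way to set up this comparison is per original loop iteration. The symbols below are introduced here for illustration and are not part of the assignment: C_1, C_2 and C_3 are the cycle counts for one pass through the loop body in (1), (2) and (3). Each pass of an unrolled body covers three original iterations, so its cycle count is divided by 3 before comparing.

```latex
\text{Speedup}_{(2)} = \frac{C_1}{C_2 / 3} = \frac{3\,C_1}{C_2},
\qquad
\text{Speedup}_{(3)} = \frac{C_1}{C_3 / 3} = \frac{3\,C_1}{C_3}
```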

Q2 (7 points): Show scheduling of the following code:

L.D F2, 0(R2)
L.D F4, 100(R3)
ADD.D F8, F2, F2
MUL.D F6, F4, F8
SUB.D F6, F2, F4
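
Before filling in either schedule it can help to list the register dependences in the code. The following is a minimal sketch of a pairwise hazard check. The function name `find_hazards` and the tuple layout are hypothetical, and the check is deliberately conservative: it compares every pair of instructions in program order and does not account for intervening writes.

```python
def find_hazards(instrs):
    """instrs: list of (text, dest, sources) in program order.

    Returns (kind, earlier_index, later_index, register) tuples.
    """
    hazards = []
    for j, (_, dest_j, srcs_j) in enumerate(instrs):
        for i in range(j):
            _, dest_i, srcs_i = instrs[i]
            if dest_i is not None and dest_i in srcs_j:
                hazards.append(("RAW", i, j, dest_i))   # later reads earlier result
            if dest_j is not None and dest_j == dest_i:
                hazards.append(("WAW", i, j, dest_j))   # both write same register
            if dest_j is not None and dest_j in srcs_i:
                hazards.append(("WAR", i, j, dest_j))   # later overwrites a source
    return hazards

code = [
    ("L.D F2, 0(R2)", "F2", ["R2"]),
    ("L.D F4, 100(R3)", "F4", ["R3"]),
    ("ADD.D F8, F2, F2", "F8", ["F2", "F2"]),
    ("MUL.D F6, F4, F8", "F6", ["F4", "F8"]),
    ("SUB.D F6, F2, F4", "F6", ["F2", "F4"]),
]
hazards = find_hazards(code)
for kind, i, j, reg in hazards:
    print(f"{kind} on {reg}: '{code[i][0]}' -> '{code[j][0]}'")
```

The two structures differ in how they treat these hazard types, which is what the final part of this question asks about.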

(3 points) Using a scoreboard. Assume one integer ALU, two FP multipliers, one FP adder and one FP divider. The integer ALU takes one execution cycle, the FP multipliers take 7 cycles, the FP adder takes 4 cycles and the FP divider takes 25 cycles.

(3 points) Using Tomasulo’s algorithm. Assume two LOAD units, two FP multipliers and three FP adders. A load unit takes one execution cycle for address calculation and a second one for memory access, the FP multipliers take 7 cycles and the FP adders take 4 cycles.

(1 point) Comment on which structure (scoreboard or Tomasulo’s) provides the shorter execution time and why (how the sources of slowdown in one structure are avoided by the “better” structure).