(Solution) 1 ) Bandwidth-intensive application A runs on computer C. A's performance in ops/s on C equals the interconnection network's bisection bandwidth in...


Paper instructions

1) Bandwidth-intensive application A runs on computer C. A’s performance in ops/s on C equals the interconnection network’s bisection bandwidth in words/s. During execution, the processor aggregate dissipates 25 MWs, the memory aggregate dissipates 20 MWs, and the interconnection network dissipates a whopping 160 MWs. As a rule, for this network, power grows as (bisection bandwidth)^1.5. Appalled by the power consumption, the designers build computer D with half of C’s bisection bandwidth. By what factor has D improved on C’s energy efficiency, i.e., its figure of merit ops/J? K = 2^10

2) The 'M' machine is a memory-memory architecture. For example, its floating-point multiply instruction is: 'mmul.d a,b,c', meaning "take the 64-bit floating-point values starting at memory addresses 'b' and 'c', respectively, multiply them, and store the 64-bit floating-point result starting at memory address 'a'". The 'M' machine is implemented by a smaller, embedded 'J' machine. A program translates each 'M' machine instruction into one or more 'J' machine instructions, listed below.

    lw     r1,a      // load 32 bits starting at 'a'
    sw     r1,a      // store 32 bits starting at 'a'
    pack   f0,r1,r2  // pack two 'r' registers into one 'f' register
    unpack f0,r1,r2  // unpack one 'f' register into two 'r' registers
    mul.d  f0,f2,f4  // perform floating-point multiply f2 * f4

Write a J-machine program that implements 'mmul.d a,b,c'.

3) Imagine a computer with no cache, but with a reasonable-size register file. The computer has a single floating-point multiplier. The effect of these assumptions is that each floating-point multiply (operation) will, with probability 1, find one of its two operands in the register file, but will need its other operand delivered from memory, and this for each floating-point multiply. Let the floating-point multiplier have a peak performance of 16 GFs/s. At present, the achievable bandwidth from the memory to the processor is 7 GWs/s. (Here, 'W' stands for word, not Watt, and, by assumption, one word can hold one floating-point value). K = 10^3

a) [5 marks] Describe this situation as _compute bound_ or _bandwidth bound_.

b) [5 marks] We buy a second DRAM module and more interconnection links with the same aggregate capacities as the first. Describe this new situation as _compute bound_ or _bandwidth bound_.

c) What would it take to achieve sustained performance equal to the peak performance?
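
The arithmetic behind problems 1 and 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not an official solution. The function names are hypothetical. For problem 1 it assumes that halving the bisection bandwidth changes only the network's power and leaves processor and memory power unchanged, which the problem text does not state explicitly. For problem 3 it assumes every multiply needs exactly one word from memory, as the problem describes.

```python
def efficiency_gain(proc_mw, mem_mw, net_mw, bw_scale, exponent=1.5):
    """Ratio of ops/J on the rescaled machine to ops/J on the original.

    Assumption (not stated in the problem): processor and memory power
    stay the same; only network power changes with bisection bandwidth.
    """
    power_before = proc_mw + mem_mw + net_mw
    # Network power grows as (bisection bandwidth)^exponent.
    power_after = proc_mw + mem_mw + net_mw * bw_scale ** exponent
    # Performance equals bisection bandwidth, so it scales by bw_scale.
    return bw_scale * power_before / power_after


def sustained_rate(peak_gflops, mem_gwords, words_per_op=1.0):
    """Return (sustained Gflop/s, regime) for a simple two-limit model."""
    bandwidth_limit = mem_gwords / words_per_op
    rate = min(peak_gflops, bandwidth_limit)
    if bandwidth_limit < peak_gflops:
        regime = "bandwidth bound"
    elif bandwidth_limit > peak_gflops:
        regime = "compute bound"
    else:
        regime = "balanced"
    return rate, regime


if __name__ == "__main__":
    # Problem 1: D has half of C's bisection bandwidth.
    print(f"D vs C ops/J factor: {efficiency_gain(25, 20, 160, 0.5):.4f}")
    # Problem 3a: one memory system delivering 7 Gwords/s.
    print(sustained_rate(16, 7))
    # Problem 3b: two memory systems, 14 Gwords/s in aggregate.
    print(sustained_rate(16, 14))
```

Under these assumptions the factor in problem 1 comes out at roughly 1.009, so D is barely more energy efficient than C: the network power drops to about 56.6 MWs, but performance is halved as well. In problem 3 both configurations remain bandwidth bound, and the model reaches peak only when memory delivers 16 GWs/s.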