#### GHC on the OpenSPARC T2

Ben Lippmeier, Australian National University

Haskell Implementors Workshop, 2009/08/05

- Funded by Sun Microsystems.
- Organised by:
  - Duncan Coutts, Roman Leshchinskiy, Darryl Gove.
- Make GHC work on SPARC (again)
- Why do we care?





# Multicore !!!

(shared memory symmetric multi-processing)

### The OpenSPARC T2



- Released Oct 2007
- 8 cores / processor
- 8 threads / core
- 64 threads / processor
- 4 MB L2 cache, 16-way associative
- 1165 MHz

#### One T2 Core



- Hardware per core:
  - 2 x ALU (Integer + Address)
  - 1 x FPU (Floating Point)
  - 1 x LSU (Load Store Unit)
- 8 stage integer pipeline
- 12 stage floating point pipeline
- No out-of-order execution
- No exploitation of instruction level parallelism (ILP)

#### One T2 Core



- Each thread has its own register set.
- Two instructions can be dispatched per cycle, each from different threads.
- Threads are intended to stall frequently.
- All threads on a core share the same L1 Cache.

#### OpenSPARC T2

1165 MHz \* 2 instrs/cycle/core \* 8 cores = 18.64 Gig instrs / s (in order)

#### Intel Core2 Duo

1600 MHz \* 4 instrs/cycle/core \* 2 cores = 12.80 Gig instrs / s (out of order)
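The peak-rate arithmetic above can be checked with a few lines of Haskell. This is only a sketch: the function name `peakGigInstrs` is made up for illustration, and the figures are theoretical peak issue rates, not measured throughput.

```haskell
-- Peak instruction issue rate, in units of 10^9 instructions per second.
-- Arguments: clock in MHz, instructions issued per cycle per core, cores.
-- The name is hypothetical; the numbers below are taken from the slide.
peakGigInstrs :: Double -> Double -> Double -> Double
peakGigInstrs clockMHz instrsPerCycle cores =
  clockMHz * instrsPerCycle * cores / 1000

main :: IO ()
main = do
  putStrLn ("OpenSPARC T2:    " ++ show (peakGigInstrs 1165 2 8))
  putStrLn ("Intel Core2 Duo: " ++ show (peakGigInstrs 1600 4 2))
```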

#### Out-of-order execution doesn't help us much...

| Instruction | Operands                   |
|-------------|----------------------------|
| `ld`        | `[%i0+4], …`               |
| `st`        | `%g1, [%i3-12]`            |
| `st`        | `%l2, [%i3-8]`             |
| …           | `[%i0+12], …`              |
| `st`        | `%g1, …`                   |
| `st`        | `%l1, …`                   |
| `add`       | `%i3, -24, %g1`            |
| `st`        | `%g1, [%i0+12]`            |
| `ld`        | `[%i0+8], %l1`             |
| `sethi`     | `%hi(s1rX_info), %g1`      |
| `or`        | `%g1, %lo(s1rX_info), %g1` |
| `st`        | `%g1, [%i0+8]`             |
| `add`       | `%i0, 8, %i0`              |
| `and`       | `%l1, 3, %g1`              |
| `cmp`       | `%g1, 0`                   |
| `bne`       | `.LclUn`                   |

- Lots of memory traffic => lots of cache misses
- Not much ILP (Instruction Level Parallelism)

#### Fixing the Native Code Generator

- GHC has had native code generation for
  - x86
  - x86\_64
  - Power PC
  - SPARC
  - Alpha
- All mashed into one module "MachCodeGen.hs"
- Support for various architectures has grown organically.
- Target architecture selected by a series of #ifdefs

```
#if i386_TARGET_ARCH || x86_64_TARGET_ARCH
...
#endif

#if sparc_TARGET_ARCH
getRegister (CmmLit (CmmFloat f W32))
  = do ...
#endif
```

- Hard to work on code for one platform without breaking others.
- All code for all platforms should be compiled all the time.
- Code for SPARC and PPC is now split into its own modules.
- Still need to untangle x86 from x86\_64.
- Move towards being a cross-compiler, and eliminate dependency on GCC for bootstrapping.
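One way to get "all code for all platforms compiled all the time" is to select the target with an ordinary value instead of the preprocessor. The sketch below illustrates the idea only. Every name in it (`Arch`, `CodeGen`, `selectCodeGen`, and so on) is hypothetical and is not GHC's actual API, and the "instructions" are just strings.

```haskell
-- All names here are hypothetical; this is not GHC's actual API.
data Arch = ArchX86 | ArchX86_64 | ArchPPC | ArchSPARC
  deriving (Show, Eq)

-- A per-architecture code generator. In a real compiler each value
-- would be defined in its own module.
data CodeGen = CodeGen
  { cgName      :: String
  , cgLoadFloat :: Float -> [String]  -- toy "instructions" as strings
  }

-- Build a placeholder code generator that only labels its output.
mkToyCodeGen :: String -> CodeGen
mkToyCodeGen n = CodeGen
  { cgName      = n
  , cgLoadFloat = \f -> [n ++ ": load float literal " ++ show f]
  }

x86CodeGen, ppcCodeGen, sparcCodeGen :: CodeGen
x86CodeGen   = mkToyCodeGen "x86"
ppcCodeGen   = mkToyCodeGen "PPC"
sparcCodeGen = mkToyCodeGen "SPARC"

-- Ordinary pattern matching replaces the #if chain, so every branch is
-- type-checked on every build, whatever the host platform.
selectCodeGen :: Arch -> CodeGen
selectCodeGen ArchX86    = x86CodeGen
selectCodeGen ArchX86_64 = x86CodeGen  -- shared, mirroring the x86 / x86_64 tangle
selectCodeGen ArchPPC    = ppcCodeGen
selectCodeGen ArchSPARC  = sparcCodeGen

main :: IO ()
main = mapM_ putStrLn (cgLoadFloat (selectCodeGen ArchSPARC) 1.5)
```

Because the target is a runtime value, the same compiler binary could in principle emit code for any listed architecture, which is the property a cross-compiler needs.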

#### The Instruction class

`class Instruction instr where`

| Method                | Type                                      |
|-----------------------|-------------------------------------------|
| `regUsageOfInstr`     | `instr -> RegUsage`                       |
| `patchRegsOfInstr`    | `instr -> (Reg -> Reg) -> instr`          |
| `isJumpishInstr`      | `instr -> Bool`                           |
| `jumpDestsOfInstr`    | `instr -> [BlockId]`                      |
| `patchJumpInstr`      | `instr -> (BlockId -> BlockId) -> instr`  |
| `mkJumpInstr`         | `BlockId -> [instr]`                      |
| `mkSpillInstr`        | `Reg -> Int -> Int -> instr`              |
| `mkLoadInstr`         | `Reg -> Int -> Int -> instr`              |
| `takeDeltaInstr`      | `instr -> Maybe Int`                      |
| `isMetaInstr`         | `instr -> Bool`                           |
| `mkRegRegMoveInstr`   | `Reg -> Reg -> instr`                     |
| `takeRegRegMoveInstr` | `instr -> Maybe (Reg, Reg)`               |
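To show how a type class like this lets architecture-independent passes be written once, here is a cut-down sketch. It keeps only three methods, with names following GHC's native code generator. The `Reg` and `RegUsage` definitions are simplified stand-ins, and `ToyInstr` and `dropNopMoves` are hypothetical, made up for this example.

```haskell
-- Simplified stand-ins for the real register types.
newtype Reg = Reg Int deriving (Show, Eq)

-- Registers read and registers written by one instruction.
data RegUsage = RU [Reg] [Reg] deriving (Show, Eq)

-- A cut-down version of the class: three methods only.
class Instruction instr where
  regUsageOfInstr     :: instr -> RegUsage
  patchRegsOfInstr    :: instr -> (Reg -> Reg) -> instr
  takeRegRegMoveInstr :: instr -> Maybe (Reg, Reg)

-- Hypothetical toy instruction set, not any real architecture.
data ToyInstr
  = Add Reg Reg Reg    -- Add src1 src2 dst
  | Mov Reg Reg        -- Mov src dst
  deriving (Show, Eq)

instance Instruction ToyInstr where
  regUsageOfInstr (Add a b d) = RU [a, b] [d]
  regUsageOfInstr (Mov s d)   = RU [s] [d]

  patchRegsOfInstr (Add a b d) f = Add (f a) (f b) (f d)
  patchRegsOfInstr (Mov s d)   f = Mov (f s) (f d)

  takeRegRegMoveInstr (Mov s d) = Just (s, d)
  takeRegRegMoveInstr _         = Nothing

-- A pass written only against the class: drop moves whose source and
-- destination are the same register.
dropNopMoves :: Instruction instr => [instr] -> [instr]
dropNopMoves = filter (not . isNop)
  where
    isNop i = case takeRegRegMoveInstr i of
      Just (s, d) -> s == d
      Nothing     -> False

main :: IO ()
main = do
  let rename (Reg 9) = Reg 1   -- e.g. a virtual register mapped to a real one
      rename r       = r
      code = [Mov (Reg 9) (Reg 1), Add (Reg 1) (Reg 2) (Reg 3)]
  print (dropNopMoves (map (`patchRegsOfInstr` rename) code))
```

`dropNopMoves` never mentions `ToyInstr`, so the same function would work for any instruction type that provides an instance.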

#### Benchmarking





- Embarrassingly parallel benchmark.
- Use Intel processors as the baseline.
- Almost linear speedup until we run out of hardware threads.
- No point using more Haskell threads than hardware threads.
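The last point can be sketched in code: ask the runtime how many capabilities it has and fork exactly that many worker threads. This is an illustration, not the benchmark from the talk. `chunk`, `parSumWith` and the choice of `totient` as the CPU-bound work item are assumptions made for the example.

```haskell
import Control.Concurrent (forkIO, getNumCapabilities)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forM)

-- Split a list into n interleaved chunks (n must be at least 1).
chunk :: Int -> [a] -> [[a]]
chunk n xs = [ every (drop k xs) | k <- [0 .. n - 1] ]
  where
    every []       = []
    every (y : ys) = y : every (drop (n - 1) ys)

-- Sum f over xs using one Haskell thread per chunk.
parSumWith :: Int -> (Int -> Int) -> [Int] -> IO Int
parSumWith workers f xs = do
  vars <- forM (chunk workers xs) $ \c -> do
    var <- newEmptyMVar
    -- ($!) forces the partial sum inside the worker thread.
    _ <- forkIO (putMVar var $! sum (map f c))
    return var
  sum <$> mapM takeMVar vars

-- An arbitrary CPU-bound work item (Euler's totient, computed naively).
totient :: Int -> Int
totient n = length [ k | k <- [1 .. n], gcd n k == 1 ]

main :: IO ()
main = do
  caps  <- getNumCapabilities   -- one worker per capability, no more
  total <- parSumWith caps totient [1 .. 2000]
  print (caps, total)
```

To run on several hardware threads, compile with `ghc -threaded` and pass the thread count at run time, for example `+RTS -N8`.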




#### partree: runtime(s) vs number of threads



- Not very parallel.
- Tiny speedups on Intel.
- No real speedup with more than 3 threads.
- Can't make full use of a whole T2 core.


#### Benchmarking Summary

# If you have fewer than 8 threads of work then stay home.










It's a "throughput" machine.

#### sumeuler: issue rate, data miss rate (Gig/s) vs time(s)



#### matmult: issue rate, data miss rate (Gig/s) vs time(s)





- Periods of high and low parallelism.
- Large variation run-to-run.
- Threads spend time blocked at join points?
- Can ThreadScope help debug this?

- We need more satisfying benchmarks.
- We haven't had 64 hardware threads before.
- Use ThreadScope to determine why matmult is behaving badly.
- Some simple compile-time instruction reordering could help.
   No out-of-order execution => pipeline stalls.
- Keep the build working!!
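As a rough illustration of the instruction-reordering point, the sketch below implements a tiny peephole pass over a toy instruction type. It assumes a simplified model in which a load's result is not available to the very next instruction; real T2 latencies differ. It tracks register dependencies only and ignores memory ordering and branches. All names (`Instr`, `fillLoadSlots`, and so on) are hypothetical.

```haskell
-- Toy instruction: registers are plain Ints. Hypothetical, not GHC's types.
data Instr = Instr
  { name   :: String
  , isLoad :: Bool
  , srcs   :: [Int]   -- registers read
  , dsts   :: [Int]   -- registers written
  } deriving (Show, Eq)

-- True when 'b' can be moved in front of 'a' without changing results
-- (register dependencies only; memory ordering is ignored in this toy).
independent :: Instr -> Instr -> Bool
independent a b =
     disjoint (dsts a) (srcs b)   -- b does not read what a writes
  && disjoint (srcs a) (dsts b)   -- b does not overwrite a's inputs
  && disjoint (dsts a) (dsts b)   -- no write-after-write conflict
  where
    disjoint xs ys = not (any (`elem` ys) xs)

-- True when 'user' reads a register that the load 'ld' has just written.
stallsAfter :: Instr -> Instr -> Bool
stallsAfter ld user = isLoad ld && any (`elem` srcs user) (dsts ld)

-- Number of adjacent load/use pairs in a sequence.
countStalls :: [Instr] -> Int
countStalls is = length (filter id (zipWith stallsAfter is (drop 1 is)))

-- Peephole pass: rewrite  load; use; other  to  load; other; use
-- when the swap is safe and does not create a new stall.
fillLoadSlots :: [Instr] -> [Instr]
fillLoadSlots (ld : use : other : rest)
  | stallsAfter ld use
  , independent use other
  , not (stallsAfter ld other)
  = ld : other : fillLoadSlots (use : rest)
fillLoadSlots (i : rest) = i : fillLoadSlots rest
fillLoadSlots []         = []

-- Example sequence, loosely modelled on a load followed by a tag test.
example :: [Instr]
example =
  [ Instr "ld"  True  [0] [1]   -- r1 <- mem[r0 + 8]
  , Instr "and" False [1] [2]   -- r2 <- r1 & 3   (uses the loaded value)
  , Instr "add" False [3] [3]   -- r3 <- r3 + 8   (unrelated work)
  ]

main :: IO ()
main = do
  print (countStalls example)
  print (map name (fillLoadSlots example))
  print (countStalls (fillLoadSlots example))
```

A production scheduler would also need to model stores, branches and the actual pipeline latencies; this pass only shows the shape of the transformation.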



#### http://ghcsparc.blogspot.com