## GHC on the OpenSPARC T2

Ben Lippmeier, Australian National University. FP-SYD, 2009/08/20.

```
result :: [Integer]
result = map (ack 2) [1..200]

ack :: Integer -> Integer -> Integer
ack 0 n = n + 1
ack n 0 = ack (n-1) 1
ack n m = ack (n-1) (ack n (m - 1))
```

```
import Control.Parallel.Strategies

-- Evaluate every list element in parallel, each to weak head normal form.
result :: [Integer]
result = map (ack 2) [1..200]
                `using` parList rwhnf

ack :: Integer -> Integer -> Integer
ack 0 n = n + 1
ack n 0 = ack (n-1) 1
ack n m = ack (n-1) (ack n (m - 1))
```

- Add parallel combinators => program runs faster.
- All locking / scheduling / work balancing done by RTS.
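The fragment above has no entry point, so here is a minimal sketch of a complete program. It assumes the `parallel` package is installed and uses `rseq`, the name later releases of that package give to `rwhnf`. The `main` driver and the checksum it prints are illustrative additions, not part of the talk.

```haskell
-- Assumes the `parallel` package, which provides Control.Parallel.Strategies.
import Control.Parallel.Strategies (parList, rseq, using)

ack :: Integer -> Integer -> Integer
ack 0 n = n + 1
ack m 0 = ack (m - 1) 1
ack m n = ack (m - 1) (ack m (n - 1))

-- Evaluate each list element to weak head normal form in its own spark.
result :: [Integer]
result = map (ack 2) [1 .. 200] `using` parList rseq

-- Illustrative driver: forces the whole list and prints a checksum.
main :: IO ()
main = print (sum result)
```

To use several capabilities, build with `ghc -O2 -threaded` and run with `+RTS -N8`; recent GHC releases also need `-rtsopts` at link time before they accept RTS options on the command line.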

## GHC parallel evaluation model



- A capability is a thread of the GHC runtime system that can evaluate parts of the program. There is one capability per CPU/hardware thread.
- Capability 0 holds the main thread, which evaluates the main routine.

- A spark represents an expression in the program that could potentially be evaluated in parallel.
- Sparks are created with the primitive Haskell operator `par`.

- When a capability is idle, it takes a spark from the spark pool and turns it into a running thread.
- All we need to do is create sparks and specify the number of capabilities to use. Great for irregular parallelism!
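The spark model can be sketched with nothing but `par` and `pseq`, which `base` exports from `GHC.Conc`. In the sketch below, `parMapWHNF` and `fib` are hypothetical names chosen for illustration; the helper creates one spark per list element and leaves it to the runtime to decide which capability runs each one.

```haskell
import GHC.Conc (getNumCapabilities, par, pseq)

-- Naive Fibonacci, used only as a stand-in for real work.
fib :: Int -> Integer
fib n
  | n < 2     = toInteger n
  | otherwise = fib (n - 1) + fib (n - 2)

-- Hypothetical helper: spark the head element, then walk the rest of the
-- list (creating the remaining sparks) before returning the result list.
parMapWHNF :: (a -> b) -> [a] -> [b]
parMapWHNF _ []       = []
parMapWHNF f (x : xs) = y `par` (ys `pseq` (y : ys))
  where
    y  = f x
    ys = parMapWHNF f xs

main :: IO ()
main = do
  -- Number of capabilities the RTS was started with (set by +RTS -N<n>).
  caps <- getNumCapabilities
  putStrLn ("capabilities: " ++ show caps)
  print (sum (parMapWHNF fib [20 .. 27]))
```

A spark that no idle capability picks up is not lost: the main thread evaluates the expression itself when it needs the value.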

## The OpenSPARC T2: Released October 2007



- 8 cores / processor × 8 threads / core = 64 threads / processor.
- Hardware per core:
  - 2 ALUs
  - 1 Load/Store Unit
  - 1 FP Unit
- In each cycle a core can dispatch 2 instructions.
- Threads on the same core share the same L1 cache.

## OpenSPARC T2 peak issue rate (in order)



## Intel Core2 Duo peak issue rate (out of order)

- Peak instruction issue rate:
  - 1600 M cycles / sec
  - × 4 instructions / cycle / core
  - × 2 cores / processor
  - = 12.80 G instr / sec
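The figure on this slide is a three-way product of clock rate, issue width, and core count. A small sketch of that arithmetic follows; the function and value names are hypothetical. The T2 variant takes the clock rate as a parameter because this text does not state it.

```haskell
-- Peak issue rate = clock rate * instructions per cycle per core * cores.
peakIssueRate :: Integer -> Integer -> Integer -> Integer
peakIssueRate clockHz issueWidth cores = clockHz * issueWidth * cores

-- Core2 Duo figures from the slide: 1600 MHz, 4-wide issue, 2 cores.
core2Duo :: Integer
core2Duo = peakIssueRate 1600000000 4 2

-- T2 figures from the hardware slide: 2 instructions per cycle per core,
-- 8 cores. The clock rate is an input, not a value taken from the talk.
t2 :: Integer -> Integer
t2 clockHz = peakIssueRate clockHz 2 8

main :: IO ()
main = print core2Duo -- 12800000000, i.e. 12.80 G instr / sec
```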


## Out-of-order execution doesn't help us much...

| Instruction | Operands                       |
|-------------|--------------------------------|
| `sethi`     | `%hi(s1p9_info), %g1`          |
| `or`        | `%g1, %lo(s1p9_info), %g1`     |
| `st`        | `%g1, [%i3-24]`                |
| `ld`        | `[%i0+8], %g1`                 |
| `st`        | `%g1, [%i3-16]`                |
| `ld`        | `[%i0+4], %g1`                 |
| `st`        | `%g1, [%i3-12]`                |
| `st`        | `%l2, [%i3-8]`                 |
| `ld`        | `[%i0+12], %g1`                |
| `st`        | `%g1, [%i3-4]`                 |
| `st`        | `%l1, [%i3]`                   |
| `add`       | `%i3, -24, %g1`                |
| `st`        | `%g1, [%i0+12]`                |
| `ld`        | `[%i0+8], %l1`                 |
| `sethi`     | `%hi(s1rX_info), %g1`          |
| `or`        | `%g1, %lo(s1rX_info), %g1`     |
| `st`        | `%g1, [%i0+8]`                 |
| `add`       | `%i0, 8, %i0`                  |
| `and`       | `…, %g1`                       |
| `cmp`       | `%g1, 0`                       |
| `bne`       | `.LclUn`                       |

- Lots of memory traffic => lots of cache misses.
- Not much Instruction Level Parallelism (ILP).

## Project: Make GHC work on the OpenSPARC T2

- Project funded by Sun Microsystems.
  - Organised by Duncan Coutts, Roman Leshchinskiy, Darryl Gove.
- As of 1st Jan 2009, GHC did not build at all on SPARC.
- Step 1: Fix the via-C build.
  - No buildbots for SPARC.
  - Existing SPARC build was entirely community supported.
- Step 2: Fix the Native Code Generator (NCG).
  - SPARC NCG hadn't worked for years.
  - Badly in need of cleaning up and refactoring.
- Step 3: Benchmarking and tuning.

## Benchmarking on the T2



Figure: `partree 300 100`, elapsed time (average of 5 runs).


## If you have fewer than 8 threads of work, then stay home.

## Instruction counts on Pentium M vs SPARC T2





8 threads on 1 core vs 1 thread per core

## Thread activity for sumeuler benchmark



## Thread activity for matmult benchmark



- Try to rewrite benchmarks to expose more parallelism. Until now we haven't been dealing with 64 hardware threads.
- Use ThreadScope to determine why we have periods of low activity in benchmarks like matmult.
- Some simple compile-time instruction reordering could help. The T2 core does no runtime reordering => pipeline stalls.
- Keep the build working!!
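One common way to expose more parallelism is to split the work into many more pieces than there are hardware threads, so that every capability can find a spark. The sketch below is only an illustration of that idea using `par` and `pseq` from `base`; `chunksOf` and `parSumChunked` are hypothetical helpers, not code from the benchmarks discussed in the talk.

```haskell
import GHC.Conc (par, pseq)

-- Split a list into pieces of at most n elements (n is clamped to >= 1).
chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf n xs = front : chunksOf n rest
  where
    (front, rest) = splitAt (max 1 n) xs

-- Hypothetical helper: sum each chunk in its own spark, then combine.
-- A smaller chunk size gives more sparks but more overhead per spark.
parSumChunked :: Int -> [Integer] -> Integer
parSumChunked n xs = go (map sum (chunksOf n xs))
  where
    go []       = 0
    go (s : ss) = s `par` (rest `pseq` (s + rest))
      where
        rest = go ss

main :: IO ()
main = print (parSumChunked 1000 [1 .. 1000000])
```

With a chunk size of 1000 and a million elements this creates 1000 sparks, comfortably more than the 64 hardware threads of a T2; the best chunk size for a real benchmark would have to be measured.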