# Speculations about Computer Architecture in the Next Three Years

shuchang.zhou@gmail.com Jan. 20, 2018

#### About me

https://zsc.github.io/

| Compiler Optimization | Machine Learning | Neural Network |
|-----------------------|------------------|----------------|
| <ul> <li>Source-to-source transformation</li> <li>Cache simulation</li> </ul> | <ul> <li>Natural Language Question &amp; Answer</li> <li>Indoor Navigation with INS</li> <li>Group Orbit Optimization</li> </ul> | <ul> <li>OCR</li> <li>Quantized Neural Network</li> <li>Smart Camera</li> <li>Reinforcement Learning</li> </ul> |

Timeline, left to right: 2007 to 2018.





## **Deep Learning Revolution in Vision & Speech**






# Implications of Deep Learning

- Unification of Algorithms in Vision & Speech
  - Deep Learning vs. "Traditional methods"
- Graph execution engine as the new platform
  - For CNN / RNN
- A new wave of data centers
  - Google / Facebook: millions of GPUs
  - Startups: thousands of GPUs
- Adjuncts to Neural Networks
  - Image augmentor
  - Simulators

## **Computation Stack**



[Figure: computation stack diagram; surviving label: Timing Closure]



#### How will this stack deal with changes?

#### Case study: Large Neural Networks

Characteristics: many channels + side-branches + many layers



- On-chip memory for caching feature maps
- Systolic array
- Large page table
- …able Auto-SIMD
- C++ source
- C++ parser
- Static analysis + dynamic profiling for kernel selection + execution plan
- Computation graph engine



#### Case study: Small Neural Networks

Characteristics: few channels + 1x1 convolutions



The lack of shortcut connections hurts its transfer learning ability.

The unique shuffle operation slows its adoption.



- Auto-SIMD
- On-chip memory may be more important.
- Different batching
- Lower overhead
- Computation graph engine
- Fusion of layers + handcrafted kernels
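Layer fusion can be pictured as collapsing two passes over a feature map into one, so the intermediate result never has to be written back to memory. The following is only a minimal sketch in plain Haskell, using lists rather than real tensors; `affine`, `relu`, `unfused` and `fused` are hypothetical names chosen for illustration and do not come from any particular graph engine.

```haskell
-- Hypothetical illustration of layer fusion on a flat feature map.
-- Lists stand in for tensors; a real engine would fuse generated kernels.

-- Per-element affine layer (for example a folded scale-and-shift).
affine :: Double -> Double -> Double -> Double
affine scale shift x = scale * x + shift

-- ReLU activation.
relu :: Double -> Double
relu = max 0

-- Unfused: two passes, with an intermediate feature map between them.
unfused :: Double -> Double -> [Double] -> [Double]
unfused s b xs = map relu (map (affine s b) xs)

-- Fused: one pass, each element goes through both layers at once.
fused :: Double -> Double -> [Double] -> [Double]
fused s b = map (relu . affine s b)
```

Both versions compute the same values; the fused one simply has no intermediate buffer, which is why fusion matters more when on-chip memory is the scarce resource.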

# When a Neural Network Designer, a Computer Architect, a Compiler Expert and an OS Guru meet

- Designer wants
  - A reliable performance model
    - Open architecture design and assembly/microcode level exposure
  - Better profilers for runtime diagnostics and analyzers
  - Support for sparse matrices and dynamic operations
- Architect wants
  - Batch operations with constant delays
  - Regular memory access patterns with locality and plenty of reuse
  - Streamlined memory/computation usage, no overwhelming peaks
  - Fewer operators
- Compiler Expert and OS Guru want
  - To broker between the Designer and the Architect
    - Have a slow fallback for bizarre operators
    - Cut peaks

# Case study: Quantum Computing Simulator on FPGA

Clash/FPGA: implementing complex numbers

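```
import Clash.Prelude

-- Assumption: the slide does not define RR; signed fixed point is one option.
type RR = SFixed 8 8
-- A complex number as a 2-vector: real part, then imaginary part.
type CC = Vec 2 RR

c0, c1 :: CC
c0 = 0 :> 0 :> Nil
c1 = 1 :> 0 :> Nil

sqrNorm :: CC -> RR
sqrNorm (a :> b :> Nil) = a * a + b * b

cadd :: CC -> CC -> CC
cadd = zipWith (+)

cmul :: CC -> CC -> CC
cmul (a :> b :> Nil) (c :> d :> Nil) = (a * c - b * d) :> (a * d + b * c) :> Nil

-- Complex dot product (no conjugation), then a row-by-row matrix-vector product.
dotProduct xs ys = foldr cadd c0 (zipWith cmul xs ys)

matrixVector m v = map (`dotProduct` v) m
```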
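To show how `matrixVector` could serve a quantum simulator, here is a minimal sketch that applies a single-qubit gate to a state vector. It mirrors the definitions above in plain Haskell, with lists and `Double` standing in for Clash's `Vec` and `RR` so that it runs under stock GHC. The names `hadamard`, `ket0` and `probabilities` are assumptions made for this example and do not appear on the slides.

```haskell
-- Plain-Haskell mirror of the Clash definitions (lists and Double instead of
-- Vec and RR). Runs with stock GHC; not synthesizable as written.

type CC = (Double, Double)  -- (real part, imaginary part)

c0 :: CC
c0 = (0, 0)

sqrNorm :: CC -> Double
sqrNorm (a, b) = a * a + b * b

cadd :: CC -> CC -> CC
cadd (a, b) (c, d) = (a + c, b + d)

cmul :: CC -> CC -> CC
cmul (a, b) (c, d) = (a * c - b * d, a * d + b * c)

dotProduct :: [CC] -> [CC] -> CC
dotProduct xs ys = foldr cadd c0 (zipWith cmul xs ys)

matrixVector :: [[CC]] -> [CC] -> [CC]
matrixVector m v = map (`dotProduct` v) m

-- Hypothetical example gate: the single-qubit Hadamard matrix.
hadamard :: [[CC]]
hadamard = [[(h, 0), (h, 0)], [(h, 0), (-h, 0)]]
  where h = 1 / sqrt 2

-- Hypothetical example state: the basis state |0>.
ket0 :: [CC]
ket0 = [(1, 0), (0, 0)]

-- Measurement probabilities are the squared norms of the amplitudes.
probabilities :: [CC] -> [Double]
probabilities = map sqrNorm
```

Evaluating `probabilities (matrixVector hadamard ket0)` gives two values of about 0.5 each, and applying `hadamard` twice returns the state to `ket0` up to rounding. Simulating a gate is thus one matrix-vector product over complex numbers, which is the kernel the Clash code describes in hardware.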


HLS may be sufficiently efficient and flexible



#### A possible future

# **Design Silicon Compiler!**

# How Google and Qualcomm exploit real world HLS and HLV

By Paul Dempsey | Posted: June 1, 2016

By taking a pragmatic approach, the two technology giants have comfortably adopted high-level synthesis and verification – and have shared their experiences.

#### Case study: Reinforcement Learning

Characteristics: requires fast & complex simulations





A human skeleton model for modeling locomotion tasks.

GTA 5 and AirSim: simulation for self-driving cars/ADAS and drones.

A typical CPU workload, but it needs to integrate with the neural network accelerator.



#### A possible future

## **Revival of Compiler Optimizations!**



Should we prepare a benchmark of simulators?

# The Age of Instant Response

- Old School
  - Compiler cannot change code
  - Developer as the dictator
  - Batch operation and buffering
  - Conference & Journal
- New School
  - Compiler can offer suggestions
  - User Community
    - User code contributions
    - Peer-to-peer help
  - Low latency is critical
  - arXiv & http://www.arxiv-sanity.com/

#### NUMBER OF YEARS IT TOOK FOR EACH PRODUCT TO GAIN 50 MILLION USERS:

| Product     | Time to 50 million users |
|-------------|--------------------------|
| Airlines    | 68 yrs                   |
| Automobiles | 62 yrs                   |
| Telephone   | 50 yrs                   |
| Electricity | 46 yrs                   |
| Credit Card | 28 yrs                   |
| Television  | 22 yrs                   |
| ATM         | 18 yrs                   |
| Computer    | 14 yrs                   |
| Cell Phone  | 12 yrs                   |
| Internet    | 7 yrs                    |
| iPods       | 4 yrs                    |
| Youtube     | 4 yrs                    |
| Facebook    | 3 yrs                    |
| Twitter     | 2 yrs                    |
| Pokémon Go  | 19 days                  |

#### The combined future ...

#### Performance critical



Circuit to Generate Fibonacci Series (Fig. 13)

#### Backup after this slide