## Multi-core Architecture and Programming

Yang Quansheng (杨全胜), http://www.njyangqs.com/

School of Computer Science & Engineering, Southeast University

## **Introduction to Multi-Core**

### Content

- Why do we need multi-core?
- What is multi-core?
- Multi-core architecture
- Intel® dual-core introduction
- Challenges


## Why we need Multi-core

- When computers were first built, a program was simply a sequence of instructions.
- In the 1960s, a technique called concurrency let multiple users access a single mainframe computer simultaneously.
- In the early days of personal computers, operating systems were single-user systems, and only one program would run at a time.

Recent trends:

- More and more tasks need to execute simultaneously.
- Each task (application) is becoming more complex.
- The Internet is becoming more pervasive.
- Users expect high-performance processors.



- Parallel computing is the best way to meet the demand for high performance:
  - Computation requirements are ever increasing
    - visualization, distributed databases, simulations, scientific prediction (e.g. earthquakes), large-scale network activities, etc.
  - Silicon-based (sequential) architectures are reaching physical limits on processing speed, as they are constrained by:
    - the speed of light
    - thermodynamics
  - Hardware improvements such as pipelining and superscalar execution are not scaling well, and they require sophisticated compiler technology to extract performance from them.
  - Techniques such as vector processing work well only for certain kinds of problems.
  - Significant developments in networking technology are paving the way for network-based, cost-effective parallel computing.
    - InfiniBand
      - Today: 10 and 20 Gb/s node-to-node
      - 30 and 60 Gb/s switch-to-switch
  - Parallel computing is one of the best ways to overcome the speed bottleneck of a single processor.
  - A small cluster-based parallel computer offers a good price/performance ratio.
  - Parallel processing technology is mature and is being exploited commercially.



#### A brief history of micro-architecture evolution

[Figure: timeline of micro-architecture evolution. Recoverable labels: pipeline, in-order, superscalar, out-of-order, predication, VLIW, SMT, dual core, multi-core, CMP, data widths from 4 bits to 64 bits, and a "We are here" marker.]

- How to improve the performance of the processor
  - Delivered Performance = Frequency \* Instructions Per Cycle (IPC)
  - Raising the clock frequency is the traditional method, but it generates a great deal of heat.
  - Multi-core is another way: it increases the overall IPC of the chip.
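As a rough worked example of the formula above, the sketch below compares one faster core with two slower cores. The frequencies and IPC values are illustrative assumptions, not measurements, and the dual-core figure assumes perfect scaling across both cores, which real programs rarely achieve.

```python
def delivered_performance(frequency_hz, ipc):
    """Delivered performance (instructions per second) = frequency * IPC."""
    return frequency_hz * ipc


# Hypothetical single core: 3.0 GHz, 1.5 instructions per cycle.
single_core = delivered_performance(3.0e9, 1.5)

# Hypothetical dual core: each core at a lower 2.4 GHz with the same per-core IPC.
# Assumes both cores are kept fully busy, so the chip-level IPC doubles.
dual_core = delivered_performance(2.4e9, 2 * 1.5)

print(f"single core: {single_core / 1e9:.2f} G instructions/s")
print(f"dual core:   {dual_core / 1e9:.2f} G instructions/s")
print(f"speedup:     {dual_core / single_core:.2f}x")
```

Under these assumed numbers the dual-core chip delivers more work per second even though each core runs at a lower frequency, which is the trade-off the slide describes.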




### The benefits of Multi-core

- Better responsiveness
- Higher multithreaded throughput
- Benefits of parallel computing in mainstream applications
- Enhanced user experience in multitasking environments


## What is Multi-core

- A single processor package that contains two or more processor cores
- Two or more processors on a single die
- Homogeneous multi-core vs. heterogeneous multi-core
- Shared cache vs. on-chip interconnect

### Concurrency and parallel computing

- Concurrency: in software, a way to manage resources that are shared and used at the same time.
- Parallel computing: the simultaneous use of more than one computer or processor to execute a program. Ideally, parallel processing makes a program run faster because there are more engines (CPUs) running it. Even single-core computers can take part in parallel processing by connecting to other computers over a network.

### Simultaneous multithreading (SMT)

- Permits multiple independent threads to execute SIMULTANEOUSLY on the SAME core
- Weaves together multiple "threads" on the same core
- Example: if one thread is waiting for a floating-point operation to complete, another thread can use the integer units

### Parallelism

#### Instruction-level parallelism

- Parallelism at the machine-instruction level
- The processor can re-order, pipeline instructions, split them into micro-ops, do aggressive branch prediction, etc.
- Instruction-level parallelism enabled rapid increases in processor speeds over the last 15 years


#### Thread-level parallelism (TLP)

- This is parallelism on a coarser scale
- A server can serve each client in a separate thread (Web server, database server)
- A computer game can do AI, graphics, and physics in three separate threads
- Single-core superscalar processors cannot fully exploit TLP
- Multi-core architectures are the next step in processor evolution: explicitly exploiting TLP
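The "one thread per client" idea above can be sketched with Python's standard `concurrent.futures` thread pool. This is only a minimal illustration: `handle_client` and `serve_all` are hypothetical names, and the sleep stands in for real request handling. Note also that in CPython the global interpreter lock limits speedup for CPU-bound threads, so this sketch models I/O-bound clients.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def handle_client(client_id):
    """Stand-in for serving one client request (hypothetical workload)."""
    time.sleep(0.05)  # simulate waiting on I/O such as a disk or network
    return f"client {client_id} served by {threading.current_thread().name}"


def serve_all(client_ids, workers=4):
    """Serve each client in a worker thread and return replies in request order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(handle_client, client_ids))


if __name__ == "__main__":
    start = time.perf_counter()
    replies = serve_all(range(8))
    elapsed = time.perf_counter() - start
    for reply in replies:
        print(reply)
    print(f"8 requests took about {elapsed:.2f} s")
```

With four workers, the eight simulated requests overlap instead of running back to back, which is the coarse-grained parallelism that TLP refers to.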


### Multi-core processor is a special kind of a multiprocessor

- All processors are on the same chip
- Multi-core processors are MIMD: different cores execute different threads (Multiple Instructions), operating on different parts of memory (Multiple Data).
- A multi-core processor is a shared-memory multiprocessor: all cores share the same memory.

### What applications benefit from multi-core?

- Database servers
- Web servers (Web commerce)
- Compilers
- Multimedia applications
- Scientific applications, CAD/CAM

In general, applications with thread-level parallelism (as opposed to instruction-level parallelism) benefit.



## Multi-core Architecture

### Intel Core Duo

- Two physical cores in a package
- Each with its own execution resources
- Each with its own L1 cache
  - 32 KB instruction and 32 KB data
- Both cores share the L2 cache
  - 2 MB, 8-way set associative
  - 64-byte line size
  - 10 clock cycles latency
  - Write-back update policy



### AMD Opteron

- Separate 1 MB L2 caches
- CPU0 and CPU1 communicate through the SRQ (System Request Queue)





### Niagara from Sun

- 8 CPU cores
- Each core can run 4 threads simultaneously
- In Niagara 2, each core can run 8 threads simultaneously

### Cell from IBM, Toshiba and Sony

- 1 PPE (Power Processor Element)
- 8 SPEs (Synergistic Processor Elements)




## **Intel<sup>®</sup> Dual-core introduction**

### Key Features

- Two physical cores in a package
- Each core with its own execution resources
- Each core with its own L1 cache
  - 32 KB instruction and 32 KB data
- Both cores share the L2 cache
  - 2 MB
  - 10 clock cycles latency
  - Write-back update policy
- What is a write-back cache policy?
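To answer the question in general terms: with a write-back policy, a store updates only the cache line and marks it dirty, and main memory is updated later, when the line is evicted. With write-through, every store also goes to memory. The toy model below is a sketch under simplifying assumptions (one value per address, no capacity limit, hypothetical class and function names); it is not a model of the Intel L2 cache.

```python
class ToyCache:
    """One-value-per-address toy cache in front of a dict acting as main memory."""

    def __init__(self, memory, write_back):
        self.memory = memory          # backing store: address -> value
        self.write_back = write_back  # True: write-back, False: write-through
        self.lines = {}               # cached address -> value
        self.dirty = set()            # addresses modified but not yet in memory
        self.memory_writes = 0        # how many writes reached main memory

    def write(self, address, value):
        self.lines[address] = value
        if self.write_back:
            self.dirty.add(address)             # defer the memory update
        else:
            self._write_memory(address, value)  # update memory immediately

    def evict(self, address):
        """Drop a line; a dirty line must be written back to memory first."""
        if address in self.dirty:
            self._write_memory(address, self.lines[address])
            self.dirty.discard(address)
        self.lines.pop(address, None)

    def _write_memory(self, address, value):
        self.memory[address] = value
        self.memory_writes += 1


def count_memory_writes(write_back):
    """Store to one address 100 times, evict it, and report memory traffic."""
    cache = ToyCache(memory={}, write_back=write_back)
    for value in range(100):
        cache.write(0x40, value)
    cache.evict(0x40)
    return cache.memory_writes, cache.memory[0x40]


print("write-through:", count_memory_writes(write_back=False))
print("write-back:   ", count_memory_writes(write_back=True))
```

Both policies leave the same final value in memory, but the write-back cache reaches memory once instead of one hundred times, which is why it suits a shared L2 that sits in front of a slower memory bus.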

### New techniques

- Wide Dynamic Execution
- Advanced Digital Media Boost
- Smart Memory Access
- Advanced Smart Cache
- Intelligent Power Capability

### Wide Dynamic Execution

- Starts with instruction fetch
- Four (or more) instructions per cycle
- More than a 33% increase over other x86 processors
- Instructions are converted to micro-ops (uops)
  - About 1 uop per x86 instruction

Delivered Performance = Frequency \* Instructions Per Cycle (IPC)

Power = C<sub>dynamic</sub> \* V \* V \* Frequency

Fewer uops per instruction allow IPC to be increased while lowering C<sub>dynamic</sub> (fewer bits and less toggling).
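To make the power formula concrete, the sketch below evaluates it for a few made-up operating points. All capacitance, voltage, and frequency values are illustrative assumptions, not data for any real chip; the point is only how each term scales the result.

```python
def dynamic_power(c_dynamic, voltage, frequency_hz):
    """Power = C_dynamic * V * V * Frequency (the formula above)."""
    return c_dynamic * voltage * voltage * frequency_hz


# All numbers below are illustrative assumptions, not data for any real chip.
baseline = dynamic_power(c_dynamic=1.0e-9, voltage=1.2, frequency_hz=3.0e9)
less_toggling = dynamic_power(c_dynamic=0.8e-9, voltage=1.2, frequency_hz=3.0e9)
lower_voltage = dynamic_power(c_dynamic=1.0e-9, voltage=1.0, frequency_hz=3.0e9)

print(f"baseline:         {baseline:.2f} W")
print(f"20% lower C_dyn:  {less_toggling / baseline:.2f} of baseline")
print(f"1.2 V -> 1.0 V:   {lower_voltage / baseline:.2f} of baseline")
```

Power falls linearly with C<sub>dynamic</sub> but quadratically with voltage, so reducing the switched capacitance (for example through fewer uops) saves power without touching frequency, and therefore without reducing delivered performance.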

### Techniques for Micro-op Reduction

- ESP Tracker (Extended Stack Pointer)
  - Executes stack pointer updates in dedicated hardware
  - Intel® Core™ microarchitecture increases bandwidth by 33%\*
- Micro-op fusion (micro-fusion)
  - Single uop representation of a "multi-uop" instruction
  - Intel® Core™ microarchitecture increases the number of instructions\*
- Macro-fusion
  - New technique in Intel® Core™ microarchitecture

### Macro-Fusion

- Represents common x86 instruction pairs as a single micro-op
  - CMP or TEST + Conditional Branch (Jcc)
- Enhanced Arithmetic Logic Unit (ALU) for macro-fusion
  - Single dispatch efficiency
  - Single cycle execution performance
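The pairing rule above can be illustrated with a toy decoder pass over a list of assembly strings. This is a sketch under strong simplifications: `macro_fuse` is a hypothetical helper, every unfused instruction is counted as exactly one micro-op, and any CMP or TEST followed directly by any conditional jump is fused, whereas real hardware applies further restrictions.

```python
COMPARE_OPS = {"CMP", "TEST"}


def is_conditional_jump(mnemonic):
    """True for Jcc mnemonics such as JE or JNZ, but not the unconditional JMP."""
    return mnemonic.startswith("J") and mnemonic != "JMP"


def macro_fuse(instructions):
    """Return a list of micro-ops, fusing CMP/TEST + Jcc pairs into one entry."""
    uops = []
    i = 0
    while i < len(instructions):
        current = instructions[i]
        following = instructions[i + 1] if i + 1 < len(instructions) else None
        if (
            following is not None
            and current.split()[0] in COMPARE_OPS
            and is_conditional_jump(following.split()[0])
        ):
            uops.append(f"{current} + {following}")  # one fused micro-op
            i += 2
        else:
            uops.append(current)  # simplification: one micro-op per instruction
            i += 1
    return uops


program = [
    "MOV eax, [esi]",
    "CMP eax, ebx",
    "JE equal_label",
    "ADD eax, 1",
    "TEST eax, eax",
    "JNZ loop_label",
]
fused = macro_fuse(program)
print(len(program), "instructions ->", len(fused), "micro-ops")
for uop in fused:
    print("  ", uop)
```

In this example six instructions become four micro-ops, so fewer entries have to be dispatched and executed for the same work.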

### Intel® Advanced Digital Media Boost

[Figure: block diagram; recoverable label: up to 2 MB/4 MB shared L2 cache.]

### Intel® Smart Memory Access

[Figure: core pipeline diagram. Recoverable labels: Instruction Fetch and PreDecode, Instruction Queue, uCode ROM, Decode, Rename/Alloc, Retirement Unit (ReOrder Buffer), ALU, Branch, FAdd, FPmove, MMX/SSE, Memory Disambiguation, Improved Prefetchers.]

The goal:

- WHEN: ensure data can be used as early as possible
- WHERE: ensure the user of the data has it as close as possible

- Memory disambiguation: solving the problem of WHEN
  - Loads can decouple from stores
  - Loads that are predicted NOT to forward from a preceding store are allowed to schedule as early as possible
    - increasing the performance of out-of-order (OOO) memory pipelines
  - Disambiguated loads are checked at retirement
    - Extension to the existing coherency mechanism
  - Invisible to software and system


- Improved prefetchers: solving the problem of WHERE
  - Memory is too far away
  - Caches are closer when they have the data
  - Prefetchers detect an application's data reference patterns
  - And bring the data closer to the data consumer
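One simple way a prefetcher can "detect a reference pattern" is to watch for a constant stride between consecutive addresses. The toy model below is a sketch of that idea only; the `StridePrefetcher` class, its one-step lookahead, and the unbounded cache are all assumptions for illustration and do not describe Intel's actual prefetchers.

```python
class StridePrefetcher:
    """Toy prefetcher: after seeing the same stride twice, fetch one step ahead."""

    def __init__(self):
        self.last_address = None
        self.last_stride = None
        self.cache = set()   # addresses currently held close to the consumer
        self.hits = 0
        self.misses = 0

    def access(self, address):
        if address in self.cache:
            self.hits += 1
        else:
            self.misses += 1
            self.cache.add(address)  # demand fetch from far-away memory

        if self.last_address is not None:
            stride = address - self.last_address
            if stride != 0 and stride == self.last_stride:
                self.cache.add(address + stride)  # pattern detected: prefetch
            self.last_stride = stride
        self.last_address = address


prefetcher = StridePrefetcher()
for address in range(0, 64 * 10, 64):  # walk an array one 64-byte line at a time
    prefetcher.access(address)

print("hits:", prefetcher.hits, "misses:", prefetcher.misses)
```

Only the first three accesses miss while the stride is being learned; after that, each line is already in the cache by the time the program asks for it.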

### Advanced Smart Cache

- L2 can adapt to each core's load
- Fast data sharing
- No replicated data
- 2x bandwidth to L1 caches



### Intelligent Power Capability

- Ultra fine-grained power control
  - Even during periods of high-performance execution, many parts of the chip core can be shut off.
- Split buses
  - By splitting buses to deal with varying data widths, we gain the performance benefit of a wide bus while keeping C<sub>dynamic</sub> close to that of narrower buses.

#### Platformization of Power Management Architecture

- PSI-2 Power Status Indicator (Mobile)
- DTS Digital Thermal Sensors
- PECI Platform Environment Control Interface


## Challenge

- Challenges for the programmer
  - Moving from single-core programming to multi-core programming
  - Increasing program efficiency by exploiting the multi-core processor
- Challenges of parallel programming
  - Synchronization
  - Communication
  - Load balancing
  - Scalability
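Three of the challenges listed above show up even in a very small program. The sketch below counts even numbers with a thread pool: a lock provides synchronization, the shared counter is the communication channel, and splitting the input into many small chunks is a simple form of load balancing. All names (`SharedCounter`, `parallel_count_evens`) are hypothetical, and because of CPython's global interpreter lock this illustrates the structure of the problem rather than a real speedup.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class SharedCounter:
    """Counter shared by all threads; the lock serializes each update."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def add(self, amount):
        with self._lock:          # synchronization: one thread updates at a time
            self.value += amount


def count_evens(chunk, counter):
    """Count even numbers in one chunk and publish the partial result once."""
    local = sum(1 for n in chunk if n % 2 == 0)
    counter.add(local)            # communication through shared memory


def parallel_count_evens(numbers, workers=4, chunk_size=1000):
    counter = SharedCounter()
    # Load balancing: many small chunks let idle workers pick up remaining work.
    chunks = [numbers[i:i + chunk_size] for i in range(0, len(numbers), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda chunk: count_evens(chunk, counter), chunks))
    return counter.value


print(parallel_count_evens(list(range(10_000))))
```

Each worker accumulates a local count and takes the lock only once per chunk, which keeps contention low; taking the lock for every number would be correct but would scale poorly as cores are added.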
