# **Inhale**: Enabling High-Performance and Energy-Efficient In-SRAM Cryptographic Hash for IoT

Jingyao Zhang, Elaheh Sadredini

2022 IEEE/ACM International Conference on Computer-Aided Design



### Why IoT security is crucial

IoT devices are deployed in:

- Healthcare industry
- Home
- Government

Example devices: smart lock, security camera
Recent headlines:

- "How your smart home devices can be turned against you" (Crime)
- "'Internet of things' or 'vulnerability of everything'? Japan will hack its own citizens to find out" (By James Griffiths, CNN, published February 1, 2019)
- "Somebody's Watching: Hackers Breach Ring Home Security Cameras". Unnerved owners of the devices reported recent hacks in four states. The company reminded customers not to recycle passwords and user names.

## IoT attacks are happening now! IoT security is urgent!



- IoT security relies heavily on data integrity to authenticate identity
- In engineering practice, cryptographic hash algorithms are adopted for data integrity
  - Practically infeasible to invert or reverse the hash computation
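As a minimal illustration of the hash property listed above, the sketch below uses Python's standard `hashlib` with SHA3-256 (SHA-3 is the hash family discussed later in the deck). The helper name `sha3_digest` and the sample messages are hypothetical, chosen only for this example.

```python
import hashlib


def sha3_digest(message: bytes) -> str:
    """Return the SHA3-256 digest of `message` as a hex string."""
    return hashlib.sha3_256(message).hexdigest()


# Two messages that differ in a single character (hypothetical sample data).
original = sha3_digest(b"unlock front door")
tampered = sha3_digest(b"unlock front doos")

# The digests have a fixed length and differ even for a tiny change in the input.
print(original)
print(tampered)
```

Because any change to the message yields a different digest, a receiver can detect modification by recomputing the hash.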

Hashing can provide Data Integrity and Identity Authentication

- Alice and Bob establish a mutual Secret Key with a key encapsulation mechanism (KEM)
- Alice combines Message + Secret Key to create a Digest by hashing
- Bob verifies by calculating the hash of Message + Secret Key
  - Message was not modified in transit: Integrity
  - Alice had the identical Secret Key: Authentication
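The Alice/Bob exchange above can be sketched as follows. This is a simplified illustration, not the deck's protocol: the key is a hard-coded placeholder rather than the output of a KEM, the concatenation order (key, then message) is an assumption, and `make_digest` and `verify` are hypothetical helper names. Deployed systems would normally use a standardized keyed construction such as HMAC or KMAC.

```python
import hashlib
import hmac


def make_digest(message: bytes, secret_key: bytes) -> bytes:
    """Alice: hash the secret key together with the message (order is an assumption)."""
    return hashlib.sha3_256(secret_key + message).digest()


def verify(message: bytes, secret_key: bytes, digest: bytes) -> bool:
    """Bob: recompute the digest and compare in constant time."""
    return hmac.compare_digest(make_digest(message, secret_key), digest)


# Placeholder key; in the slides it comes from a key encapsulation mechanism.
shared_key = b"example-shared-secret"
message = b"temperature=21.5"
digest = make_digest(message, shared_key)

print(verify(message, shared_key, digest))              # unmodified message, same key
print(verify(b"temperature=99.9", shared_key, digest))  # message changed in transit
print(verify(message, b"some-other-key", digest))       # sender held a different key
```

A successful check tells Bob both that the message was not modified and that the sender held the same key.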



### Cryptographic hash algorithm

Example protocols shown in the figures:

- Transport Layer Security in IoT (Amazon IoT Core)
- Quantum-resistant TLS in IoT

- Attackers can effortlessly obtain physical access to edge devices
- Hardware resources are highly constrained in IoT devices
- Performance matters, especially in the Internet of Vehicles
- Energy consumption matters since IoT devices are battery-powered

## Demand for low-latency, high-throughput and energy-efficient hashing in IoT devices

#### Dedicated hardware engine on chip (ISSCC'16)

- Low throughput
- High area overhead on chip

#### General-purpose in-memory acceleration (JSSC'18)

- High latency
- Low throughput per unit area

#### Dedicated in-memory acceleration (ISLPED'19)

- High area overhead
- Low generality

## Demand for low-latency, high-throughput, energy-efficient, low-overhead hashing in IoT





#### On-chip Hashing -> high security level

- Perform all the operations within the chip (trusted computing base)

#### Bitline Computing -> high throughput

- Repurpose SRAM subarrays into active large vector computation units

#### Shift-optimized Data Alignment -> low latency, energy

- Implicitly perform inter-lane shift operations via the controller

#### In-Place Read/Write Strategy -> low overhead

- Carefully design read/write order and addresses to save memory capacity

Inhale achieves **up to 14x** higher throughput-per-area and **172x** higher throughput-per-area-per-energy than the state of the art



#### **Bitline Computing [1]**

- Activate two wordlines simultaneously
- Inherently perform logic operations
  - NOR
  - AND
- Additionally support other logic operations
  - XOR
- Provide high parallelism
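The bitline operations listed above can be modeled in software as bitwise operations across two whole rows. The sketch below is a functional model only: rows are Python integers, the 64-bit row width is an assumption, and deriving XOR as NOR(AND, NOR) is one standard Boolean identity, not necessarily how the hardware in [1] implements it. All function names are hypothetical.

```python
ROW_BITS = 64                      # assumed row width for this model
ROW_MASK = (1 << ROW_BITS) - 1


def bitline_and(row_a: int, row_b: int) -> int:
    """AND of two simultaneously activated rows, one result bit per bitline."""
    return row_a & row_b & ROW_MASK


def bitline_nor(row_a: int, row_b: int) -> int:
    """NOR of two simultaneously activated rows, one result bit per bitline."""
    return ~(row_a | row_b) & ROW_MASK


def bitline_xor(row_a: int, row_b: int) -> int:
    """XOR built from the two native results: XOR = NOR(AND, NOR)."""
    return bitline_nor(bitline_and(row_a, row_b), bitline_nor(row_a, row_b))


# Every bit position is processed at once, which is where the parallelism comes from.
a, b = 0b1100, 0b1010
print(bin(bitline_and(a, b)))
print(bin(bitline_xor(a, b)))
```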



- 76% of SHA-3 operations are shifts in a vanilla PIM architecture
- 90% of those shift operations are inter-lane

Figure axis: x=0, x=1, x=2, x=3, x=4



### **Prior works**

#### **Existing Data Alignments**

- JSSC'18:
  - highly utilizes the parallelism
  - hard for inter-lane and intra-lane shifts
  - two-lane XOR: first an inter-lane shift, then one XOR

| Lane A       | Lane B | Lane C | Lane D | Lane E |
|--------------|--------|--------|--------|--------|
| Lane F       | Lane G | Lane H | Lane I | Lane J |
| Lane K       | Lane L | Lane M | Lane N | Lane O |
| Lane P       | Lane Q | Lane R | Lane S | Lane T |
| Lane U       | Lane V | Lane W | Lane X | Lane Y |
| Intermediate |        |        |        |        |

SRAM subarray (JSSC'18)

- ISCA'18:
  - shifts implicitly
  - high latency (>10x JSSC'18)

SRAM subarray (ISCA'18)





## Shift-optimized Data Alignment

- Place one lane per row
- Inter-lane shifts are costless with the controller

Layouts compared in the figure:

- JSSC'18: five lanes per 320-bit row (Lanes A to Y plus an Intermediate row); first an inter-lane shift, then one XOR
- ISCA'18: lanes laid out 1 bit wide
- Proposed Inhale: one lane per row
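To illustrate why a lane-per-row alignment removes inter-lane shifts, the sketch below contrasts the two layouts in a software model. It assumes 64-bit lanes (five lanes per 320-bit row in the packed layout); the function names, the lane values, and the "bit positions shifted" count are hypothetical illustration devices, not the paper's cost model.

```python
LANE_BITS = 64
LANE_MASK = (1 << LANE_BITS) - 1


def pack_rows(lanes):
    """Packed layout: five 64-bit lanes side by side in each 320-bit row."""
    rows = []
    for y in range(5):
        row = 0
        for x in range(5):
            row |= lanes[5 * y + x] << (LANE_BITS * x)
        rows.append(row)
    return rows


def packed_two_lane_xor(rows, y_a, x_a, y_b, x_b):
    """XOR two lanes in the packed layout; returns (result, bit positions shifted)."""
    shift = LANE_BITS * (x_b - x_a)
    row_b = rows[y_b]
    # Lane b must first be moved under lane a: this is the inter-lane shift.
    aligned = row_b >> shift if shift >= 0 else row_b << -shift
    result = ((rows[y_a] ^ aligned) >> (LANE_BITS * x_a)) & LANE_MASK
    return result, abs(shift)


def lane_per_row_xor(rows, addr_a, addr_b):
    """XOR two lanes when each lane has its own row: only two row addresses are needed."""
    return rows[addr_a] ^ rows[addr_b], 0


# 25 distinct sample lanes (arbitrary values).
lanes = [((i + 1) * 0x0123456789ABCDEF) & LANE_MASK for i in range(25)]

packed_result, packed_shift = packed_two_lane_xor(pack_rows(lanes), 0, 0, 0, 4)  # Lane A with Lane E
row_result, row_shift = lane_per_row_xor(lanes, 0, 4)
print(packed_shift, row_shift)
```

In the lane-per-row model, choosing which lanes to combine is purely an addressing decision, which matches the slide's point that the controller handles inter-lane shifts implicitly.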











### In-place read/write strategy

- Read/write order and addresses are carefully designed to save memory capacity and to keep the solution general across varied IoT devices


























One round of SHA-3

$\begin{array}{l} CT_{4} = XOR(E_{0}, J_{0}, O_{0}, T_{0}, Y_{0}) \\ CT_{4}^{*} = rot(CT_{4}, 1) \\ CT_{1} = XOR(B_{0}, G_{0}, L_{0}, Q_{0}, V_{0}) \\ FT_{0} = XOR(CT_{1}, CT_{4}^{*}) \end{array}$

$\begin{array}{l} CT_{1}^{*} = rot(CT_{1}, 1) \\ CT_{3} = XOR(D_{0}, I_{0}, N_{0}, S_{0}, X_{0}) \\ FT_{2} = XOR(CT_{3}, CT_{1}^{*}) \end{array}$

$\begin{array}{l} CT_{3}^{*} = rot(CT_{3}, 1) \\ CT_{0} = XOR(A_{0}, F_{0}, K_{0}, P_{0}, U_{0}) \\ FT_{4} = XOR(CT_{0}, CT_{3}^{*}) \end{array}$

$\begin{array}{l} CT_{0}^{*} = rot(CT_{0}, 1) \\ CT_{2} = XOR(C_{0}, H_{0}, M_{0}, R_{0}, W_{0}) \\ FT_{1} = XOR(CT_{2}, CT_{0}^{*}) \end{array}$

$\begin{array}{l} CT_{2}^{*} = rot(CT_{2}, 1) \end{array}$
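The sequence of steps above can be traced in software. The sketch below follows only the steps shown on the slides, in the same order, and reuses a single variable for the rotated parity to mirror the idea of overwriting a value once it is no longer needed. It assumes 64-bit lanes and a left rotation for `rot`; the names `column_parity` and `parity_schedule` and the sample state are hypothetical, and the mapping to physical SRAM rows is not modeled.

```python
LANE_BITS = 64
LANE_MASK = (1 << LANE_BITS) - 1

# Lanes that feed each CT value, taken from the equations above.
COLUMNS = {0: "AFKPU", 1: "BGLQV", 2: "CHMRW", 3: "DINSX", 4: "EJOTY"}


def rot(lane: int, n: int) -> int:
    """Rotate a 64-bit lane left by n bits (rotation direction is an assumption)."""
    n %= LANE_BITS
    return ((lane << n) | (lane >> (LANE_BITS - n))) & LANE_MASK


def column_parity(state: dict, x: int) -> int:
    """CT_x: XOR of the five lanes listed for column x."""
    acc = 0
    for name in COLUMNS[x]:
        acc ^= state[name]
    return acc


def parity_schedule(state: dict):
    """Follow the slide order: CT4*, then (CT1, FT0), (CT3, FT2), (CT0, FT4), (CT2, FT1)."""
    ft = {}
    ct_rot = rot(column_parity(state, 4), 1)          # CT4*
    for src, dst in [(1, 0), (3, 2), (0, 4), (2, 1)]:
        ct = column_parity(state, src)
        ft[dst] = ct ^ ct_rot                         # FT_dst = XOR(CT_src, previous CT*)
        ct_rot = rot(ct, 1)                           # overwrite: the old CT* is no longer needed
    return ft, ct_rot                                 # the last value is CT2*


# Arbitrary sample state: lane names A..Y mapped to 64-bit values.
state = {chr(ord("A") + i): ((i + 1) * 0x9E3779B97F4A7C15) & LANE_MASK for i in range(25)}
ft, last_rot = parity_schedule(state)
print(sorted(ft))
```

Only one rotated parity is live at any time in this ordering, which is the kind of reuse an in-place strategy can exploit.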









### High-performance, energy-efficient and low-overhead hashing engine










# **Evaluation Methodology**

- Read and write latency:
  - PyMTL3 and OpenRAM for generating SRAM arrays
  - Synopsys Design Compiler for extracting latencies
  - Latencies of ReRAM arrays from the DESTINY simulator
- Area and energy numbers are simulated with the DESTINY simulator
  - Kilo Gate Equivalent (KGE) is used to decouple the area overhead from the technology node
- For an apples-to-apples comparison between designs
  - Inhale and SHINE are both evaluated in 28nm ReRAM and SRAM

References:

- Jiang, Shunning, et al. "PyMTL3: A Python framework for open-source hardware modeling, generation, simulation, and verification." MICRO'20.
- Guthaus, Matthew R., et al. "OpenRAM: An open-source memory compiler." ICCAD'16.
- Poremba, Matt, et al. "Destiny: A tool for modeling emerging 3d nvm and edram caches." DATE'15.
- Nagarajan, Karthikeyan, et al. "SHINE: A novel SHA-3 implementation using ReRAM-based in-memory computing." ISLPED'19.
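As a small worked example of the KGE normalization mentioned in the methodology, the sketch below uses the usual definition of a gate equivalent: design area divided by the area of a two-input NAND gate in the same technology, expressed in thousands. The function name and every area value are hypothetical placeholders, not figures from the evaluation.

```python
def kilo_gate_equivalents(design_area_um2: float, nand2_area_um2: float) -> float:
    """Area in KGE: design area over the NAND2 gate area of the same node, in thousands."""
    return design_area_um2 / nand2_area_um2 / 1000.0


# Hypothetical numbers: the same logic implemented in two technology nodes.
area_in_node_a = kilo_gate_equivalents(design_area_um2=5000.0, nand2_area_um2=0.5)
area_in_node_b = kilo_gate_equivalents(design_area_um2=20000.0, nand2_area_um2=2.0)

# Both normalize to the same KGE value even though the raw areas differ by 4x.
print(area_in_node_a, area_in_node_b)
```

Normalizing this way lets designs reported in different technology nodes be compared on logic complexity rather than raw silicon area.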















Figure panels: Inhale over Recryptor (JSSC'18); Inhale over SHINE (ISLPED'19)







#### With power constraint

- SHINE hits the power constraint earlier than *Inhale*

#### Without power constraint



- *Inhale* provides high performance, energy efficiency, and low overhead through an in-SRAM hashing engine
- Shift-optimized data alignment and an in-place read/write strategy are proposed to efficiently map the algorithm to the *Inhale* architecture
- *Inhale* achieves up to 14x higher throughput-per-area and 172x higher throughput-per-area-per-energy than the state of the art
- Future work: providing an end-to-end solution for IoT security and supporting other cryptographic operations

