# CS249r: Frameworks

Sep. 25, 2023

### Goals for today

1. Logistics
2. Lecture
3. Paper discussions
4. Guest speaker

## **Course Logistics**

### **Class Schedule**

|                             | Start    | ~End    |
|-----------------------------|----------|---------|
| Lecture                     | 12:45 PM | 1:40 PM |
| Break                       | 1:40 PM  | 1:45 PM |
| Paper Discussion (combined) | 1:45 PM  | 2:25 PM |
| Break                       | 2:25 PM  | 2:30 PM |
| Guest Lecture               | 2:30 PM  | 3:30 PM |

### My Office Hours

- Every Monday
  - After class @ 3:30pm
  - Meet in class / office
  - My focus
    - Technical material
    - Lecture content
    - Paper readings / discussions etc.

### **Updated Office Hours**



#### Matthew Stewart

Postdoctoral Researcher, Harvard University

Office Hours: 5-6 PM Thursday (SEC 1.412)

#### Ikechukwu Uchendu

Computer Science, PhD Student, Harvard University

Office Hours: 9-10 AM Wednesday (TBD)



#### Jason Jabbour

Computer Science, PhD Student, Harvard University

Office Hours: 11 AM-12 PM Friday (TBD)



#### Jessica Quaye

Computer Science, PhD Student, Harvard University

Office Hours: 1-2 PM Tuesday (SEC 5.403)

### Paper Discussion Sign-up Sheet

1. Please take a few minutes to sign up for next week's paper discussion.
2. Submit to Canvas:
   - Paper Reading Group 1 for your first time
   - Paper Reading Group 2 for your second time
3. Extra credit for signing up for a 3rd spot: 2.5% credit!



### Assignment #1

- Any issues with the hardware?
  - To be expected when working with embedded systems
- New Due Dates:
  - **Part 1:** Oct 2nd
  - **Part 2:** Oct 10th (due to Indigenous Peoples' Day)
- Questions?



### Book

#### Updates

- Week of topic
  1. Meet with Matthew & VJ
  2. Create a Google Doc
  3. Share with staff
  4. Iterate on rough outline
  5. Start drafting collaboratively
  6. Push to GitHub
  7. Peer review
  - vii. Peer review

#### Book Table of Contents

- **Front Matter:** Preface, Dedication, Acknowledgements, Copyright, About This Book
- **Welcome:** 1 Introduction, 2 Embedded Systems, 3 Deep Learning Primer, 4 Embedded ML
- **Deep Dive:** 5 ML Workflow, 6 Data Engineering, 7 Pre-processing, 8 ML Frameworks, 9 Model Training, 10 Efficient AI, 11 Optimizations, 12 Deployment, 13 On-Device Learning, 14 Hardware Acceleration, 15 MLOps, 16 Privacy and Security, 17 AI Sustainability, 18 Responsible AI, 19 Generative AI
- **References**
- **Appendices:** A Tools, B Resources, C Communities, D Case Studies

#### Embedded AI: Principles, Algorithms, and Applications

#### Preface

In "Embedded AI: Principles, Algorithms, and Applications", we will embark on a critical exploration of the rapidly evolving field of artificial intelligence in the context of embedded systems, originally nurtured from the foundational course, TinyML from CS249r.

The goal of this book is to bring about a collaborative endeavor with insights and contributions from students, practitioners and the wider community, blossoming into a comprehensive guide that delves into the principles governing embedded AI and its myriad applications.

"If you want to go fast, go alone, if you want to go far, go together." – African Proverb

As a living document, this open-source textbook aims to bridge gaps and foster innovation by being globally accessible and continually updated, addressing the pressing need for a centralized resource in this dynamic field. With a rich tapestry of knowledge woven from various expert perspectives, readers can anticipate a guided journey that unveils the intricate dance between cutting-edge algorithms and the principles that ground them, paving the way for the next wave of technological transformation.

#### The Philosophy Behind the Book

In a world where technology perpetually reshapes itself, fostering an ecosystem of open collaboration and knowledge sharing stands as the cornerstone of innovation. This philosophy fuels the creation of "Embedded AI: Principles, Algorithms, and Applications", a venture that transcends conventional textbook paradigms to foster a living repository of knowledge. Anchoring its content on principles, algorithms, and applications, the book aims to cultivate a deep-rooted understanding that empowers individuals to navigate the fluid landscape of embedded AI with agility and foresight. By embracing an open approach, we not only democratize learning but also pave avenues for fresh perspectives and iterative enhancements, thus fostering a community where knowledge is not confined but is nurtured to grow, adapt, and illuminate the path of progress in embedded AI technologies globally.

### Project Sign-Ups

- Join the #projects channel and post your interests
- Once you find a group, sign up on this spreadsheet:
  - (same sheet used for scribing sign-up)













C array models




### Framework Differences















| Model                                 | TensorFlow                    | TensorFlow Lite     | TensorFlow Lite Micro             |
|---------------------------------------|-------------------------------|---------------------|-----------------------------------|
| Training                              | Yes                           | No                  | No                                |
| Inference                             | Yes (but inefficient on edge) | Yes (and efficient) | Yes (and even **more** efficient) |
| How Many Ops                          | ~1400                         | ~130                | ~50                               |
| Native Quantization Tooling + Support | No                            | Yes                 | Yes                               |

| Software                   | TensorFlow | TensorFlow Lite | TensorFlow Lite Micro |
|----------------------------|------------|-----------------|-----------------------|
| Needs an OS                | Yes        | Yes             | No                    |
| Memory Mapping of Models   | No         | Yes             | Yes                   |
| Delegation to Accelerators | Yes        | Yes             | No                    |

| Hardware                | TensorFlow      | TensorFlow Lite   | TensorFlow Lite Micro    |
|-------------------------|-----------------|-------------------|--------------------------|
| Base Binary Size        | 3 MB+           | 100 KB            | ~10 KB                   |
| Base Memory Footprint   | ~5 MB           | 300 KB            | 20 KB                    |
| Optimized Architectures | x86, TPUs, GPUs | Arm Cortex-A, x86 | Arm Cortex-M, DSPs, MCUs |

# Keyword Spotting Example

### Keyword Spotting Components



### Device Microphone






### Audio Provider



### Feature **Extractor**





### Data Preprocessing: Spectrograms






### TFLite Micro Interpreter



### Model



### Model **Operators**





TinyConv Keyword Spotting Model

### Command Recognizer





### Command Responder



#### Device Response










## **TensorFlow Lite Micro**

Embedded Machine Learning on TinyML Systems











*Figure: STM32H747xI/G block diagram. Dual cores (Arm Cortex-M7 at 480 MHz and Arm Cortex-M4 at 240 MHz), 2-Mbyte dual-bank Flash memory, 1056 KB of RAM (including 64 KB ITCM), plus system, control, connectivity, and analog peripherals.*

**STM32H747xI/G** devices are based on the high-performance **Arm**<sup>®</sup> **Cortex®-M7** and **Cortex®-M4 32-bit RISC** cores. The Cortex®-M7 core operates at up to 480 MHz and the Cortex®-M4 core at up to 240 MHz. Both cores feature a **floating point unit (FPU)** which supports Arm<sup>®</sup> single- and double-precision (Cortex®-M7 core) operations and conversions (IEEE 754 compliant), including a full set of **DSP instructions** and a memory protection unit (MPU) to enhance application security.

STM32H747xI/G devices incorporate high-speed embedded memories with a dual-bank **flash memory of up to 2 Mbytes**, up to **1 Mbyte of RAM** (including 192 Kbytes of **TCM RAM**, up to 864 Kbytes of user **SRAM** and 4 Kbytes of backup SRAM), as well as an extensive range of enhanced **I/Os and peripherals** connected to APB buses, AHB buses, 2x32-bit multi-AHB bus matrix and a multi layer AXI interconnect supporting internal and external memory access.

All the devices offer three **ADCs**, two **DACs**, two ultra-low power **comparators**, a low-power RTC, a high-resolution timer, 12 general-purpose 16-bit **timers**, two PWM timers for motor control, five low-power timers, a true random number generator (RNG). The devices support four digital filters for external sigma-delta modulators (DFSDM). They also feature standard and advanced communication interfaces.









...

#### How do you use **TFL Micro**?



# **TFLite Micro: Interpreter**

## TFLite Micro Design

- TFLite Micro uses an **interpreter** design
- Store the model as data and loop through its ops at runtime



dispatch **loop** → instruction **ops**

### Interpreter vs. Compiler

- Interpreter: the dispatch loop reads ops and executes them at runtime (generally slower than compiled code)
- Compiler: code → compiled machine code (generally faster than interpreted code)



#### ML is **Different**

- Each layer, such as a Conv or softmax, can take tens of thousands or even millions of cycles to complete execution
- Parsing overhead is relatively small for the TFLite Micro interpreter when we consider the overall network graph

| Model                      | Total<br>Cycles | Calculation<br>Cycles | Interpreter<br>Overhead |
|----------------------------|-----------------|-----------------------|-------------------------|
| Visual Wake<br>Words (Ref) | 18,990.8K       | 18,987.1K             | < 0.1%                  |
| Google<br>Hotword<br>(Ref) | 36.4K           | 34.9K                 | 4.1%                    |
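
The overhead column follows directly from the two cycle counts: overhead = (total cycles - calculation cycles) / total cycles. A minimal sketch of that arithmetic, assuming cycle counts in thousands as in the table (the helper name `InterpreterOverheadPercent` is illustrative and not part of TFLite Micro):

```cpp
#include <cassert>

// Illustrative helper (not a TFLite Micro API): share of total cycles that is
// spent outside the operator kernels, expressed as a percentage.
double InterpreterOverheadPercent(double total_kcycles, double calc_kcycles) {
  return 100.0 * (total_kcycles - calc_kcycles) / total_kcycles;
}
```

With the table's numbers, `InterpreterOverheadPercent(18990.8, 18987.1)` is about 0.02 (reported as < 0.1%), and `InterpreterOverheadPercent(36.4, 34.9)` is about 4.1.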



#### SparkFun Edge 2 (Apollo3 **Cortex-M4**)





#### Interpreter Advantages

- Change the model without recompiling the code
- Same operator code can be used across multiple different models in the system
- Same portable model serialization format can be used across many systems, for example:
  - Arduino Nano 33 BLE Sense
  - Himax WE-I Plus EVB
  - Espressif ESP-EYE
  - SparkFun Edge 2

#### TFLite Micro Interpreter Execution

```
if (op_type == CONV2D) {
   Convolution2d(conv_size, input, output, weights);
} else if (op_type == FULLY_CONNECTED) {
   FullyConnected(input, output, weights);
}
```
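To make the dispatch idea concrete, here is a small self-contained sketch of an interpreter loop. It is a toy under stated assumptions: the op codes (`kAddConst`, `kScale`, `kRelu`), the `Op` record, and `RunModel` are invented for illustration and are far simpler than the real TFLite Micro interpreter. The point it shows is that the model is plain data, so swapping the op list changes behavior without recompiling the kernels.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical op codes and op record; a real model format carries much more.
enum class OpType { kAddConst, kScale, kRelu };

struct Op {
  OpType type;
  float param;  // constant operand; ignored by kRelu
};

// Walks the op list in order and applies each op to the tensor in place.
void RunModel(const std::vector<Op>& model, std::vector<float>& tensor) {
  for (const Op& op : model) {  // dispatch loop
    switch (op.type) {          // select the kernel for this op
      case OpType::kAddConst:
        for (float& v : tensor) v += op.param;
        break;
      case OpType::kScale:
        for (float& v : tensor) v *= op.param;
        break;
      case OpType::kRelu:
        for (float& v : tensor) v = v > 0.0f ? v : 0.0f;
        break;
    }
  }
}
```

In a real network each kernel runs for thousands of cycles or more, so the cost of selecting it is small by comparison.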

## TFLite Micro: Model Format

The FlatBuffer File Format

```
// Map the model into a usable data structure. This doesn't involve any
// copying or parsing, it's a very lightweight operation.
model = tflite::GetModel(g_model);
if (model->version() != TFLITE_SCHEMA_VERSION) {
  TF_LITE_REPORT_ERROR(error_reporter,
                       "Model provided is schema version %d not equal "
                       "to supported version %d.",
                       model->version(), TFLITE_SCHEMA_VERSION);
  return;
}
```



#### How is **g\_model** stored?





Serialization



#### **Serialization** Libraries

- JSON
- ProtoBuf
- FlatBuffer









# Hardware & Software Limitations

- Limited OS support
- Limited compute
- Limited memory



## What is **g\_model**?

- Array of bytes, and acts as the equivalent of a file on disk
- Holds all of the information about the model, its operators, their connections, and the trained weights

`alignas(8) const unsigned char g_model[] = {`

# FlatBuffers

- Does **not require copies** to be made before using the data inside the model
- The format is formally specified as a schema file
- Schema file is used to automatically generate code to access the information in the model byte array
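
The "no copies" property can be illustrated with a toy byte layout. The sketch below is an assumption-laden stand-in: the layout, `kToyModel`, `ReadU32`, and `ToyModelView` are invented here and are not the real FlatBuffers wire format or generated code. It only shows the idea of accessor functions that read fields in place from a constant byte array.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Toy layout, NOT the real FlatBuffers wire format:
//   bytes 0..3  schema version (little-endian uint32)
//   bytes 4..7  byte offset of the weight data (little-endian uint32)
//   bytes 8..   int8 weights
alignas(8) const unsigned char kToyModel[] = {
    0x03, 0x00, 0x00, 0x00,  // version = 3
    0x08, 0x00, 0x00, 0x00,  // weights start at byte 8
    0x22, 0x13, 0xF3,        // weights: 34, 19, -13 (0xF3 as a signed byte)
};

// Decodes a little-endian uint32 directly from the buffer.
inline uint32_t ReadU32(const unsigned char* p) {
  return static_cast<uint32_t>(p[0]) | (static_cast<uint32_t>(p[1]) << 8) |
         (static_cast<uint32_t>(p[2]) << 16) |
         (static_cast<uint32_t>(p[3]) << 24);
}

// A view stores only a pointer into the buffer. Nothing is copied or parsed
// up front; each accessor reads the bytes it needs when called.
struct ToyModelView {
  const unsigned char* data;
  uint32_t version() const { return ReadU32(data); }
  const int8_t* weights() const {
    return reinterpret_cast<const int8_t*>(data + ReadU32(data + 4));
  }
};
```

Because the accessors return pointers into the original array, the weights can stay in read-only memory instead of being duplicated in RAM.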



# g\_model FlatBuffer Format

Metadata (version, quantization ranges, etc)

| Name    | Args | Input | Output | Weights |
|---------|------|-------|--------|---------|
| Conv2D  | 3x3  | 0     | 1      | 2       |
| FC      | -    | 1     | 3      | 4       |
| Softmax | -    | 3     | 5      | -       |

**Weight Buffers**

| Index | Type  | Values               |
|-------|-------|----------------------|
| 2     | Float | 0.01, 7.45, 9.23, ... |
| 4     | Int8  | 34, 19, 243, ...     |

# TFLite Micro: Memory Allocation

#### The Tensor Arena

# Why Care About **Memory**?

- Embedded systems typically have only hundreds or tens of kilobytes of RAM
- Easy to hit memory limits when building an end-to-end application
- So any framework that integrates with embedded products **must offer control over memory usage**



# **Long-Running** Applications

- Products are expected to run for months or even years, which poses challenges for memory allocation
- Need to guarantee that memory will not become fragmented; with fragmentation, a contiguous block cannot be allocated even if there's enough free memory overall
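
Fragmentation is easy to see on a toy heap. The sketch below assumes a heap modeled as a list of equal-sized blocks; `FreeStats` and `MeasureFree` are illustrative names, not part of any framework. It reports both the total free space and the largest contiguous free run, which are the two quantities that diverge when a heap fragments.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct FreeStats {
  std::size_t total_free;   // free blocks anywhere in the heap
  std::size_t largest_run;  // longest stretch of adjacent free blocks
};

// `used[i]` is true when block i of the toy heap is currently allocated.
FreeStats MeasureFree(const std::vector<bool>& used) {
  FreeStats stats{0, 0};
  std::size_t run = 0;
  for (bool u : used) {
    if (u) {
      run = 0;  // an allocated block ends the current free run
    } else {
      ++run;
      ++stats.total_free;
      stats.largest_run = std::max(stats.largest_run, run);
    }
  }
  return stats;
}
```

For a heap where every other block is still in use, half of the memory is free but the largest run is a single block, so a request for two contiguous blocks fails.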



# Lack of OS Support

- In embedded systems, the standard C and C++ memory APIs (malloc and new) rely on operating system support
- Many devices have no OS, or have very limited functionality



Nano 33 BLE Sense Hardware

# How TFL Micro solves these challenges

1. Ask developers to **supply a contiguous area of memory** to the interpreter, and in return the framework avoids any other memory allocations
2. Framework guarantees that it won't allocate from this "arena" after initialization, so long-running applications won't fail due to fragmentation
3. Ensures a clear budget for the memory used by ML, and that the framework has no dependency on the OS facilities needed by malloc or new

```
constexpr int kTensorArenaSize = 2000;
uint8_t tensor_arena[kTensorArenaSize];
```

```
static tflite::MicroInterpreter static_interpreter(model, resolver,
    tensor_arena, kTensorArenaSize, error_reporter);
```
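
The arena idea can be sketched as a bump allocator over a caller-supplied buffer. This is a simplified illustration, not TFLite Micro's actual allocator: the class `ToyArena`, its methods, and the 16-byte default alignment are assumptions made for the example. It shows why fragmentation cannot occur (nothing is ever freed) and why no OS heap is needed (all memory comes from one static array).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Simplified bump allocator over a caller-supplied buffer.
class ToyArena {
 public:
  ToyArena(uint8_t* buffer, std::size_t size) : buffer_(buffer), size_(size) {}

  // Returns `bytes` of memory aligned to `align` (a power of two), or nullptr
  // if the arena cannot satisfy the request. Allocations are never freed.
  void* Allocate(std::size_t bytes, std::size_t align = 16) {
    const uintptr_t base = reinterpret_cast<uintptr_t>(buffer_);
    const uintptr_t mask = static_cast<uintptr_t>(align) - 1;
    const uintptr_t aligned = (base + used_ + mask) & ~mask;
    const std::size_t start = static_cast<std::size_t>(aligned - base);
    if (start > size_ || bytes > size_ - start) return nullptr;
    used_ = start + bytes;
    return buffer_ + start;
  }

  // High-water mark: how much of the arena has been consumed so far.
  std::size_t used_bytes() const { return used_; }

 private:
  uint8_t* buffer_;
  std::size_t size_;
  std::size_t used_ = 0;
};
```

A failed request leaves the arena unchanged, so the caller gets a clear out-of-budget signal at initialization time rather than a failure months into operation.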

#### `uint8_t tensor_arena[kTensorArenaSize]`

| Operator Variables | Interpreter State | Operator Inputs and Outputs |
|--------------------|-------------------|-----------------------------|

# Arena **size**?

- Depends on what ops are in the model (and the parameters of those operations)
- Size of operator inputs and outputs is platform independent, but different devices can have different operator implementations
- → hard to forecast the exact size of arena needed

`constexpr int kTensorArenaSize = 2000;`

`uint8_t tensor_arena[kTensorArenaSize];`

# Solution

- Create as large an arena as you can and run your program on-device
- Use the `arena_used_bytes()` function to get the actual size used
- Resize the arena to that length and rebuild
- Best to do this on your deployment platform, since different op implementations may need varying scratch buffer sizes

`constexpr int kTensorArenaSize = 6000;`

`uint8_t tensor_arena[kTensorArenaSize];`
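
The resize step can be sketched as a small compile-time helper. Everything here is an assumption for illustration: `SuggestArenaSize` is a hypothetical function, and the optional headroom and the rounding to a multiple of 16 bytes are choices made for this example rather than rules from the framework. The input is the number you read back from `arena_used_bytes()` on the target device.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: turn the measured arena usage into a new arena size by
// adding optional headroom and rounding up to a multiple of `align`.
constexpr std::size_t SuggestArenaSize(std::size_t used_bytes,
                                       std::size_t headroom_bytes = 0,
                                       std::size_t align = 16) {
  return (used_bytes + headroom_bytes + align - 1) / align * align;
}
```

For example, a measured usage of 2468 bytes gives `SuggestArenaSize(2468) == 2480`, which could then be used as the new `kTensorArenaSize` before rebuilding.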

# TFLite Micro: NN Operations

The OpsResolver

# Why Care About **Binary Size**?

- **Executable code** used by a framework takes up space in Flash
- Flash is a limited resource on embedded devices, often with just tens of kilobytes available
- If compiled code is too large, it won't be usable by applications.



# **Optimizing** Operator Usage in TFL Micro

- There are many operators in TensorFlow (~1400 and growing)
- Not all operators are used or even needed to perform inference
- Bring in or load only the operators the model actually needs, to conserve memory



# How to **Reduce the Size** Taken by Ops?

Allow developers to specify which ops they want to be included in the binary

```
tflite::MicroMutableOpResolver<4>
op_resolver(error_reporter);
if (op_resolver.AddDepthwiseConv2D() != kTfLiteOk) {
    return;
}
```







```
static tflite::MicroMutableOpResolver<4> micro_op_resolver(error_reporter);

// Register only the four ops this model uses.
if (micro_op_resolver.AddDepthwiseConv2D() != kTfLiteOk) {
  return;
}
// FullyConnected layer in this model: W (4×4000), B (4)
if (micro_op_resolver.AddFullyConnected() != kTfLiteOk) {
  return;
}
if (micro_op_resolver.AddSoftmax() != kTfLiteOk) {
  return;
}
if (micro_op_resolver.AddReshape() != kTfLiteOk) {
  return;
}
```
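To show why selective registration can shrink the binary, here is a toy fixed-capacity registry in the same spirit. It is a sketch only: `ToyOpResolver`, `ToyStatus`, `ToyKernel`, and `ToyRelu` are invented names, and the real `MicroMutableOpResolver` differs in its details. The capacity is a template parameter, so storage is fixed at compile time, and only kernels that are explicitly added are referenced by the program, which typically lets the linker discard the rest.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

enum ToyStatus { kToyOk, kToyError };
using ToyKernel = void (*)(const float* in, float* out, std::size_t n);

// Fixed-capacity op registry with no dynamic allocation.
template <std::size_t kMaxOps>
class ToyOpResolver {
 public:
  // Registers `kernel` under `name`; fails once the capacity is exhausted.
  ToyStatus Add(const char* name, ToyKernel kernel) {
    if (count_ >= kMaxOps) return kToyError;
    names_[count_] = name;
    kernels_[count_] = kernel;
    ++count_;
    return kToyOk;
  }

  // Returns the kernel registered under `name`, or nullptr if it was not added.
  ToyKernel Find(const char* name) const {
    for (std::size_t i = 0; i < count_; ++i) {
      if (std::strcmp(names_[i], name) == 0) return kernels_[i];
    }
    return nullptr;
  }

 private:
  const char* names_[kMaxOps] = {};
  ToyKernel kernels_[kMaxOps] = {};
  std::size_t count_ = 0;
};

// Example kernel used to exercise the registry.
inline void ToyRelu(const float* in, float* out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) out[i] = in[i] > 0.0f ? in[i] : 0.0f;
}
```

Checking the return value of each `Add` call, as the slide code does, catches the case where the declared capacity is smaller than the number of ops the model needs.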

# Which Ops to Include?



https://netron.app



If memory is not an issue, you can choose to simply include all operators, both used and unused, at the expense of increased memory consumption

`static tflite::AllOpsResolver resolver;`

`// Build an interpreter to run the model with.`

`static tflite::MicroInterpreter static_interpreter(model, resolver, tensor_arena, kTensorArenaSize, error_reporter);`

`interpreter = &static_interpreter;`

# Memory Improvements

- Selective op registration reduces memory consumption by 30%
- Memory reduction varies by model, depending on the operators used by the model



# In Summary, what is TensorFlow Lite Micro?

**Compatible** with the TensorFlow training environment. Built to fit on **embedded systems**:



- Very small binary footprint
- No dynamic memory allocation
- **No** dependencies on complex parts of the standard C/C++ libraries
- No operating system dependencies, can run on bare metal
- Designed to be **portable** across a wide variety of systems

# Thank You!

# Paper Discussions

## Paper Discussion #1 - TFLite



### Paper Discussion #2 - MCUNet



# **Guest Speaker**

# Tianqi Chen

Tianqi Chen is a distinguished researcher, primarily recognized as the creator of TVM, an open-source machine learning compiler stack designed to enable efficient deployment of deep learning models on a variety of hardware platforms. Chen's contributions have been pivotal in advancing machine learning frameworks.


