# Convolutional Neural Networks and Related Topics Notes

## DSE (Design Space Exploration)

#### SqueezeNet

Bypass connections; sensitivity analysis (a DSE for pruning): add parameters back to the most sensitive units.

CNN DSE: microarchitectural exploration (inside layers), macroarchitectural exploration (combination of modules), model compression.

Related methods:

- Bayesian optimization: Practical Bayesian Optimization of Machine Learning Algorithms
- Simulated Annealing: An Optimization Method for Neural Network Weights and Architecture
- Randomized Search
- Genetic Algorithms: Evolving Neural Network Through Augmenting Topologies

Inspirations: DSD training; re-densifying and retraining from a sparse model can improve accuracy.

## Speed

### Fast Algorithms

#### Fast Algorithms for CNN

Winograd's minimal filtering algorithm.
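As a concrete instance, the 1-D minimal algorithm F(2,3) computes two outputs of a 3-tap filter with 4 multiplications instead of the naive 6. A pure-Python sketch (function and variable names are illustrative):

```python
def winograd_f23(d, g):
    """Winograd F(2,3): two outputs of a 1-D correlation with a 3-tap
    filter g over 4 inputs d, using 4 multiplications instead of 6."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * (g0 + g1 + g2) / 2
    m3 = (d2 - d1) * (g0 - g1 + g2) / 2
    m4 = (d1 - d3) * g2
    return [m1 + m2 + m3, m2 - m3 - m4]

# Check against the direct computation y[i] = sum_k d[i+k] * g[k].
d = [1.0, 2.0, 3.0, 4.0]
g = [0.5, -1.0, 2.0]
direct = [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]
assert winograd_f23(d, g) == direct
print(winograd_f23(d, g))  # [4.5, 6.0]
```

The filter-side transforms (the `/2` terms) depend only on `g`, so for convolution they are computed once and reused across the whole input.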

#### Fast ConvNets Using Groupwise Brain Damage

1. Train with a group-sparsity regularizer --> 2. sparsify with groupwise brain damage --> 3. fine-tune. Sparsification proceeds gradually, group by group.

Result: Shrink receptive fields towards center and make them circular.
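The group-sparsity regularizer in step 1 is a group lasso: a sum of L2 norms over weight groups, which drives whole groups to exact zero so they can be pruned together. A minimal numpy sketch (the row-wise grouping here is illustrative, not the paper's grouping):

```python
import numpy as np

def group_lasso_penalty(W, groups):
    """Group-sparsity (group-lasso) regularizer: the sum of L2 norms
    of weight groups. Penalizing each group's norm pushes entire
    groups to exact zero, enabling groupwise pruning."""
    w = W.ravel()
    return sum(np.linalg.norm(w[g]) for g in groups)

# Hypothetical grouping: each row of W is one group.
W = np.array([[3.0, 4.0], [0.0, 0.0]])
groups = [np.array([0, 1]), np.array([2, 3])]
penalty = group_lasso_penalty(W, groups)
print(penalty)  # 5.0 = ||(3, 4)||_2 + ||(0, 0)||_2
```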

#### Fast Training of CNN through FFTs

#### Minimizing Computation in CNN

#### Highway Networks

A new architecture to ease gradient-based training of very deep networks.

Transform H, transform gate T, carry gate C; the simplified form sets C = 1 - T, so the layer interpolates between transforming and carrying its input.

y_i = H_i(x) * T_i(x) + x_i * (1 - T_i(x))
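A numpy sketch of the forward pass (weight names and the tanh choice for H are illustrative). With a strongly negative gate bias, T is near 0 and the layer passes its input through almost unchanged, which is what eases training at depth:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_H, b_H, W_T, b_T):
    """One highway layer with carry gate C = 1 - T:
    y = H(x) * T(x) + x * (1 - T(x))."""
    H = np.tanh(W_H @ x + b_H)   # transform (tanh is illustrative)
    T = sigmoid(W_T @ x + b_T)   # transform gate in (0, 1)
    return H * T + x * (1.0 - T)

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
W_H, W_T = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))
b_H = np.zeros(4)
# Strongly negative gate bias -> T ~ 0 -> layer acts as the identity.
y = highway_layer(x, W_H, b_H, W_T, np.full(4, -30.0))
print(np.allclose(y, x))  # True
```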

### Distributed/Parallel Computing

#### 8-Bit Approximation for Parallelism in Deep Learning

8-bit data type: use bits of mantissa to represent a binary tree with interval (0.1, 1) which is bisected according to the route taken through the tree.
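A toy sketch of the bisection idea only (this ignores the sign and exponent bits of the actual 8-bit data type, and the bit width is illustrative): each mantissa bit picks the lower or upper half of the current interval, and decoding returns the final interval's midpoint.

```python
def encode_tree(value, bits=6, lo=0.1, hi=1.0):
    """Route through the binary tree: each bit bisects [lo, hi]."""
    route = 0
    for _ in range(bits):
        mid = (lo + hi) / 2
        route <<= 1
        if value >= mid:
            route |= 1
            lo = mid
        else:
            hi = mid
    return route

def decode_tree(route, bits=6, lo=0.1, hi=1.0):
    """Replay the route and return the final interval's midpoint."""
    for i in reversed(range(bits)):
        mid = (lo + hi) / 2
        if (route >> i) & 1:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

x = 0.37
x_hat = decode_tree(encode_tree(x))
# Quantization error is bounded by the final interval width.
print(abs(x - x_hat) < (1.0 - 0.1) / 2**6)  # True
```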

### Hardware Accelerator

#### Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing

Hardware acceleration; skip ineffectual computations (e.g., multiplications with zero-valued neurons).

Propose data structure format to enable seamless elimination of most zero-operand multiplications.

Zero skipping: 1. Lane decoupling: CNV is a dynamic hardware approach where zero neurons are eliminated at the output; only nonzero neurons appear in NB_in.

2. Storing input on-the-fly in appropriate format (ZFNAf).
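The zero-skipping idea can be sketched in software as storing only nonzero activations with their offsets, then multiply-accumulating over that compressed form. This is a loose software analogy to the ZFNAf format, not the hardware mechanism itself:

```python
import numpy as np

def compress(neurons):
    """Keep only nonzero activations as (values, offsets) pairs,
    loosely mimicking a zero-free neuron array format."""
    offsets = np.flatnonzero(neurons)
    return neurons[offsets], offsets

def sparse_dot(values, offsets, weights):
    """Multiply-accumulate over nonzero neurons only, skipping all
    zero-operand multiplications."""
    return float(np.dot(values, weights[offsets]))

neurons = np.array([0.0, 1.5, 0.0, 0.0, 2.0, 0.0])  # mostly zero, as after ReLU
weights = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
vals, offs = compress(neurons)
print(len(vals))                        # 2 multiplications instead of 6
print(sparse_dot(vals, offs, weights))  # matches np.dot(neurons, weights)
```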

#### General-Purpose Code Acceleration with Limited-Precision Analog Computation

Accelerates code that can tolerate imprecise execution.

Utilizes an algorithmic transformation that automatically converts approximable regions of code from a von Neumann model to "analog" neural model.

Result: 3.7x speed up, 6.3x energy saving, <10% quality loss.

#### Origami: A 803 GOp/s/W Convolutional Networks Accelerator

## Size

### Network/Software Level

#### XNOR-Net: ImageNet Classification Using Binary CNNs

Binary-Weight-Network & XNOR-Networks: both filters and input to conv layers are binary. 58x faster conv ops and 32x memory saving.

For efficient training and inference: design compact layers, quantize parameters, binarize networks.

Result: accuracy slightly below full-precision networks.
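The binary-weight approximation replaces each real-valued filter W with alpha * B, where B = sign(W) and alpha = mean(|W|) minimizes the L2 approximation error. A numpy sketch:

```python
import numpy as np

def binarize_weights(W):
    """Approximate W by alpha * B with B = sign(W) and
    alpha = mean(|W|), the optimal L2 scaling factor."""
    B = np.sign(W)
    B[B == 0] = 1  # arbitrary sign for exact zeros
    alpha = np.abs(W).mean()
    return alpha, B

W = np.array([[0.4, -0.2], [0.1, -0.5]])
alpha, B = binarize_weights(W)
print(alpha)  # ~0.3
print(B)      # [[ 1. -1.] [ 1. -1.]]
```

Storing only B (1 bit per weight) plus one alpha per filter is where the ~32x memory saving comes from.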

#### Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or -1

Reduce memory consumption and time complexity. Use a shift-based batch normalization transform.
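The paper binarizes with a deterministic sign function and backpropagates through it with a straight-through estimator: the gradient passes unchanged where |x| <= 1 and is cancelled where the hard-tanh would saturate. A numpy sketch of those two pieces:

```python
import numpy as np

def binarize(x):
    """Deterministic binarization to +1 / -1."""
    return np.where(x >= 0, 1.0, -1.0)

def binarize_grad(x, grad_out):
    """Straight-through estimator: pass the upstream gradient through
    sign() unchanged, but zero it where |x| > 1 (hard-tanh clipping)."""
    return grad_out * (np.abs(x) <= 1.0)

x = np.array([-1.7, -0.3, 0.0, 0.8, 2.4])
print(binarize(x))                        # [-1. -1.  1.  1.  1.]
print(binarize_grad(x, np.ones_like(x)))  # [0. 1. 1. 1. 0.]
```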

#### Model Compression

- SqueezeNet
- Network Pruning
- Quantization
- Deep Compression

### Memory/Hardware Level

#### DianNao series

Targets size, speed, and power. Focuses on the end user's side, where learning happens off-line: the accelerator handles only the feed-forward pass.

Components: NBin (neuron buffers), NBout, SB (synaptic weights buffer), NFU (Neural Functional Unit), Control Logic (CP). Use scratchpads for storage.

#### Minerva DNN Accelerator

Co-design flow spanning the algorithm, architecture, and circuit layers. Low-power accelerator with high accuracy.

Heterogeneous data type quantization; dynamic operation pruning; algorithm-aware fault mitigation.

Keras: software simulation. Aladdin: accelerator DSE. Fault mitigation: lowering the SRAM supply voltage raises the bit-fault rate.

Detection: Use Razor double-sampling method.

Correction: bit masking; faulty bits are replaced with the sign bit.
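Sign-bit masking can be sketched on a two's-complement word (the 8-bit width and fault positions below are illustrative): overwriting a faulty bit with the sign bit pushes the corrupted value toward zero instead of letting a flipped high bit blow it up.

```python
def mask_faults(value, fault_mask, width=8):
    """Sketch of sign-bit masking on a two's-complement value: bits
    flagged in fault_mask are overwritten with the sign bit, bounding
    the error by rounding the value toward zero."""
    raw = value & ((1 << width) - 1)      # two's-complement bit pattern
    sign = (raw >> (width - 1)) & 1
    if sign:
        raw |= fault_mask                 # negative: faulty bits -> 1
    else:
        raw &= ~fault_mask                # positive: faulty bits -> 0
    if raw >= 1 << (width - 1):           # convert back to signed
        raw -= 1 << width
    return raw

# A fault in bit 6 of +5 could otherwise read as +69; masking keeps +5.
print(mask_faults(5, fault_mask=1 << 6))   # 5
# For -5, forcing a faulty low bit to 1 pulls the value toward zero.
print(mask_faults(-5, fault_mask=1 << 2))  # -1
```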

#### EIE

#### Eyeriss

## Accuracy

#### Adaptive Dropout

Standout networks support "regularization by noise".
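In standout, each unit's keep probability is computed by an overlay network that shares weights with the model, roughly p_keep = sigmoid(alpha * (W @ x) + beta), so units with strong pre-activations are dropped less often. A hedged numpy sketch (alpha, beta, and the ReLU are illustrative choices):

```python
import numpy as np

def standout_mask(x, W, alpha=1.0, beta=0.0, rng=None):
    """Adaptive dropout (standout) sketch: keep probability per unit
    p_j = sigmoid(alpha * (W @ x)_j + beta), computed from the same
    weights W as the unit itself; returns a 0/1 mask and the probs."""
    if rng is None:
        rng = np.random.default_rng()
    p_keep = 1.0 / (1.0 + np.exp(-(alpha * (W @ x) + beta)))
    mask = (rng.random(p_keep.shape) < p_keep).astype(float)
    return mask, p_keep

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
x = rng.standard_normal(4)
mask, p_keep = standout_mask(x, W, rng=rng)
h = np.maximum(W @ x, 0) * mask  # masked hidden activations
print(p_keep.round(2))
```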

#### ResNet

## Others

#### Visualizing and Understanding Recurrent Networks

Analyze performance and shortcomings of LSTMs. Sequence learning.

GRU: Gated Recurrent Unit. LSTM: uses memory cells to remember long-range information and to track various attributes of the text it is currently processing.

#### Deep Visual-Semantic Alignments for Generating Image Descriptions

Introduce a multimodal RNN architecture.

Bidirectional recurrent neural network. Related to another paper (*Expressing an Image Stream with a Sequence of Natural Sentences*) about blog image description generation.

Takeaway: use multiple saccades around the image to identify all entities, their mutual interactions, and the wider context before generating a description.

#### Deep Fragment Embeddings for Bidirectional Image Sentence Mapping

Two objectives: a global ranking objective over image-sentence pairs, and a fragment alignment objective that learns the appearance of sentence fragments.

#### Emergence of Object-Selective Features in Unsupervised Feature Learning

Unsupervised feature learning methods can learn high-level features from completely unlabeled data.

Points: extremely large datasets; very large number of features.

Discovery: able to discover object-selective features with no labeled data, potentially perform better than basic supervised detectors.

1. Learn selective features with sparse coding and vector quantization (k-means); the choice of encoding matters more than the training method.

2. Combine selective features into invariant features (max-pooling).

#### Grounded Compositional Semantics for Finding and Describing Images with Sentences

Introduce DT-RNN (dependency-tree RNN) models --> better abstraction away from details of word order and syntactic expression.

Use multimodal embeddings, semantic DT-RNNs.

1. Image learning: first trained with an unsupervised objective on randomly sampled web images.

2. Multimodal mapping: inner products. Use ImageNet to adjust features.

### Binary Network Series

- Binary Quantization
- Expectation Backpropagation: Parameter-Free Training of Multilayer Neural Networks with Continuous or Discrete Weights
- Backpropagation for Energy-Efficient Neuromorphic Computing
- BinaryConnect: Training Deep Neural Networks with Binary Weights During Propagations (not very successful on large-scale datasets)

### Related Papers

- RedEye: Analog ConvNet Image Sensor Architecture for Continuous Mobile Vision
- Learning to Detect Unseen Object Classes by Between-Class Attribute Transfer
- Quantized Convolutional Neural Networks for Mobile Devices
- DEEP-CARVING: Discovering Visual Attributes by Carving Deep Neural Nets
- Expressing an Image Stream with a Sequence of Natural Sentences
- CryptoNets: Applying Neural Networks to Encrypted Data with High Throughput and Accuracy
- Understanding Deep Image Representations by Inverting Them