Convolutional Neural Networks and Related Topics Notes
DSE (Design Space Exploration)
Bypass connections; sensitivity analysis (explores the design space for pruning). Add parameters to the most sensitive units.
CNN DSE: microarchitectural exploration (inside layers), macroarchitectural exploration (combination of modules), model compression.
- Bayesian optimization: Practical Bayesian Optimization of Machine Learning Algorithms
- Simulated Annealing: An Optimization Method for Neural Network Weights and Architecture
- Randomized Search
- Genetic Algorithms: Evolving Neural Networks through Augmenting Topologies
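The search strategies above can be sketched with the simplest one, randomized search. Here `score_fn` and `space` are hypothetical stand-ins for a real design-space evaluation (e.g. training a candidate architecture and returning its accuracy):

```python
import random

def random_search(score_fn, space, n_trials=50, seed=0):
    """Randomized search over a discrete design space.

    `space` maps each design parameter to a list of candidate values;
    `score_fn` rates one sampled configuration (higher is better).
    """
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = {name: rng.choice(choices) for name, choices in space.items()}
        score = score_fn(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

Bayesian optimization, simulated annealing, and genetic algorithms replace the independent uniform sampling here with a model- or population-guided proposal, but keep the same evaluate-and-keep-best loop.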
Inspirations: DSD training; re-densifying and retraining from a sparse model can improve accuracy.
Winograd's minimal filtering algorithm.
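The classic one-dimensional instance of Winograd's minimal filtering is F(2,3), which produces two outputs of a 3-tap filter with 4 multiplications instead of 6 (the 2D convolution case nests this construction):

```python
def winograd_f23(d, g):
    """Winograd minimal filtering F(2,3): two outputs of a 3-tap FIR
    filter g over input tile d (length 4), using 4 multiplications.

    Direct computation would need 6:
      y0 = d0*g0 + d1*g1 + d2*g2
      y1 = d1*g0 + d2*g1 + d3*g2
    """
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]
```

The filter-dependent factors (g[0]+g[1]+g[2])/2 etc. can be precomputed once per filter, which is what makes the saving real for convolution layers.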
1. Train with a group-sparsity regularizer --> 2. Sparsify with group-wise brain damage. Fine-tune; gradual group-wise sparsification.
Result: receptive fields shrink towards the center and become circular.
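A minimal sketch of the group-sparsity regularizer in step 1. How weights are partitioned into groups is an assumption here (grouping by spatial position within a filter is one natural choice, consistent with the shrinking receptive fields noted above):

```python
import numpy as np

def group_lasso_penalty(weight_groups, lam=1e-3):
    """Group-sparsity regularizer: lam * sum of L2 norms over groups.

    Because the L2 norm is non-differentiable at zero, gradient descent
    drives entire groups exactly to zero, enabling group-wise pruning.
    """
    return lam * sum(np.linalg.norm(g) for g in weight_groups)
```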
A new architecture to ease gradient-based training of very deep networks.
Transform H (an affine projection followed by a nonlinearity). Transform gate T. Carry gate C, set to 1 - T:
y_i = H_i(x) * T_i(x) + x_i * (1 - T_i(x))
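The highway computation can be sketched directly in numpy; the tanh inside H and the sigmoid gate follow the usual formulation, while the exact shapes here are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_h, b_h, W_t, b_t):
    """One highway layer: y = H(x) * T(x) + x * (1 - T(x))."""
    h = np.tanh(x @ W_h + b_h)    # transform H(x)
    t = sigmoid(x @ W_t + b_t)    # transform gate T(x); carry gate is 1 - t
    return h * t + x * (1.0 - t)
```

Initializing b_t to a large negative value makes T close to 0, so each layer starts as an identity map; that is what eases gradient-based training of very deep stacks.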
8-bit data type: use the mantissa bits to represent a binary tree over the interval (0.1, 1), which is bisected according to the route taken through the tree.
Hardware acceleration: eliminate ineffectual computations.
Proposes a data-structure format that enables seamless elimination of most zero-operand multiplications.
Zero skipping: 1. Lane decoupling: CNV is a dynamic hardware approach where zero neurons are eliminated at the output, so only nonzero neurons appear in NBin.
2. Store the input on the fly in the appropriate format (ZFNAf).
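In software terms, the zero-free format and the decoupled lanes amount to storing only the nonzero neurons together with their offsets, and multiplying against just those. A heavily simplified sketch (the real ZFNAf is a per-brick hardware encoding, not a Python list):

```python
def compress_nonzero(neurons):
    """ZFNAf-like idea, simplified: keep (value, offset) pairs for
    nonzero neurons only, so zero operands never reach the multipliers."""
    return [(v, i) for i, v in enumerate(neurons) if v != 0]

def sparse_dot(compressed, weights):
    """Multiply-accumulate over nonzero neurons only (zero skipping)."""
    return sum(v * weights[i] for v, i in compressed)
```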
Accelerates code that can tolerate imprecise execution.
Uses an algorithmic transformation that automatically converts approximable regions of code from a von Neumann model to an "analog" neural model.
Result: 3.7x speedup, 6.3x energy saving, <10% quality loss.
Binary-Weight-Networks: binary filters (roughly 32x memory saving). XNOR-Networks: both filters and inputs to conv layers are binary (roughly 58x faster conv ops).
For efficient training and inference: designing compact layers, quantizing params, network binarization.
Result: accuracy slightly worse than full-precision networks.
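A sketch of the Binary-Weight-Network approximation: each real-valued filter W is replaced by a scaled sign matrix, W ≈ alpha * B with alpha = mean(|W|):

```python
import numpy as np

def binarize_weights(W):
    """Binary-Weight-Network style approximation W ≈ alpha * B,
    with B in {-1, +1} and alpha the mean absolute weight."""
    alpha = float(np.mean(np.abs(W)))
    B = np.where(W >= 0, 1.0, -1.0)
    return alpha, B
```

Only alpha and the sign bits need to be stored, which is where the roughly 32x memory saving comes from; XNOR-Networks additionally binarize the inputs, so convolutions reduce to XNOR and popcount operations.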
Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or -1
Reduces memory consumption and time complexity. Uses a shift-based batch-normalization transform.
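A sketch of two core operations, assuming the deterministic sign binarization and the approximate power-of-2 scaling that the shift-based transforms rely on (multiplying by a power of 2 is just a bit shift in hardware):

```python
import numpy as np

def binarize(x):
    """Deterministic binarization: sign(x), with sign(0) mapped to +1."""
    return np.where(x >= 0, 1.0, -1.0)

def ap2(x):
    """Approximate power-of-2: sign(x) * 2**round(log2 |x|), for nonzero x.
    Rounding a multiplier to ap2 of it turns the multiply into a shift."""
    return np.sign(x) * 2.0 ** np.round(np.log2(np.abs(x)))
```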
- Network Pruning
- Deep Compression
Size, speed, power. Focus on the end user's off-line setting; feed-forward (inference) only.
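The pruning stage can be sketched as simple magnitude pruning: zero out the smallest-magnitude weights until a target sparsity is reached (the target fraction here is illustrative; in practice pruning is followed by retraining):

```python
import numpy as np

def magnitude_prune(W, sparsity=0.9):
    """Zero out the smallest-magnitude entries of W so that roughly
    a `sparsity` fraction of the weights become zero."""
    k = int(sparsity * W.size)
    if k == 0:
        return W.copy()
    thresh = np.partition(np.abs(W).ravel(), k - 1)[k - 1]
    return np.where(np.abs(W) <= thresh, 0.0, W)
```

Deep Compression then stacks two further stages on top of the pruned model: weight quantization with sharing, followed by Huffman coding of the remaining values.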
Components: NBin and NBout (input and output neuron buffers), SB (synaptic-weight buffer), NFU (Neural Functional Unit), and control logic (CP). Uses scratchpads for storage.
Co-design flow across the algorithm, architecture, and circuit layers. Low-power accelerator with high accuracy.
Heterogeneous data type quantization; dynamic operation pruning; algorithm-aware fault mitigation.
Keras: software simulation. Aladdin: accelerator DSE. Fault mitigation: lower SRAM supply voltage --> higher bit-fault rates.
Detection: use the Razor double-sampling method.
Correction: bit masking; replace faulty bits with the sign bit.
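The bit-masking correction can be sketched on a two's-complement word: every bit position flagged as faulty is overwritten with the word's sign bit, which biases the error toward smaller magnitude. The 8-bit width and the encoding details here are illustrative assumptions:

```python
def mask_faulty_bits(value, fault_positions, width=8):
    """Replace faulty bit positions with the sign bit of a
    two's-complement word of the given width."""
    word = value & ((1 << width) - 1)   # encode as unsigned two's complement
    sign = (word >> (width - 1)) & 1
    for pos in fault_positions:
        if sign:
            word |= 1 << pos            # negative value: faulty bits -> 1
        else:
            word &= ~(1 << pos)         # non-negative value: faulty bits -> 0
    if word >> (width - 1):             # decode back to signed
        word -= 1 << width
    return word
```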
Standout (adaptive dropout) networks support "regularization by noise".
Analyzes the performance and shortcomings of LSTMs for sequence learning.
GRU: Gated Recurrent Unit. LSTM: uses memory cells to remember long-range information and to keep track of various attributes of the text it is currently processing.
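One LSTM step in the standard formulation; the weight layout is a sketch (a single matrix W stacking the input, forget, output, and candidate blocks is one common convention, not the only one):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x, h, c, W, b):
    """One LSTM step: the memory cell c carries long-range information,
    while gates i/f/o control what is written, kept, and exposed."""
    d = len(h)
    z = np.concatenate([x, h]) @ W + b   # W: (len(x)+d, 4d), b: (4d,)
    i = sigmoid(z[:d])                   # input gate
    f = sigmoid(z[d:2*d])                # forget gate
    o = sigmoid(z[2*d:3*d])              # output gate
    g = np.tanh(z[3*d:])                 # candidate update
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new
```

A GRU merges the input and forget gates into a single update gate and drops the separate memory cell, trading a little capacity for fewer parameters.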
Introduce a multimodal RNN architecture.
Bidirectional recurrent neural network. Related to another paper (Expressing an Image Stream with a Sequence of Natural Sentences) about blog image description generation.
Takeaway: use multiple saccades around the image to identify all entities, their mutual interactions, and the wider context before generating a description.
2 criteria: a global ranking objective over image-sentence pairs, and a fragment alignment objective that learns the appearance of sentence fragments.
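The global ranking objective can be sketched as a margin loss over a score matrix S, where S[i][j] is the compatibility of image i with sentence j (the margin value and the plain nested-loop form are illustrative):

```python
def ranking_loss(S, margin=1.0):
    """Hinge ranking loss: each matched score S[i][i] should beat every
    mismatched score S[i][j] in its row and column by at least `margin`."""
    n = len(S)
    loss = 0.0
    for i in range(n):
        for j in range(n):
            if j == i:
                continue
            loss += max(0.0, margin - S[i][i] + S[i][j])  # rank sentences for image i
            loss += max(0.0, margin - S[j][j] + S[i][j])  # rank images for sentence j
    return loss
```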
Unsupervised feature-learning methods can learn high-level features from completely unlabeled data.
Points: extremely large datasets; very large numbers of features.
Discovery: able to discover object-selective features with no labeled data, potentially performing better than basic supervised detectors.
1. Learn selective features with sparse coding and vector quantization (k-means), weighing the importance of encoding vs. training.
2. Combine selective features into invariant features (max-pooling).
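Steps 1-2 in miniature, assuming hard vector quantization (k-means assignment to precomputed centroids) for the selective features and max-pooling over neighboring patches for invariance:

```python
import numpy as np

def selective_features(patches, centroids):
    """Hard-VQ encoding: each patch activates only its nearest centroid."""
    dists = ((patches[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    acts = np.zeros_like(dists)
    acts[np.arange(len(patches)), dists.argmin(axis=1)] = 1.0
    return acts

def invariant_features(acts, pool=2):
    """Max-pool selective activations over groups of `pool` adjacent patches,
    making the pooled feature invariant to which patch in the group fired."""
    n = (len(acts) // pool) * pool
    return acts[:n].reshape(-1, pool, acts.shape[1]).max(axis=1)
```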
Introduces DT-RNN (dependency-tree RNN) models --> better abstraction from the details of word order and syntactic expression.
Use multimodal embeddings, semantic DT-RNNs.
1. Image learning: first trained with an unsupervised objective (on randomly sampled images from the web).
2. Multimodal mapping: inner products. Use ImageNet to adjust features.
Binary Network Series
- Binary Quantization
- Expectation Backpropagation: Parameter-Free Training of Multilayer Neural Networks with Continuous or Discrete Weights
- Backpropagation for Energy-Efficient Neuromorphic Computing
- BinaryConnect: Training Deep Neural Networks with Binary Weights During Propagations (not very successful on large-scale datasets)
- RedEye: Analog ConvNet Image Sensor Architecture for Continuous Mobile Vision
- Learning to Detect Unseen Object Classes by Between-Class Attribute Transfer
- Quantized Convolutional Neural Networks for Mobile Devices
- DEEP-CARVING: Discovering Visual Attributes by Carving Deep Neural Nets
- Expressing an Image Stream with a Sequence of Natural Sentences
- CryptoNets: Applying Neural Networks to Encrypted Data with High Throughput and Accuracy
- Understanding Deep Image Representations by Inverting Them