https://github.com/zwang4/awesome-machine-learning-in-compilers

  1. Compiler Autotuning through Multiple-phase Learning (TOSEM 24)

    To reduce the heavy runtime cost of autotuning, the authors propose a lightweight learning approach that uses a small amount of measured runtime data to predict the performance of a program compiled with various optimization flag combinations. To shrink the search space, they design a particle swarm algorithm that tunes compiler optimization flags against this prediction model. They evaluate the resulting approach, CompTuner, in an extensive experimental study on two popular C compilers, GCC and LLVM, with two widely used benchmark suites, cBench and PolyBench.
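
    Below is a minimal sketch of the idea of coupling a learned performance predictor with a discrete particle swarm over flag vectors. The flag list, the `predicted_speedup` surrogate, and the PSO hyperparameters are hypothetical stand-ins; CompTuner trains its predictor on a small set of measured runs and uses its own swarm variant.

    ```python
    import math
    import random

    # Hypothetical flag set; the real system tunes many more GCC/LLVM flags.
    FLAGS = ["-ftree-vectorize", "-funroll-loops", "-finline-functions",
             "-fomit-frame-pointer", "-fgcse-after-reload"]

    def predicted_speedup(flag_vector):
        # Stand-in for the learned performance predictor: score a 0/1 flag
        # vector without running the program.  Replace with a model trained
        # on a small number of measured (flags, runtime) samples.
        rng = random.Random(hash(tuple(flag_vector)))
        return sum(flag_vector) * 0.1 + rng.random()

    def binary_pso(num_particles=8, iterations=20, w=0.7, c1=1.5, c2=1.5):
        dim = len(FLAGS)
        pos = [[random.randint(0, 1) for _ in range(dim)] for _ in range(num_particles)]
        vel = [[0.0] * dim for _ in range(num_particles)]
        pbest = [p[:] for p in pos]
        pbest_score = [predicted_speedup(p) for p in pos]
        g = max(range(num_particles), key=lambda i: pbest_score[i])
        gbest_pos, gbest_score = pbest[g][:], pbest_score[g]

        for _ in range(iterations):
            for i in range(num_particles):
                for d in range(dim):
                    r1, r2 = random.random(), random.random()
                    vel[i][d] = (w * vel[i][d]
                                 + c1 * r1 * (pbest[i][d] - pos[i][d])
                                 + c2 * r2 * (gbest_pos[d] - pos[i][d]))
                    # Sigmoid transfer turns the velocity into a flip probability.
                    pos[i][d] = 1 if random.random() < 1 / (1 + math.exp(-vel[i][d])) else 0
                score = predicted_speedup(pos[i])
                if score > pbest_score[i]:
                    pbest[i], pbest_score[i] = pos[i][:], score
                    if score > gbest_score:
                        gbest_pos, gbest_score = pos[i][:], score
        return [f for f, bit in zip(FLAGS, gbest_pos) if bit], gbest_score

    print(binary_pso())
    ```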

  2. Layer-wise Exploration of a Neural Processing Unit Compiler's Optimization Space (ICCTA 24)

    To address the huge parameter space, the authors propose a greedy algorithm that iterates through the convolutional layers of the network while preserving a set of candidate solutions for the preceding layers. They evaluate this approach by transforming the graphs of several popular neural networks to optimize their performance and memory footprint, mapping them onto an experimental embedded NPU developed by STMicroelectronics with its associated neural network compiler.
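
    A minimal sketch of the greedy layer-wise idea, assuming hypothetical per-layer configuration options, a placeholder cost model, and a small kept-solution set:

    ```python
    # Greedy, layer-by-layer exploration that keeps a small set of candidate
    # solutions for the layers seen so far.  Options and cost are placeholders.
    LAYER_OPTIONS = [
        [{"tile": 8}, {"tile": 16}],     # layer 0 choices
        [{"tile": 16}, {"tile": 32}],    # layer 1 choices
        [{"tile": 8}, {"tile": 32}],     # layer 2 choices
    ]

    def cost(partial_config):
        # Placeholder for the compiler's latency/memory estimate of a partial
        # schedule covering the first len(partial_config) layers.
        return sum(abs(c["tile"] - 16) for c in partial_config) + len(partial_config)

    def greedy_layerwise(keep=2):
        frontier = [[]]                              # kept partial solutions
        for options in LAYER_OPTIONS:                # iterate through the layers
            candidates = [sol + [opt] for sol in frontier for opt in options]
            candidates.sort(key=cost)
            frontier = candidates[:keep]             # prune to the best prefixes
        return frontier[0], cost(frontier[0])

    print(greedy_layerwise())
    ```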

  3. oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation (CGO 24)

    oneDNN Graph Compiler is a tensor compiler that takes a hybrid approach, combining compiler optimization techniques with expert-tuned kernels to generate high-performance code for deep neural network graphs. It addresses optimization challenges unique to the deep learning domain, such as low-precision computation, aggressive fusion of graph operations, optimization for static tensor shapes and memory layouts, constant weight optimization, and memory buffer reuse.
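
    As a rough illustration of one of these optimizations, the sketch below fuses elementwise post-ops into the producing matmul and dispatches a recognized pattern to an expert-tuned kernel, falling back to generated code otherwise. The node format and kernel names are invented for illustration, not oneDNN Graph Compiler's actual IR or API.

    ```python
    # Toy post-op fusion pass over a tiny op graph.
    graph = [
        {"op": "matmul", "name": "mm0"},
        {"op": "bias_add", "name": "b0", "input": "mm0"},
        {"op": "relu", "name": "r0", "input": "b0"},
    ]

    ELEMENTWISE = {"bias_add", "relu", "gelu"}

    def fuse_postops(nodes):
        fused, anchor = [], None
        for n in nodes:
            if n["op"] in ELEMENTWISE and anchor is not None \
                    and n.get("input") == anchor["name"]:
                anchor.setdefault("post_ops", []).append(n["op"])
                anchor["name"] = n["name"]     # fused node now produces n's output
            else:
                fused.append(n)
                anchor = n if n["op"] == "matmul" else None
        return fused

    def lower(node):
        if node["op"] == "matmul" and node.get("post_ops") == ["bias_add", "relu"]:
            return "call expert-tuned matmul+bias+relu kernel"   # hand-tuned path
        return f"generate code for {node['op']} with post-ops {node.get('post_ops', [])}"

    for n in fuse_postops(graph):
        print(lower(n))
    ```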

  4. DNNOPT: A Framework for Efficiently Selecting On-chip Memory Loop Optimizations of DNN Accelerators (CF 24)

    This work proposes a framework for analyzing the flow of values and their reuse in loop nests in order to minimize data traffic under the constraints of limited on-chip memory capacity and data dependences.
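
    A toy version of the underlying trade-off, assuming a tiled matmul, a simple word-granularity traffic model, and a made-up on-chip capacity (none of which come from the paper):

    ```python
    import itertools

    # Estimate off-chip traffic for a tiled matmul C[M,N] += A[M,K] * B[K,N]
    # and pick tile sizes that fit on-chip memory.  The model assumes the A, B,
    # and C tiles are resident on chip and C accumulates on chip over the K loop.
    M, N, K = 512, 512, 512
    ONCHIP_WORDS = 32 * 1024            # hypothetical on-chip memory capacity

    def footprint(tm, tn, tk):
        return tm * tk + tk * tn + tm * tn       # A-tile + B-tile + C-tile

    def offchip_traffic(tm, tn, tk):
        loads_a = (M * K) * (N // tn)            # A reloaded once per column tile
        loads_b = (K * N) * (M // tm)            # B reloaded once per row tile
        c_traffic = 2 * M * N                    # C read and written once per tile
        return loads_a + loads_b + c_traffic

    best = None
    for tm, tn, tk in itertools.product([16, 32, 64, 128], repeat=3):
        if footprint(tm, tn, tk) > ONCHIP_WORDS:
            continue                             # violates the capacity constraint
        t = offchip_traffic(tm, tn, tk)
        if best is None or t < best[0]:
            best = (t, (tm, tn, tk))

    print("best tiling (traffic, (tm, tn, tk)):", best)
    ```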

  5. DeFiNES: Enabling Fast Exploration of the Depth-first Scheduling Space for DNN Accelerators through Analytical Modeling. (HPCA 2023)

    After formalizing this design space, this work proposes a unified modeling framework, DeFiNES, that covers both layer-by-layer and depth-first scheduling. DeFiNES analytically estimates the hardware cost of possible schedules in terms of both energy and latency, while considering data access at every memory level. This is done for each schedule and HW architecture under study by optimally choosing the active part of the memory hierarchy per unique combination of operand, layer, and feature map tile. The hardware costs are estimated taking into account both data computation and data copy phases. The analytical cost model is validated against measured data from a taped-out depth-first DNN accelerator, DepFiN, showing good modeling accuracy at the end-to-end neural network level.
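
    The sketch below illustrates the general shape of such an analytical model: per-memory-level access counts translate into energy and latency estimates. All level parameters and access counts are made-up placeholders, not DeFiNES's actual model.

    ```python
    # Toy analytical cost model: energy is accesses times per-access energy,
    # latency is bounded by compute or the slowest memory level.
    MEMORY_LEVELS = [
        # (name, energy per word access [pJ], bandwidth [words/cycle])
        ("regs",  0.1, 64),
        ("SPM",   1.0, 16),
        ("DRAM", 100.0, 4),
    ]

    def estimate(level_accesses, mac_count, macs_per_cycle=256, energy_per_mac=0.5):
        energy = mac_count * energy_per_mac
        latency = mac_count / macs_per_cycle          # ideal compute cycles
        for (name, e_acc, bw), accesses in zip(MEMORY_LEVELS, level_accesses):
            energy += accesses * e_acc
            latency = max(latency, accesses / bw)     # slowest phase dominates
        return energy, latency

    # Example: a depth-first tile schedule keeps intermediates in SPM and trades
    # DRAM accesses for extra SPM traffic.
    layer_by_layer = estimate([4e6, 2e6, 1e6], mac_count=1e7)
    depth_first    = estimate([4e6, 3e6, 2e5], mac_count=1e7)
    print("layer-by-layer (energy, cycles):", layer_by_layer)
    print("depth-first   (energy, cycles):", depth_first)
    ```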

  6. NeoFlow: A Flexible Framework for Enabling Efficient Compilation for High Performance DNN Training (TPDS 2022)

    NeoFlow allows programmers to write customized expressions directly as new operators, which are automatically mapped to graph representations and low-level implementations, providing both high programming productivity and high performance. First, NeoFlow provides expression-based automatic differentiation to support customized model definitions with new operators. Then, it proposes an efficient compilation system that partitions the neural network graph into subgraphs, explores optimized schedules, and automatically generates high-performance libraries for the subgraphs. Finally, it develops an efficient runtime system that combines compilation and training by overlapping their execution. The experiments examine both the numerical accuracy and the performance of NeoFlow.
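
    The expression-based automatic differentiation idea can be illustrated with a tiny reverse-mode sketch over user-defined expressions; this is an illustration of the concept only, not NeoFlow's actual IR or API.

    ```python
    # Minimal reverse-mode autodiff: user-written expressions become graph nodes
    # whose gradients are derived automatically.
    class Expr:
        def __init__(self, value, parents=(), grad_fns=()):
            self.value, self.parents, self.grad_fns, self.grad = value, parents, grad_fns, 0.0

        def __add__(self, other):
            return Expr(self.value + other.value, (self, other),
                        (lambda g: g, lambda g: g))

        def __mul__(self, other):
            return Expr(self.value * other.value, (self, other),
                        (lambda g, o=other: g * o.value, lambda g, s=self: g * s.value))

        def backward(self, grad=1.0):
            self.grad += grad
            for parent, grad_fn in zip(self.parents, self.grad_fns):
                parent.backward(grad_fn(grad))

    # A "customized operator" written as an expression: f(x, w) = x * w + w
    x, w = Expr(3.0), Expr(2.0)
    y = x * w + w
    y.backward()
    print(y.value, x.grad, w.grad)   # 8.0, df/dx = 2.0, df/dw = x + 1 = 4.0
    ```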

  7. All you need is superword-level parallelism: systematic control-flow vectorization with SLP (PLDI 2022)

    Larsen and Amarasinghe originally proposed SLP vectorization (together with loop unrolling) as a simpler, more flexible alternative to traditional loop vectorization. However, the vision of replacing traditional loop vectorization has not been realized because SLP vectorization cannot directly reason about control flow. This paper presents a new vectorization framework that generalizes SLP vectorization to uncover parallelism spanning different basic blocks and loop nests. By systematically vectorizing instructions across control-flow regions such as basic blocks and loops, the framework simultaneously subsumes the roles of inner-loop, outer-loop, and straight-line vectorizers while retaining the flexibility of SLP vectorization (e.g., partial vectorization).
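
    For context, the sketch below shows classic straight-line SLP packing of adjacent isomorphic, independent instructions; the paper's contribution is generalizing exactly this step across basic blocks and loop nests. The tuple IR is invented for illustration.

    ```python
    # Toy SLP packer over a straight-line block of (dst, op, src1, src2) tuples.
    block = [
        ("t0", "add", "a0", "b0"),
        ("t1", "add", "a1", "b1"),
        ("t2", "mul", "t0", "c0"),
        ("t3", "mul", "t1", "c1"),
    ]

    def independent(i1, i2):
        # i2 must not use i1's result (no def-use dependence between lanes).
        return i1[0] not in (i2[2], i2[3])

    def slp_pack(insts):
        packed, i = [], 0
        while i < len(insts):
            if (i + 1 < len(insts)
                    and insts[i][1] == insts[i + 1][1]       # isomorphic opcode
                    and independent(insts[i], insts[i + 1])):
                a, b = insts[i], insts[i + 1]
                packed.append((f"v_{a[0]}_{b[0]}", "v" + a[1],
                               (a[2], b[2]), (a[3], b[3])))  # 2-wide vector op
                i += 2
            else:
                packed.append(insts[i])
                i += 1
        return packed

    for inst in slp_pack(block):
        print(inst)
    ```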

  8. moTuner: a compiler-based auto-tuning approach for mixed-precision operators (CF 22)

    moTuner is an automatic framework for efficiently tuning mixed-precision operators. It works at the compiler level to enable mixed-precision computation automatically, without any manual modification of the source code or the operator library, which significantly reduces the programming burden. Because it operates during compilation, moTuner is more widely applicable and requires less per-library effort. Furthermore, moTuner adopts an optimized search strategy that effectively narrows down the configuration space during tuning.
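
    A minimal sketch of the flavor of such a tuning loop, assuming a hypothetical per-operator precision choice, a placeholder error/speedup probe, and a greedy fastest-first search (the details differ from moTuner's actual strategy):

    ```python
    import random

    OPERATORS = ["gemm_0", "gemm_1", "gemm_2"]
    CANDIDATES = ["fp32", "fp16", "int8"]       # ordered from safest to fastest

    def run_with_config(config):
        # Placeholder for compiling/running the instrumented program and
        # returning (relative error, speedup) for the chosen precisions.
        random.seed(str(sorted(config.items())))
        err = sum({"fp32": 0.0, "fp16": 0.004, "int8": 0.02}[p] for p in config.values())
        speedup = 1.0 + sum({"fp32": 0.0, "fp16": 0.3, "int8": 0.6}[p] for p in config.values())
        return err + random.uniform(0, 1e-3), speedup

    def tune(tolerance=0.01):
        config = {op: "fp32" for op in OPERATORS}
        for op in OPERATORS:                        # tune one operator at a time,
            for precision in reversed(CANDIDATES):  # trying the fastest precision first
                trial = dict(config, **{op: precision})
                err, _ = run_with_config(trial)
                if err <= tolerance:
                    config = trial                  # accept: error still acceptable
                    break
        return config, run_with_config(config)

    print(tune())
    ```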

  9. BaCO: A Fast and Portable Bayesian Compiler Optimization Framework (ASPLOS 23)

    BaCO is a general-purpose autotuner for modern compilers targeting CPUs, GPUs, and FPGAs. It handles permutation, ordered, and continuous parameter types along with both known and unknown parameter constraints. To reason about these parameter types and efficiently deliver high-quality code, BaCO uses Bayesian optimization algorithms specialized for the autotuning domain.
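
    The sketch below shows a surrogate-guided autotuning loop over the three parameter types the paper targets. The nearest-neighbour predictor and distance-based exploration bonus are crude stand-ins for BaCO's Bayesian surrogate models and acquisition functions; the parameters and cost function are invented.

    ```python
    import random

    def sample_config(rng):
        return {
            "loop_order": tuple(rng.sample(["i", "j", "k"], 3)),   # permutation
            "tile": rng.choice([8, 16, 32, 64]),                   # ordered
            "unroll_factor": rng.uniform(1.0, 8.0),                # continuous
        }

    def distance(a, b):
        return (float(a["loop_order"] != b["loop_order"])
                + abs(a["tile"] - b["tile"]) / 64.0
                + abs(a["unroll_factor"] - b["unroll_factor"]) / 8.0)

    def measure(cfg):
        # Placeholder for compiling and timing the generated code (lower = faster).
        return (abs(cfg["tile"] - 32) / 32.0 + abs(cfg["unroll_factor"] - 4.0)
                + (0.0 if cfg["loop_order"][0] == "i" else 0.5))

    def autotune(budget=30, seed=0):
        rng = random.Random(seed)
        history = [(c, measure(c)) for c in (sample_config(rng) for _ in range(5))]
        for _ in range(budget - len(history)):
            candidates = [sample_config(rng) for _ in range(50)]

            def acquisition(cfg):
                nearest_cfg, nearest_val = min(history, key=lambda h: distance(cfg, h[0]))
                return nearest_val - 0.2 * distance(cfg, nearest_cfg)  # explore a bit
            best_candidate = min(candidates, key=acquisition)
            history.append((best_candidate, measure(best_candidate)))
        return min(history, key=lambda h: h[1])

    print(autotune())
    ```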

  10. Revealing Compiler Heuristics Through Automated Discovery and Optimization (CGO 24)

    Manually discovering all of the heuristics hidden among millions of lines of compiler code and exposing them to auto-tuning tools is a Herculean task that is simply not practical; what is needed is a method of finding these heuristics automatically to extract every last drop of potential optimization. This paper presents a framework that automatically identifies potential heuristics in the compiler that are highly profitable optimization targets, and then automatically finds the available tuning parameters for those heuristics with minimal human involvement.

  11. Pin or Fuse? Exploiting Scratchpad Memory to Reduce Off-Chip Data Transfer in DNN Accelerators (CGO 23)

    Most ASIC accelerators are equipped with compiler-controlled scratchpad memory (SPM) used as a last-level cache to reduce the number of accesses to off-chip memory. A widely used strategy for utilizing SPM is fused-layer execution, which divides a DNN model into groups of layers and forwards the intermediate results within each group without evicting them to off-chip memory. However, layer fusion has an inherent limitation: fusing consecutive layers increases the amount of computation, leading to sub-optimal performance. This paper introduces a new dimension to SPM usage that temporarily pins a feature map in SPM. Pinning reduces off-chip transfer without increasing computation, but it is not applicable to all feature maps due to the limited SPM size. The authors find that a combination of pinning and fusion achieves superior performance on MobileNet. Based on this observation, they propose a model-level optimization method that jointly applies pinning and fusion to minimize inference latency under memory constraints, together with scheduling and allocation schemes for automatic generation of optimized code.
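
    A toy version of the pinning decision, assuming made-up feature-map sizes, an SPM capacity, and the simplification that pinning a map saves exactly one off-chip write and one read:

    ```python
    from itertools import combinations

    # Pinning a feature map in SPM avoids spilling it off chip between its
    # producer and consumer, but every pinned map must fit in SPM at once.
    FEATURE_MAPS = {"fm0": 96, "fm1": 160, "fm2": 64, "fm3": 48}   # sizes in KB
    SPM_KB = 256

    def best_pinning(feature_maps, capacity):
        best_saved, best_set = 0, ()
        names = list(feature_maps)
        for r in range(len(names) + 1):
            for subset in combinations(names, r):
                size = sum(feature_maps[n] for n in subset)
                if size > capacity:
                    continue                                  # does not fit in SPM
                saved = 2 * size                              # skipped write + read
                if saved > best_saved:
                    best_saved, best_set = saved, subset
        return best_set, best_saved

    pins, saved_kb = best_pinning(FEATURE_MAPS, SPM_KB)
    print(f"pin {pins}: {saved_kb} KB of off-chip transfer avoided")
    # A fused-layer schedule could avoid transfers for the remaining maps at the
    # cost of recomputation; the paper's optimizer chooses pinning and fusion
    # jointly under the same SPM budget.
    ```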

  12. DLAS: An Exploration and Assessment of the Deep Learning Acceleration Stack (2023)

    Significant bodies of work from both the machine learning and systems communities have proposed optimizations to accelerate DNNs. This paper frames these efforts as a Deep Learning Acceleration Stack (DLAS) and argues that DLAS can be a valuable concept for exploring the next generation of co-designed, accelerated deep learning solutions.

  13. DietCode: Automatic Optimization for Dynamic Tensor Programs (MLSys 22)

    DietCode is a new auto-scheduler framework that efficiently supports dynamic-shape workloads by constructing a shape-generic search space and cost model. Under this construction, all shapes are searched jointly within the same space and update the same cost model during auto-scheduling, which makes it more efficient than existing auto-schedulers.
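
    The shape-generic idea can be sketched as a set of micro-kernel candidates scored by one shared cost table that all dynamic shapes update jointly; the kernels, shapes, and cost formula below are invented placeholders.

    ```python
    # One search space (micro-kernels) and one cost table shared by all shapes,
    # instead of auto-scheduling each concrete shape separately.
    MICRO_KERNELS = [(8, 8), (16, 16), (32, 8), (64, 4)]              # (tile_m, tile_n)
    DYNAMIC_SHAPES = [(33, 128), (65, 128), (120, 128), (257, 128)]   # (M, N)

    def cost(shape, kernel):
        # Padding waste: shapes that do not divide the tile evenly waste work.
        (m, n), (tm, tn) = shape, kernel
        padded = ((m + tm - 1) // tm) * tm * ((n + tn - 1) // tn) * tn
        waste = padded / (m * n)
        # Occupancy proxy: very small tiles underuse the hardware.
        occupancy_penalty = 64.0 / (tm * tn)
        return waste + occupancy_penalty

    # Shared cost table: every shape contributes to the same kernel scores.
    shared_score = {k: sum(cost(s, k) for s in DYNAMIC_SHAPES) for k in MICRO_KERNELS}
    print("best shared micro-kernel:", min(shared_score, key=shared_score.get))

    # At run time, each concrete shape can still pick its own best kernel from
    # the same jointly built table of per-kernel costs.
    for shape in DYNAMIC_SHAPES:
        print(shape, "->", min(MICRO_KERNELS, key=lambda k: cost(shape, k)))
    ```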