a unified method for neural network compression

2025-10-02

my master's thesis explored how to make deep learning models practical on embedded hardware like fpgas, where memory and compute are scarce
i worked on three classic compression families: quantization, pruning, and low-rank approximation, and proposed a unified method that lets a network learn how much to quantize, what to prune, and when to factorize, all inside a single training loop

the goal: simpler tuning and stable, high compression that maps directly to fpga costs

the three compression techniques

1. quantization

represent weights and activations with fewer bits $b$, cutting memory footprint and replacing floating point ops with integer ones
on fpgas this directly lowers cost

canonical uniform quantization (symmetric)

$$Q(x) = \text{clamp}\!\left(\left\lfloor \frac{x}{S} \right\rceil, \tau_l(b), \tau_h(b)\right), \quad S = \frac{\beta - \alpha}{2^b - 1}$$

with integer ranges $(\tau_l, \tau_h)$
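a minimal pytorch sketch of this quantizer, just to make the formula concrete; the function name, the symmetric signed range, and the simulated quantize-then-dequantize style are my assumptions, not code from the thesis

```python
import torch

def uniform_quantize(x: torch.Tensor, b: int, alpha: float, beta: float) -> torch.Tensor:
    # step size S = (beta - alpha) / (2^b - 1), as in the formula above
    S = (beta - alpha) / (2 ** b - 1)
    # symmetric signed integer range (tau_l, tau_h) for b bits (an assumption here)
    tau_h = 2 ** (b - 1) - 1
    tau_l = -tau_h
    # round to the nearest grid point, clamp, then de-quantize for simulated quantization
    q = torch.clamp(torch.round(x / S), tau_l, tau_h)
    return q * S

# e.g. 8-bit quantization of a weight tensor over its own range
w = torch.randn(128, 64)
w_q = uniform_quantize(w, b=8, alpha=-w.abs().max().item(), beta=w.abs().max().item())
```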

my variant keeps bit widths $b$ as trainable floats and prunes when $b < \varepsilon$

$$x_q = \alpha \cdot Q_b\!\left(\frac{x}{\alpha}\right), \quad Q_b(x) = \begin{cases} \text{clamp}(\lfloor x \rceil, \tau_l(b), \tau_h(b)), & b \ge \varepsilon \\ 0, & b < \varepsilon \end{cases}$$

this trick makes compression differentiable and stable during training
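a sketch of how the trainable bit width idea could be wired up in pytorch, with a straight-through estimator for the rounding step; the class name, the STE, the clamp to [-1, 1], and the symmetric level count are my assumptions about an implementation, not the thesis code

```python
import torch
import torch.nn as nn

def ste_round(x: torch.Tensor) -> torch.Tensor:
    # straight-through estimator: round in the forward pass, identity gradient in the backward pass
    return x + (torch.round(x) - x).detach()

class SoftBitQuantizer(nn.Module):
    """quantizer with a trainable float bit width b; the tensor counts as pruned once b < eps."""

    def __init__(self, init_bits: float = 8.0, eps: float = 1.0):
        super().__init__()
        self.b = nn.Parameter(torch.tensor(init_bits))  # trainable bit width
        self.eps = eps

    def forward(self, x: torch.Tensor, alpha: float) -> torch.Tensor:
        if self.b.item() < self.eps:
            # below the threshold the whole tensor is treated as pruned
            return torch.zeros_like(x)
        # number of positive levels derived from the (float) bit width
        tau_h = 2.0 ** (self.b - 1.0) - 1.0
        scaled = torch.clamp(x / alpha, -1.0, 1.0)
        # gradients reach b through tau_h, and reach x through the straight-through round
        q = ste_round(scaled * tau_h) / tau_h
        return alpha * q
```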

(figure: uniform vs non-uniform quantization levels)

2. pruning

remove parameters that matter least for the task

  • unstructured: drops individual weights, which yields high sparsity that hardware rarely exploits
  • structured: removes whole neurons, channels, or filters, which leads to real speedups on fpgas

criteria include magnitude ($L_p$ norms) and APoZ (average percentage of zeros), among others
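both criteria are easy to sketch for a conv layer; the tensor shapes, the relu-zeros reading of APoZ, and the keep-top-k policy below are illustrative choices of mine

```python
import torch

def l1_channel_scores(weight: torch.Tensor) -> torch.Tensor:
    # L1 magnitude per output channel of a conv weight shaped (out, in, kh, kw)
    return weight.abs().flatten(1).sum(dim=1)

def apoz_scores(activations: torch.Tensor) -> torch.Tensor:
    # average percentage of zeros per channel on post-relu activations shaped (n, c, h, w);
    # a high APoZ means the channel rarely fires, so it is a good prune candidate
    return (activations == 0).float().mean(dim=(0, 2, 3))

# structured pruning by magnitude: keep the output channels with the largest L1 norm
w = torch.randn(64, 32, 3, 3)
keep = l1_channel_scores(w).topk(k=48).indices   # e.g. keep 48 of 64 channels
w_pruned = w[keep]
```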

(figure: unstructured vs structured pruning strategies)

3. low-rank approximation

factorize a big weight tensor into smaller ones, for example with svd
this cuts parameters and multiply-adds from $mn$ to $k(m+n)$ when $k \ll \min(m,n)$

$$W_{m \times n} \approx U_{m \times k}\, S_{k \times k}\, V^\top_{k \times n}$$
svd factorization of the weight matrix
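a sketch of that factorization for a fully connected layer in pytorch; the helper name and the choice to fold the singular values into the left factor are mine

```python
import torch
import torch.nn as nn

def factorize_linear(layer: nn.Linear, k: int) -> nn.Sequential:
    # replace one m x n linear layer with two layers of rank k (n -> k -> m)
    W = layer.weight.data                            # shape (m, n)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_k = U[:, :k] * S[:k]                           # (m, k), singular values folded in
    V_k = Vh[:k, :]                                  # (k, n)
    first = nn.Linear(layer.in_features, k, bias=False)
    second = nn.Linear(k, layer.out_features, bias=layer.bias is not None)
    first.weight.data.copy_(V_k)
    second.weight.data.copy_(U_k)
    if layer.bias is not None:
        second.bias.data.copy_(layer.bias.data)
    return nn.Sequential(first, second)
```

the two smaller layers hold $k(m+n)$ weights instead of $mn$, which is exactly the saving quoted above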

why combine them under one training objective

mixing methods naively is order sensitive: prune-then-quantize is not the same as quantize-then-prune
my thesis reframed pruning and low-rank factorization as emerging from the quantization view, all driven by trainable bit widths

unified objective

$$\min_\theta\ (1-\lambda)\, \mathcal{L}(\theta; x, y) + \lambda\, R(\theta_B)$$

where $\theta_B$ are the trainable bit widths
the polarization regularizer

$$R(B) = \gamma\lvert B\rvert_2 + t\lvert B\rvert_1 - \lvert B - \bar{B}\rvert_1$$

pushes unimportant elements toward zero while clustering the rest, yielding a clean keep or drop separation
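a sketch of the objective and regularizer in pytorch, reading $\bar{B}$ as the mean of the bit widths (my assumption); the function names and how B would be gathered from the quantizers are illustrative

```python
import torch

def polarization(B: torch.Tensor, gamma: float, t: float) -> torch.Tensor:
    # R(B) = gamma*|B|_2 + t*|B|_1 - |B - mean(B)|_1
    # the negative term rewards bit widths that move away from the mean, so values split
    # into a near-zero group (pruned) and a surviving cluster
    return gamma * B.norm(p=2) + t * B.abs().sum() - (B - B.mean()).abs().sum()

def unified_loss(task_loss: torch.Tensor, B: torch.Tensor, lam: float, gamma: float, t: float) -> torch.Tensor:
    # (1 - lambda) * task loss + lambda * R(B), with B the vector of trainable bit widths
    return (1.0 - lam) * task_loss + lam * polarization(B, gamma, t)
```

in practice B would be a stack of every trainable bit width in the network, e.g. torch.stack([q.b for q in quantizers]) if the quantizers are modules like the sketch above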

why this matters for fpgas: costs scale with bit width
layer cost in bit operations (bops) is

$$\text{BOPS}(l) = \text{FLOPS}(l) \cdot b_w \cdot b_a$$

so learning $b_w$ and $b_a$ while pruning structure attacks cost directly
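as a toy example, the bops of one conv layer could be counted like this, treating $\text{FLOPS}(l)$ as the layer's multiply-accumulate count (my reading); the function and its arguments are illustrative

```python
def conv_bops(c_in: int, c_out: int, k: int, h_out: int, w_out: int, b_w: float, b_a: float) -> float:
    # multiply-accumulates of a k x k conv producing an (h_out, w_out) map, scaled by bit widths
    macs = c_in * c_out * k * k * h_out * w_out
    return macs * b_w * b_a

# e.g. a 3x3 conv, 64 -> 128 channels, 32x32 output, 4-bit weights, 8-bit activations
print(conv_bops(64, 128, 3, 32, 32, b_w=4, b_a=8))
```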

the unified pipeline

  1. optionally decompose layers to expose rank options
  2. train with float bit widths and the polarization regularizer, auto-pruning when $b < \varepsilon$
  3. recompose if factorization is not beneficial
  4. uniformize per-layer settings, round and freeze the bit widths, quick fine tune, then deploy (see the sketch below)
(figure: compression pipeline workflow, baseline → decompose → train + prune ($b < \varepsilon$) → recompose → uniformize → deploy)
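step 4 is the part that is easiest to show in code: rounding and freezing the learned bit widths before the final fine tune; the function name and the flat list of bit width parameters are my assumptions

```python
import torch

def uniformize_and_freeze(bit_widths: list, eps: float = 1.0) -> list:
    # round the surviving float bit widths to integers, zero out the pruned ones,
    # and stop training them before the quick fine tune
    frozen = []
    for b in bit_widths:                    # each b is an nn.Parameter holding one bit width
        bits = 0 if b.item() < eps else int(round(b.item()))
        b.data.fill_(float(bits))
        b.requires_grad_(False)
        frozen.append(bits)
    return frozen
```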

challenges

getting the penalty for bit widths right is critical
too weak and the model never compresses
too strong and it collapses and loses accuracy
balancing the task loss and the compression regularizer required careful sweeps and multiple experiments

results

on lenet-5 with mnist, the unified method held accuracy (even gaining 0.01 percent) while using only about 0.78 percent of the original bops
on vgg-7 with cifar-10 it remained competitive with mixed-precision baselines

summary

  • quantization reduces precision
  • pruning (unstructured or structured) removes parameters
  • low-rank approximation factors heavy layers
  • the method unifies them under trainable bit widths with polarization so pruning and rank selection fall out naturally
  • this maps cleanly to fpga costs via bops and avoids brittle order dependent pipelines

patent

this work was done jointly with Benoit Porteboeuf, who was my tutor throughout the internship and provided invaluable guidance
it resulted in a patent filing that covers the unified compression framework
the patent describes the method for jointly learning quantization levels, pruning decisions, and low-rank factorizations during neural network training, which makes it easier to deploy compressed models on resource-constrained hardware