my master's thesis explored how to make deep learning models practical on embedded hardware like fpgas, where memory and compute are scarce. i worked on three classic compression families: quantization, pruning, and low-rank approximation, and proposed a unified method that lets a network learn how much to quantize, what to prune, and when to factorize, all inside a single training loop. the goal: simpler tuning and stable, high compression that maps directly to fpga costs.
the three compression techniques
1. quantization
represent weights and activations with fewer bits (b), cutting the memory footprint and replacing floating-point ops with integer ones. on fpgas this directly lowers resource cost.
the canonical symmetric uniform quantizer:
Q(x) = \text{clamp}\left(\left\lfloor \frac{x}{S} \right\rceil, \tau_l(b), \tau_h(b)\right), \quad S = \frac{\beta - \alpha}{2^b - 1}
with an integer clipping range ((\tau_l, \tau_h)) determined by the bit width (b)
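as a concrete sketch (the symmetric range choice, `alpha`, and `b=4` are illustrative, not values from the thesis):

```python
import numpy as np

def uniform_quantize(x, b, alpha=1.0):
    """symmetric uniform quantizer: Q(x) = clamp(round(x / S), tau_l, tau_h),
    with range [-alpha, alpha]."""
    S = 2 * alpha / (2**b - 1)                      # step size
    tau_l, tau_h = -(2**(b - 1)), 2**(b - 1) - 1    # signed integer range
    q = np.clip(np.round(x / S), tau_l, tau_h)      # integer codes
    return q * S                                    # dequantized values

x = np.array([-1.0, -0.3, 0.0, 0.4, 1.0])
xq = uniform_quantize(x, b=4)  # every value snaps to one of 2^4 levels
```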
my variant keeps the bit widths (b) as trainable floats and prunes whenever (b < \varepsilon):
x_q = \alpha \cdot Q_b\left(\frac{x}{\alpha}\right), \quad Q_b(x) = \begin{cases} \text{clamp}(\lfloor x \rceil, \tau_l(b), \tau_h(b)), & b \ge \varepsilon \\ 0, & b < \varepsilon \end{cases}
this trick makes compression differentiable and stable during training
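a forward-pass sketch of that gate (the straight-through gradient handling a real training loop needs is omitted, and `eps` plus the signed integer range are my own illustrative choices):

```python
import numpy as np

def Q_b(x, b, eps=0.5):
    """bit-width-gated quantizer: when the learned float bit width b has
    collapsed below eps, the element is pruned (output 0); otherwise
    clamp-and-round to the integer range for round(b) bits."""
    if b < eps:
        return np.zeros_like(x)                     # pruned
    bi = max(1, int(round(b)))                      # effective integer bits
    tau_l, tau_h = -(2**(bi - 1)), 2**(bi - 1) - 1
    return np.clip(np.round(x), tau_l, tau_h)

def quantize(x, b, alpha=1.0):
    """x_q = alpha * Q_b(x / alpha)"""
    return alpha * Q_b(x / alpha, b)
```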
2. pruning
remove parameters that matter least for the task
- unstructured: drops individual weights, which yields high sparsity, but hardware rarely exploits it
- structured: removes whole neurons, channels, or filters, which leads to real speedups on fpgas
criteria include magnitude ((L_p) norms) and APoZ (Average Percentage of Zeros), among others
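for illustration (the `keep_ratio`, the 2-d weight shape, and the helper names are my own, not from the thesis), both criteria can be sketched as:

```python
import numpy as np

def magnitude_prune_channels(W, keep_ratio=0.5, p=1):
    """structured pruning sketch: rank output channels of W (shape
    [out, in]) by their L_p norm and zero out the weakest ones."""
    scores = np.linalg.norm(W, ord=p, axis=1)         # per-channel L_p norm
    k = max(1, int(round(keep_ratio * W.shape[0])))   # channels to keep
    mask = np.zeros(W.shape[0], dtype=bool)
    mask[np.argsort(scores)[-k:]] = True              # keep the strongest
    return W * mask[:, None], mask

def apoz(activations):
    """APoZ: average fraction of zeros per channel over a batch of
    post-relu activations with shape [batch, channels]."""
    return (activations == 0).mean(axis=0)
```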
3. low-rank approximation
factorize a big weight tensor into smaller ones, for example via truncated svd. for a dense layer this cuts the parameters and multiply-adds from (mn) to (k(m+n)) when the rank (k \ll \min(m,n))
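a minimal sketch with numpy's svd (the shapes and the rank-10 toy matrix are illustrative):

```python
import numpy as np

def low_rank_factorize(W, k):
    """truncated-svd sketch: approximate W (m x n) as A @ B with
    A (m x k) and B (k x n), cutting parameters from m*n to k*(m+n)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :k] * s[:k]          # absorb singular values into A
    B = Vt[:k, :]
    return A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 10)) @ rng.standard_normal((10, 128))  # rank 10
A, B = low_rank_factorize(W, k=10)
# A @ B reconstructs W with 10*(64+128) parameters instead of 64*128
```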
why combine them under one training objective
mixing methods naively is order sensitive: prune-then-quantize is not the same as quantize-then-prune. my thesis reframed pruning and low-rank factorization as emerging from the quantization view, all driven by trainable bit widths.
unified objective
\min_\theta (1-\lambda) \mathcal{L}(\theta; x, y) + \lambda R(\theta_B)
where (\theta_B) are the trainable bit widths
the polarization regularizer
R(B) = \gamma\lVert B\rVert_2 + t\lVert B\rVert_1 - \lVert B - \bar{B}\rVert_1
pushes unimportant elements toward zero while clustering the rest, yielding a clean keep or drop separation
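numerically (the `gamma` and `t` values here are illustrative placeholders, not the thesis hyperparameters):

```python
import numpy as np

def polarization_reg(B, gamma=1e-4, t=1.0):
    """polarization regularizer sketch:
    R(B) = gamma*||B||_2 + t*||B||_1 - ||B - mean(B)||_1.
    the subtracted term rewards spread around the mean, so entries
    polarize toward either 0 or a shared nonzero value."""
    B = np.asarray(B, dtype=float)
    return (gamma * np.linalg.norm(B, 2)
            + t * np.abs(B).sum()
            - np.abs(B - B.mean()).sum())

# a polarized vector scores lower than a flat one with the same L1 mass
polarization_reg([0, 0, 8, 8]) < polarization_reg([4, 4, 4, 4])  # True
```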
why this matters for fpgas: costs scale with bit width
layer cost in bit operations (bops) is
\text{BOPS}(l) = \text{FLOPS}(l) \cdot b_w \cdot b_a
so learning (b_w) and (b_a) while pruning structure attacks this cost directly
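the formula is simple enough to read off directly (the flop count below is a toy number):

```python
def bops(flops, b_w, b_a):
    """BOPS(l) = FLOPS(l) * b_w * b_a -- a layer's bit-operation count."""
    return flops * b_w * b_a

fp32 = bops(1_000_000, 32, 32)   # full-precision baseline
int4 = bops(1_000_000, 4, 4)     # after learning 4-bit weights/activations
# going from 32-bit to 4-bit on both operands is a 64x bops reduction
```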
the unified pipeline
- optionally decompose layers to expose rank options
- train with float bit widths and the polarization regularizer, auto-pruning when (b < \varepsilon)
- recompose if factorization is not beneficial
- uniformize per-layer settings, round and freeze the bit widths, quick fine-tune, then deploy
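the last stage of the list above can be sketched as follows (the threshold `EPS`, the dict representation, and the layer names are all illustrative):

```python
EPS = 0.5  # pruning threshold on the learned float bit width

def finalize(learned_bits):
    """round-and-freeze sketch: drop layers whose learned bit width
    collapsed below EPS, round the survivors to integers for deployment.
    `learned_bits` maps layer name -> learned float bit width."""
    kept = {}
    for name, b in learned_bits.items():
        if b < EPS:
            continue                       # auto-pruned during training
        kept[name] = max(1, int(round(b))) # frozen integer bit width
    return kept

learned = {"conv1": 4.3, "conv2": 0.1, "fc1": 2.8}
finalize(learned)  # conv2 is pruned; conv1 keeps 4 bits, fc1 keeps 3
```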
challenges
getting the penalty for bit widths right is critical
too weak and the model never compresses
too strong and it collapses and loses accuracy
balancing task loss and compression regularizer required careful sweeps
and multiple experiments
results
on lenet-5 with mnist, the unified method held accuracy (even gaining 0.01 percent) while using only about 0.78 percent of the original bops. on vgg-7 with cifar-10 it remained competitive with mixed-precision baselines.
summary
- quantization reduces precision
- pruning (unstructured or structured) removes parameters
- low-rank approximation factors heavy layers
- the method unifies them under trainable bit widths with polarization so pruning and rank selection fall out naturally
- this maps cleanly to fpga costs via bops and avoids brittle order dependent pipelines
patent
this work was done jointly with Benoit Porteboeuf, who was my tutor throughout the internship and provided invaluable guidance. it resulted in a patent filing that covers the unified compression framework. the patent describes the method for jointly learning quantization levels, pruning decisions, and low-rank factorizations during neural network training, which makes it easier to deploy compressed models on resource-constrained hardware.