my master's thesis explored how to make deep learning models practical on embedded hardware like fpgas, where memory and compute are scarce. i worked on three classic compression families: quantization, pruning, and low-rank approximation, and proposed a unified method that lets a network learn how much to quantize, what to prune, and when to factorize, all inside a single training loop. the goal: simpler tuning and stable, high compression that maps directly to fpga costs.
the three compression techniques
1. quantization
represent weights and activations with fewer bits (b), cutting the memory footprint and replacing floating-point ops with integer ones. on fpgas this directly lowers resource cost.
the canonical symmetric uniform quantizer:
Q(x) = \text{clamp}\left(\left\lfloor \frac{x}{S} \right\rceil, \tau_l(b), \tau_h(b)\right), \quad S = \frac{\beta - \alpha}{2^b - 1}
with an integer clipping range ((\tau_l, \tau_h)) determined by the bit width (b)
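as a concrete sketch (the symmetric range choice, `alpha`, and `b=4` are illustrative, not values from the thesis):

```python
import numpy as np

def uniform_quantize(x, b, alpha=1.0):
    """symmetric uniform quantizer: Q(x) = clamp(round(x / S), tau_l, tau_h),
    with range [-alpha, alpha]."""
    S = 2 * alpha / (2**b - 1)                      # step size
    tau_l, tau_h = -(2**(b - 1)), 2**(b - 1) - 1    # signed integer range
    q = np.clip(np.round(x / S), tau_l, tau_h)      # integer codes
    return q * S                                    # dequantized values

x = np.array([-1.0, -0.3, 0.0, 0.4, 1.0])
xq = uniform_quantize(x, b=4)  # every value snaps to one of 2^4 levels
```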
my variant keeps the bit widths (b) as trainable floats and prunes whenever (b < \varepsilon):
x_q = \alpha \cdot Q_b\left(\frac{x}{\alpha}\right), \quad Q_b(x) = \begin{cases} \text{clamp}(\lfloor x \rceil, \tau_l(b), \tau_h(b)), & b \ge \varepsilon \\ 0, & b < \varepsilon \end{cases}
this trick makes compression differentiable and stable during training
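a forward-pass sketch of that gate (the straight-through gradient handling a real training loop needs is omitted, and `eps` plus the signed integer range are my own illustrative choices):

```python
import numpy as np

def Q_b(x, b, eps=0.5):
    """bit-width-gated quantizer: when the learned float bit width b has
    collapsed below eps, the element is pruned (output 0); otherwise
    clamp-and-round to the integer range for round(b) bits."""
    if b < eps:
        return np.zeros_like(x)                     # pruned
    bi = max(1, int(round(b)))                      # effective integer bits
    tau_l, tau_h = -(2**(bi - 1)), 2**(bi - 1) - 1
    return np.clip(np.round(x), tau_l, tau_h)

def quantize(x, b, alpha=1.0):
    """x_q = alpha * Q_b(x / alpha)"""
    return alpha * Q_b(x / alpha, b)
```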
2. pruning
remove parameters that matter least for the task
- unstructured: drops individual weights, which yields high sparsity, but hardware rarely exploits it
- structured: removes whole neurons, channels, or filters, which leads to real speedups on fpgas
criteria include magnitude ((L_p) norms) and APoZ (Average Percentage of Zeros), among others
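for illustration (the `keep_ratio`, the 2-d weight shape, and the helper names are my own, not from the thesis), both criteria can be sketched as:

```python
import numpy as np

def magnitude_prune_channels(W, keep_ratio=0.5, p=1):
    """structured pruning sketch: rank output channels of W (shape
    [out, in]) by their L_p norm and zero out the weakest ones."""
    scores = np.linalg.norm(W, ord=p, axis=1)         # per-channel L_p norm
    k = max(1, int(round(keep_ratio * W.shape[0])))   # channels to keep
    mask = np.zeros(W.shape[0], dtype=bool)
    mask[np.argsort(scores)[-k:]] = True              # keep the strongest
    return W * mask[:, None], mask

def apoz(activations):
    """APoZ: average fraction of zeros per channel over a batch of
    post-relu activations with shape [batch, channels]."""
    return (activations == 0).mean(axis=0)
```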
3. low-rank approximation
factorize a big weight tensor into smaller ones, for example via truncated svd. for a dense layer this cuts the parameters and multiply-adds from (mn) to (k(m+n)) when the rank (k \ll \min(m,n))
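a minimal sketch with numpy's svd (the shapes and the rank-10 toy matrix are illustrative):

```python
import numpy as np

def low_rank_factorize(W, k):
    """truncated-svd sketch: approximate W (m x n) as A @ B with
    A (m x k) and B (k x n), cutting parameters from m*n to k*(m+n)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :k] * s[:k]          # absorb singular values into A
    B = Vt[:k, :]
    return A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 10)) @ rng.standard_normal((10, 128))  # rank 10
A, B = low_rank_factorize(W, k=10)
# A @ B reconstructs W with 10*(64+128) parameters instead of 64*128
```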
why combine them under one training objective
mixing methods naively is order sensitive: prune-then-quantize is not the same as quantize-then-prune. my thesis reframed pruning and low-rank factorization as emerging from the quantization view, all driven by trainable bit widths.
unified objective
\min_\theta (1-\lambda) \mathcal{L}(\theta; x, y) + \lambda R(\theta_B)
where (\theta_B) are the trainable bit widths
the polarization regularizer
R(B) = \gamma\lVert B\rVert_2 + t\lVert B\rVert_1 - \lVert B - \bar{B}\rVert_1
pushes unimportant elements toward zero while clustering the rest, yielding a clean keep or drop separation
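numerically (the `gamma` and `t` values here are illustrative placeholders, not the thesis hyperparameters):

```python
import numpy as np

def polarization_reg(B, gamma=1e-4, t=1.0):
    """polarization regularizer sketch:
    R(B) = gamma*||B||_2 + t*||B||_1 - ||B - mean(B)||_1.
    the subtracted term rewards spread around the mean, so entries
    polarize toward either 0 or a shared nonzero value."""
    B = np.asarray(B, dtype=float)
    return (gamma * np.linalg.norm(B, 2)
            + t * np.abs(B).sum()
            - np.abs(B - B.mean()).sum())

# a polarized vector scores lower than a flat one with the same L1 mass
polarization_reg([0, 0, 8, 8]) < polarization_reg([4, 4, 4, 4])  # True
```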
why this matters for fpgas: costs scale with bit width
layer cost in bit operations (bops) is
\text{BOPS}(l) = \text{FLOPS}(l) \cdot b_w \cdot b_a
so learning (b_w) and (b_a) while pruning structure attacks this cost directly
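the formula is simple enough to read off directly (the flop count below is a toy number):

```python
def bops(flops, b_w, b_a):
    """BOPS(l) = FLOPS(l) * b_w * b_a -- a layer's bit-operation count."""
    return flops * b_w * b_a

fp32 = bops(1_000_000, 32, 32)   # full-precision baseline
int4 = bops(1_000_000, 4, 4)     # after learning 4-bit weights/activations
# going from 32-bit to 4-bit on both operands is a 64x bops reduction
```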
the unified pipeline
- optionally decompose layers to expose rank options
- train with float bit widths and the polarization regularizer, auto-pruning when (b < \varepsilon)
- recompose if factorization is not beneficial
- uniformize per-layer settings, round and freeze the bit widths, quick fine-tune, then deploy
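the last stage of the list above can be sketched as follows (the threshold `EPS`, the dict representation, and the layer names are all illustrative):

```python
EPS = 0.5  # pruning threshold on the learned float bit width

def finalize(learned_bits):
    """round-and-freeze sketch: drop layers whose learned bit width
    collapsed below EPS, round the survivors to integers for deployment.
    `learned_bits` maps layer name -> learned float bit width."""
    kept = {}
    for name, b in learned_bits.items():
        if b < EPS:
            continue                       # auto-pruned during training
        kept[name] = max(1, int(round(b))) # frozen integer bit width
    return kept

learned = {"conv1": 4.3, "conv2": 0.1, "fc1": 2.8}
finalize(learned)  # conv2 is pruned; conv1 keeps 4 bits, fc1 keeps 3
```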
challenges
getting the penalty for bit widths right is critical
too weak and the model never compresses
too strong and it collapses and loses accuracy
balancing task loss and compression regularizer required careful sweeps
and multiple experiments
results
on lenet-5 with mnist, the unified method held accuracy (even gaining 0.01 percent) while using only about 0.78 percent of the original bops. on vgg-7 with cifar-10 it remained competitive with mixed-precision baselines.
summary
- quantization reduces precision
- pruning (unstructured or structured) removes parameters
- low-rank approximation factors heavy layers
- the method unifies them under trainable bit widths with polarization so pruning and rank selection fall out naturally
- this maps cleanly to fpga costs via bops and avoids brittle order dependent pipelines
patent
this work was done jointly with Benoit Porteboeuf, who was my tutor throughout the internship and provided invaluable guidance. it resulted in a patent filing that covers the unified compression framework. the patent describes the method for jointly learning quantization levels, pruning decisions, and low-rank factorizations during neural network training, which makes it easier to deploy compressed models on resource-constrained hardware.