my master's thesis explored how to make deep learning models practical on embedded hardware like fpgas, where memory and compute are scarce
i worked on three classic compression families: quantization, pruning, and low-rank approximation, and proposed a unified method that lets a network learn how much to quantize, what to prune, and when to factorize inside a single training loop
the goal: simpler tuning and stable, high compression that maps directly to fpga costs
the three compression techniques
1. quantization
represent weights and activations with fewer bits (for example 8 or 4 instead of 32), cutting memory footprint and replacing floating-point ops with integer ones
on fpgas this directly lowers logic and memory cost
the canonical symmetric uniform quantizer is $\hat{w} = s \cdot \mathrm{clip}(\mathrm{round}(w/s),\ -2^{b-1},\ 2^{b-1}-1)$, with scale $s$ and integer range $[-2^{b-1},\ 2^{b-1}-1]$ for $b$ bits
my variant keeps bit widths as trainable floats and prunes an element whenever its learned bit width falls to zero
this trick makes compression differentiable and stable during training
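as a concrete illustration, here is a minimal pytorch sketch of symmetric fake quantization with a trainable float bit width; the class name, the straight-through rounding, and the below-one-bit pruning rule are my own illustrative choices here, not necessarily the exact parameterization from the thesis

```python
import torch
import torch.nn as nn


def round_ste(x: torch.Tensor) -> torch.Tensor:
    # round in the forward pass, identity gradient in the backward pass
    return x + (x.round() - x).detach()


class LearnedBitQuantizer(nn.Module):
    """fake-quantizes a tensor using a learnable (float) bit width"""

    def __init__(self, init_bits: float = 8.0):
        super().__init__()
        self.bits = nn.Parameter(torch.tensor(init_bits))  # trainable float bit width

    def forward(self, w: torch.Tensor) -> torch.Tensor:
        if self.bits.item() < 1.0:
            # below one bit there is nothing left to encode: treat as pruned
            return torch.zeros_like(w)
        b = round_ste(self.bits)              # integer bits forward, gradient still flows to `bits`
        qmax = 2.0 ** (b - 1.0) - 1.0         # symmetric range [-2^(b-1), 2^(b-1) - 1]
        scale = w.detach().abs().max().clamp(min=1e-8) / qmax
        return scale * torch.clamp(round_ste(w / scale), -qmax - 1.0, qmax)
```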
2. pruning
remove parameters that matter least for the task
- unstructured: drops individual weights, which yields high sparsity but hardware rarely exploits it
- structured: removes whole neurons, channels, or filters, which leads to real speedups on fpgas
criteria include magnitude ($\ell_1$/$\ell_2$) norms and APoZ (Average Percentage of Zeros), among others
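for illustration, a plain magnitude-based structured pruning pass over a conv layer could look like the sketch below; this shows the generic $\ell_1$ criterion on its own, separate from the bit-width-driven pruning the thesis uses, and the layer shape and keep ratio are arbitrary

```python
import torch
import torch.nn as nn

def prune_conv_channels(conv: nn.Conv2d, keep_ratio: float = 0.5) -> torch.Tensor:
    """zero out the conv filters with the smallest l1 norm; returns the kept indices"""
    with torch.no_grad():
        norms = conv.weight.abs().sum(dim=(1, 2, 3))   # one l1 norm per output filter
        n_keep = max(1, int(keep_ratio * norms.numel()))
        keep = torch.topk(norms, n_keep).indices
        mask = torch.zeros_like(norms, dtype=torch.bool)
        mask[keep] = True
        conv.weight[~mask] = 0.0                       # a real pipeline would remove the channels entirely
        if conv.bias is not None:
            conv.bias[~mask] = 0.0
    return keep

conv = nn.Conv2d(16, 32, kernel_size=3)
kept = prune_conv_channels(conv, keep_ratio=0.25)      # keeps 8 of the 32 filters
```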
3. low-rank approximation
factorize a big weight tensor into smaller ones, for example with an svd
for an $m \times n$ weight matrix factored at rank $r$, this cuts parameters and multiply-adds from $mn$ to $r(m+n)$, a win whenever $r \ll \frac{mn}{m+n}$
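as a sketch with arbitrary sizes and rank, a dense layer can be split into two thinner ones via a truncated svd:

```python
import torch
import torch.nn as nn

def factorize_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    """approximate layer.weight (out x in) by a rank-`rank` product of two linear layers"""
    W = layer.weight.data                               # shape (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]                        # (out, r), singular values folded in
    V_r = Vh[:rank, :]                                  # (r, in)

    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    first.weight.data.copy_(V_r)
    second.weight.data.copy_(U_r)
    if layer.bias is not None:
        second.bias.data.copy_(layer.bias.data)
    return nn.Sequential(first, second)

big = nn.Linear(512, 512)                               # 512 * 512 = 262k weights
small = factorize_linear(big, rank=32)                  # 2 * 512 * 32 = 33k weights
```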
why combine them under one training objective
mixing methods naively is order-sensitive: prune-then-quantize is not the same as quantize-then-prune, as the small example below shows
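a tiny numeric example of that order sensitivity, using made-up weights and plain min-max quantization: pruning changes the value range, the range sets the quantization grid, so the surviving weights land on different values depending on which step runs first

```python
import torch

def quantize_minmax(x: torch.Tensor, bits: int = 2) -> torch.Tensor:
    # asymmetric uniform quantization over the observed [min, max] range
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / (2 ** bits - 1)
    return lo + torch.round((x - lo) / scale) * scale

def prune_smallest(x: torch.Tensor, keep: int = 2) -> torch.Tensor:
    # magnitude pruning: zero out everything but the `keep` largest entries
    out = torch.zeros_like(x)
    idx = torch.topk(x.abs(), keep).indices
    out[idx] = x[idx]
    return out

w = torch.tensor([0.2, 0.5, 0.9])
print(prune_smallest(quantize_minmax(w)))   # tensor([0.0000, 0.4333, 0.9000])
print(quantize_minmax(prune_smallest(w)))   # tensor([0.0000, 0.6000, 0.9000])
```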
my thesis reframed pruning and low-rank factorization as natural consequences of the quantization view, all driven by trainable bit widths
unified objective
the training loss takes the form $\mathcal{L} = \mathcal{L}_{\mathrm{task}} + \lambda\,\mathcal{R}_{\mathrm{polar}}(\{b_i\})$, where the $\{b_i\}$ are trainable bit widths and $\lambda$ trades accuracy against compression
the polarization regularizer
pushes unimportant elements toward zero while clustering the rest away from zero, yielding a clean keep-or-drop separation
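a rough sketch of how such an objective could be wired up is below; the specific penalty $t\,\lVert b\rVert_1 - \lVert b - \bar{b}\rVert_1$ is one published form of polarization and may differ in detail from the thesis formulation, and the coefficients are illustrative

```python
import torch

def polarization(bits: torch.Tensor, t: float = 1.2) -> torch.Tensor:
    # pushes part of the bit widths toward zero while keeping the others
    # spread away from the mean, giving a clean keep-or-drop separation
    return t * bits.abs().sum() - (bits - bits.mean()).abs().sum()

def total_loss(task_loss: torch.Tensor, bit_widths, lam: float = 1e-4) -> torch.Tensor:
    # bit_widths: iterable of the trainable bit-width tensors collected per layer
    b = torch.cat([bw.reshape(-1) for bw in bit_widths])
    return task_loss + lam * polarization(b)
```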
why this matters for fpgas: costs scale with bit width
layer cost in bit operations (bops) is roughly $\mathrm{BOPs} \approx \#\mathrm{MACs} \cdot b_w \cdot b_a$, the multiply-accumulate count times the weight and activation bit widths
so learning the bit widths while pruning structure attacks this cost directly
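a quick back-of-the-envelope helper with made-up shapes shows how strongly bops react to bit width:

```python
def conv_bops(c_in: int, c_out: int, k: int, h_out: int, w_out: int,
              b_w: int, b_a: int) -> int:
    # dominant bops term for a k x k conv: multiply-accumulates times
    # weight bit width times activation bit width
    macs = c_out * c_in * k * k * h_out * w_out
    return macs * b_w * b_a

full = conv_bops(64, 64, 3, 32, 32, b_w=8, b_a=8)   # 8-bit baseline
tiny = conv_bops(64, 64, 3, 32, 32, b_w=2, b_a=2)   # same layer at 2 bits
print(tiny / full)                                  # 0.0625 -> 16x fewer bops
```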
the unified pipeline
- optionally decompose layers to expose rank options
- train with float bit widths and polarization, auto-pruning whenever a learned bit width reaches zero
- recompose if factorization is not beneficial
- uniformize per-layer settings, round and freeze the bit widths, run a quick fine-tune, then deploy
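the round-and-freeze step could look roughly like this; the quantizer objects and their `bits` attribute follow the earlier sketch and are assumptions, not the thesis code

```python
import torch

def round_and_freeze(quantizers):
    # snap learned float bit widths to integers, keep zero for pruned elements,
    # and stop training them before the final quick fine-tune
    for q in quantizers:                     # e.g. the LearnedBitQuantizer modules above
        with torch.no_grad():
            q.bits.copy_(q.bits.round().clamp(min=0.0))
        q.bits.requires_grad_(False)
```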
challenges
getting the penalty for bit widths right is critical
too weak and the model never compresses
too strong and it collapses and loses accuracy
balancing task loss and compression regularizer required careful sweeps and multiple experiments
results
on lenet 5 with mnist the unified method held accuracy (even gaining 0.01 percent) while using only about 0.78 percent of the original bops
on vgg 7 with cifar 10 it remained competitive versus mixed precision baselines
summary
- quantization reduces precision
- pruning (unstructured or structured) removes parameters
- low rank factors heavy layers
- the method unifies them under trainable bit widths with polarization so pruning and rank selection fall out naturally
- this maps cleanly to fpga costs via bops and avoids brittle order dependent pipelines
patent
this work was done jointly with Benoit Porteboeuf, who was my tutor throughout the internship and provided invaluable guidance
it resulted in a patent filing that covers the unified compression framework
the patent describes the method for jointly learning quantization levels, pruning decisions, and low-rank factorizations during neural network training, which makes it easier to deploy compressed models on resource-constrained hardware