sqlgbm - running ml models directly in sql

2024-03-10

most production ml pipelines have the same architecture: train a model in python, pickle it, spin up a serving layer, then shuttle data back and forth between your database and the inference service

this works, but it’s annoying: you add latency with every network hop, you have to maintain separate infrastructure, and you’re solving a problem that maybe doesn’t need to exist

how gradient boosting works

gradient boosting builds an ensemble of decision trees sequentially: each tree tries to correct the errors of all previous trees

you start with a simple prediction, then add a tree that learns where you were wrong, then add another tree that fixes those mistakes, and repeat. the final prediction is the sum of all trees

mathematically it’s sophisticated, but conceptually it’s just iterative error correction. 100 trees might sound complex, but each tree is just a series of if-then rules
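here’s a minimal sketch of that loop for squared-error regression, where the “error” each tree corrects is just the residual. the names fit_gbm and predict are made up, and sklearn’s DecisionTreeRegressor stands in for the weak learner; lightgbm and xgboost build smarter trees, but the shape is the same

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbm(X, y, n_trees=100, lr=0.1, max_depth=3):
    base = y.mean()                    # start with a simple prediction
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_trees):
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, y - pred)          # learn where we were wrong
        pred += lr * tree.predict(X)   # fix those mistakes
        trees.append(tree)
    return base, trees

def predict(base, trees, X, lr=0.1):
    # the final prediction is the base plus the sum of all trees
    return base + lr * sum(t.predict(X) for t in trees)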

the insight

tree-based models are just nested if statements. a gradient boosting model with 100 trees is really 100 trees’ worth of decision paths that get summed together. there’s nothing inherently complex about that logic
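to make that concrete, here’s what two tiny trees look like as plain python. the feature names and leaf values are made up; a real model has deeper trees and a base score, but the structure is identical

def tree_1(row):
    # one tree is just nested if-then rules over feature values
    if row["age"] < 30:
        return 0.8 if row["income"] > 50000 else 0.2
    return 0.9 if row["edu"] == "high" else 0.4

def tree_2(row):
    if row["income"] > 80000:
        return 0.3
    return -0.1 if row["age"] < 50 else 0.1

# the model's raw score is just the sum of every tree's output
row = {"age": 25, "income": 60000, "edu": "high"}
score = sum(tree(row) for tree in (tree_1, tree_2))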

sql has case statements, and databases are already optimized for executing queries over tabular data. so why not just generate the sql equivalent of your model and run predictions where the data lives?
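here’s a toy version of that translation for a single fitted sklearn tree, just to show the mechanics. everything here is illustrative, not sqlgbm’s actual internals; sqlgbm does the real work for lightgbm and xgboost ensembles

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def tree_to_sql(tree, feature_names):
    # walk a fitted sklearn tree and emit a nested CASE expression
    t = tree.tree_
    def node(i):
        if t.children_left[i] == -1:  # leaf: emit its predicted value
            return f"{t.value[i][0][0]:.6f}"
        cond = f"{feature_names[t.feature[i]]} <= {t.threshold[i]:.6f}"
        return (f"CASE WHEN {cond} THEN {node(t.children_left[i])} "
                f"ELSE {node(t.children_right[i])} END")
    return node(0)

# fit a tiny tree on random data and print its sql equivalent
X = np.random.rand(100, 2)
y = (X[:, 0] + X[:, 1] > 1).astype(float)
tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(tree_to_sql(tree, ["age", "income"]))

an ensemble is then just the sum of one CASE expression per tree, plus the base score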

[diagram] traditional ml serving vs running predictions directly in the database: the traditional path exports data from the database to an ml api and writes results back (latency + complexity), while sqlgbm runs sql inside the database and returns predictions in a single query

what sqlgbm does

sqlgbm takes a trained lightgbm or xgboost model and spits out a sql query that reproduces its predictions

you train your model normally

import lightgbm as lgb

clf = lgb.LGBMClassifier(n_estimators=100, max_depth=5)
clf.fit(X, y)  # X: feature matrix, y: binary labels

then convert it to sql

from sqlgbm import SQLGBM
sqlgbm = SQLGBM(clf)
# a sql query string that reproduces the model's predictions over my_table
sql = sqlgbm.generate_query('my_table', 'probability')

now you have a query you can run directly in postgres, mysql, or duckdb. no python runtime, no model server, no serialization overhead
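for example, with duckdb. this is a sketch: it assumes my_table is visible to duckdb (e.g. a pandas dataframe of that name in scope, which duckdb can scan directly), and the output columns depend on the query sqlgbm generates

import duckdb

# run the generated query where the data lives; no model server involved
scores = duckdb.sql(sql).df()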

the decision tree from the diagram splits on age < 30, then on income > 50k down one branch and edu = high down the other. it converts to this sql case statement:

CASE
    WHEN age < 30 THEN
        CASE WHEN income > 50000 THEN 0.8 ELSE 0.2 END
    ELSE
        CASE WHEN edu = 'high' THEN 0.9 ELSE 0.4 END
END

decision tree structure maps directly to nested sql case statements

why this is useful

if your features live in a database and you need to score them, this eliminates a ton of complexity

batch scoring thousands of rows becomes a single query instead of exporting data, calling an api, waiting for results, and writing back

for real-time predictions you can materialize the query as a view or use it in a stored procedure. the database handles caching and query optimization automatically
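a minimal sketch of the view approach with duckdb, assuming the generated sql is a plain SELECT over my_table (the file and view names are made up)

import duckdb

con = duckdb.connect("features.db")  # hypothetical database file
# anyone querying my_table_scored gets predictions computed on read,
# with the database's own optimizer and caching doing the work
con.execute(f"CREATE OR REPLACE VIEW my_table_scored AS {sql}")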

latency drops because you’re not serializing data or making network calls. deployment becomes simpler because there’s no separate service to manage

tradeoffs

this only makes sense for certain use cases

tree models translate cleanly to sql, but neural networks don’t. very deep models generate huge queries that some databases struggle with. the generated sql isn’t optimized yet, so there’s room for improvement

multiclass classification isn’t supported yet and the api is rough around the edges. it’s early-stage software

but for binary classification with gradient boosting on data that lives in sql, it works well. i use it for simple scoring tasks where spinning up model serving infrastructure feels like overkill

summary

the core idea is that ml inference doesn’t always need special infrastructure. sometimes you can just generate sql and let your database do the work

sqlgbm converts lightgbm and xgboost models to queries that run anywhere sql does: no runtime dependencies, no separate services, no data movement

useful when you want simple deployment and your data already lives in a database

check it out at github.com/mattismegevand/sqlgbm