SmolDocling-OCR: Lightweight Document OCR Pipeline

Machine Learning · completed

Nov 2023 · 3 months

0 GPU: runs entirely on CPU; pre-processing, not horsepower, drove most of the accuracy gains

Overview

OCR demos love clean, flat scans. Real documents are skewed, noisy, photographed at an angle, and sometimes handwritten. SmolDocling-OCR is my attempt at a pipeline that survives that mess without needing a GPU farm: enhance the image first, then extract. It pairs OpenCV and Pillow pre-processing with Tesseract, wrapped in a tiny Flask API and a simple React upload page, so a non-technical user can drop in an image and get text back.

Tech Stack

frontend

React

backend

Python, Flask

other

Tesseract OCR, OpenCV, Pillow

Challenges

Documents arrive in every quality imaginable — skew, glare, low contrast, handwriting.
Squeezing accuracy out of Tesseract, which is only ever as good as the image you feed it.
Keeping the interface simple enough for someone who's never heard of OCR.
Staying lightweight enough to run on a laptop or a cheap VM.

Solution

The pipeline does the unglamorous work up front — deskew, denoise, threshold, and contrast-normalize with OpenCV and Pillow — because most OCR "accuracy" problems are really image-quality problems in disguise. Cleaned images go to Tesseract; a Flask backend handles uploads and the OCR job, and a React front-end shows the extracted text. It's containerized, so it deploys the same way on a laptop or a small server.
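To make the thresholding step concrete, here is a dependency-free sketch of Otsu's method, the algorithm OpenCV exposes through `cv2.threshold` with the `cv2.THRESH_OTSU` flag. It is an illustration only, not the project's code: whether the pipeline uses Otsu or another thresholding mode is an assumption, and the function names `otsu_threshold` and `binarize` are hypothetical. The input is a flat sequence of 8-bit grayscale values.

```python
from typing import List, Sequence


def otsu_threshold(pixels: Sequence[int]) -> int:
    """Return the 0-255 cut-off that maximizes between-class variance."""
    # Build a 256-bin histogram of grayscale intensities.
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(i * count for i, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # intensity sum of those pixels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # no pixels in the dark class yet
        w_fg = total - w_bg
        if w_fg == 0:
            break  # every pixel is already in the dark class

        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance (up to a constant factor).
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels: Sequence[int], threshold: int) -> List[int]:
    """Map pixels above the threshold to white (255), the rest to black (0)."""
    return [255 if p > threshold else 0 for p in pixels]


# Toy "scan": dark ink (30, 40) on a light background (200, 220).
scan = [30] * 50 + [40] * 50 + [200] * 100 + [220] * 100
cutoff = otsu_threshold(scan)
clean = binarize(scan, cutoff)
```

The point of choosing the cut-off from the histogram rather than hard-coding it is that each document gets a threshold suited to its own lighting, which is what makes this step useful on uneven phone photos.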

Outcome

Pre-processing turned out to be the whole game: cleaning the image before Tesseract ever saw it lifted extraction quality dramatically on the rough documents that actually matter, and the pipeline stays light enough to run almost anywhere. The modular design means swapping in a different OCR engine later is a one-file change.
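One way a "one-file change" like that could look is a small engine registry, sketched below. This is a hypothetical layout, not the project's actual module structure: `ENGINES`, `register_engine`, and `extract_text` are made-up names, and the Tesseract function is a stub standing in for the real call.

```python
from typing import Callable, Dict

# An engine is any callable that takes image bytes and returns text.
OcrEngine = Callable[[bytes], str]

# Hypothetical registry: engine name -> implementation.
ENGINES: Dict[str, OcrEngine] = {}


def register_engine(name: str) -> Callable[[OcrEngine], OcrEngine]:
    """Decorator that adds an engine to the registry under `name`."""
    def decorator(fn: OcrEngine) -> OcrEngine:
        ENGINES[name] = fn
        return fn
    return decorator


def extract_text(image: bytes, engine: str = "tesseract") -> str:
    """Run the named engine on an already pre-processed image."""
    try:
        run = ENGINES[engine]
    except KeyError:
        raise ValueError(f"unknown OCR engine: {engine!r}") from None
    return run(image)


@register_engine("tesseract")
def tesseract_stub(image: bytes) -> str:
    # Stand-in only: a real module would hand the image to Tesseract here.
    return f"[tesseract stub] {len(image)} bytes"
```

With this shape, adding another engine means writing one new module that defines a function and decorates it with `register_engine`; the Flask route and the pre-processing code only ever call `extract_text`.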

What I'd do differently

Tesseract was the right call for a lightweight build, but for handwriting and dense layouts I hit its ceiling. I'd add an optional transformer-based OCR backend (the Docling / TrOCR family the name nods to) for the hard cases, keeping Tesseract as the fast default.
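A minimal sketch of how that routing might work, assuming each engine returns its text together with a confidence score normalized to 0.0-1.0 (Tesseract reports per-word confidences that could be averaged for this). The `OcrResult` type, the `with_fallback` helper, and the 0.6 cut-off are all hypothetical and would need tuning on real documents.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class OcrResult:
    text: str
    confidence: float  # assumed to be normalized to 0.0-1.0


Engine = Callable[[bytes], OcrResult]


def with_fallback(fast: Engine, slow: Engine, min_confidence: float = 0.6) -> Engine:
    """Wrap two engines: use `fast` first, call `slow` only on weak results."""
    def run(image: bytes) -> OcrResult:
        first = fast(image)
        if first.confidence >= min_confidence:
            return first  # good enough, skip the expensive engine
        second = slow(image)
        # Keep whichever engine was more confident about its output.
        return second if second.confidence > first.confidence else first
    return run
```

Easy scans never touch the heavier model, so the common case stays as fast and CPU-friendly as the Tesseract-only build.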

Built with

Python, Tesseract OCR, OpenCV, Pillow, Flask, React