Train Your Own OCR Model from Scratch with PaddleOCR

Build a custom OCR model from zero: collect and label datasets with PPOCRLabel, convert them for detection, recognition, and key information extraction training, fine-tune PaddleOCR models, evaluate with precision, recall, and Hmean, and export inference models for deployment. Includes commands, configs, and a reproducible end-to-end pipeline for production systems.

2025-11-25

Pre-trained OCR models work well in many scenarios.
However, when dealing with custom PDFs, scanned documents, or domain-specific images, generic OCR models often fail to achieve acceptable accuracy.

In such cases, training or fine-tuning your own OCR model becomes necessary.

This article provides a practical, end-to-end guide to training an OCR model from scratch using PaddleOCR, covering data preparation, labeling, training, evaluation, and deployment.


What Is PaddleOCR

PaddleOCR is an open-source OCR toolkit developed by Baidu.
It is built on the PaddlePaddle deep learning framework and provides a complete OCR pipeline.

PaddleOCR supports:

- Text detection and recognition in multiple languages
- Key information extraction (SER and RE)
- Semi-automated data labeling via PPOCRLabel
- Export of trained models to lightweight inference formats

It is widely used in document digitization, invoice processing, ID recognition, and industrial OCR systems.


OCR Task Overview

PaddleOCR divides OCR tasks into several core modules.

Text Detection (det)

Text detection locates text regions in an image.
The output is bounding boxes that define where text lines appear.

Text Recognition (rec)

Text recognition converts detected text regions into actual characters.
This step transforms image patches into readable text strings.

Key Information Extraction (kie)

Key information extraction focuses on structured documents such as forms and invoices.

It consists of two subtasks:

- Semantic Entity Recognition (SER): classifies each detected text line into a semantic category such as question, answer, or header
- Relation Extraction (RE): links related entities, for example pairing each question with its answer

A typical KIE pipeline follows:

Text Detection → Text Recognition → SER → RE

Evaluation Metrics

PaddleOCR uses three standard metrics:

- Precision: the fraction of predicted text boxes that match a ground-truth box
- Recall: the fraction of ground-truth boxes that are detected
- Hmean: the harmonic mean of precision and recall

In practice, Hmean is the most important indicator.
A value closer to 1.0 indicates better overall performance.
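
Since Hmean is simply the harmonic mean of detection precision and recall, the relationship can be made concrete with a short self-contained calculation (pure Python, no PaddleOCR dependency; the counts are invented example numbers):

```python
def detection_metrics(tp: int, fp: int, fn: int) -> dict:
    """Compute precision, recall, and Hmean from match counts.

    tp: predicted boxes matched to ground truth
    fp: predicted boxes with no ground-truth match
    fn: ground-truth boxes that were missed
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    hmean = (2 * precision * recall / (precision + recall)
             if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "hmean": hmean}

# Example: 90 correct boxes, 10 spurious, 10 missed
m = detection_metrics(tp=90, fp=10, fn=10)
print(m)  # precision=0.9, recall=0.9, hmean=0.9
```

Because Hmean is harmonic, it punishes imbalance: a model with 1.0 precision but 0.5 recall scores only about 0.67, which is why it is the headline number.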


Training Architecture Overview

Training vs Inference Models

PaddleOCR ships models in two formats. Inference models are frozen, lightweight, and optimized for deployment; however, they cannot be used for further training.

If fine-tuning is required, training models must be used instead.

Recommended Project Structure

OCRProject/
├── inference/        # Inference models
├── output/           # Training outputs
├── PaddleOCR/        # PaddleOCR source code
└── train_data/       # Training datasets
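
Assuming a Unix-like shell, the empty layout can be created in one step (PaddleOCR itself is cloned into the same root in the next section):

```shell
# Create the project root and its empty subdirectories
mkdir -p OCRProject/inference OCRProject/output OCRProject/train_data
```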

Environment Setup

Correct environment configuration is critical.
Even minor version mismatches can cause runtime errors.

Python Version Requirement

PaddleOCR 2.x and the pinned dependency versions below are typically run on Python 3.8–3.10; very new interpreters may fail to install paddlepaddle 2.6.0.

Clone PaddleOCR

git clone https://gitee.com/paddlepaddle/PaddleOCR.git
cd PaddleOCR
git checkout release/2.7
pip install -r requirements.txt

Install Core Dependencies (CPU)

pip install paddlepaddle==2.6.0
pip install paddlenlp==2.6.0
pip install paddleocr==2.6.1.3
pip install numpy==1.24.0
pip install xlrd setuptools_scm==7.0.4

Dataset Creation with PPOCRLabel

PPOCRLabel is a semi-automated labeling tool included in PaddleOCR.
It supports text detection, recognition, and KIE labeling.

Launch PPOCRLabel

cd PaddleOCR/PPOCRLabel
python PPOCRLabel.py --lang ch --kie True

Labeling Guidelines

- Draw a tight quadrilateral box around each text line, one line per box
- Verify and correct the auto-recognized transcription before confirming a box
- Assign KIE labels (e.g. question, answer) consistently across all images

Exported Files

After labeling, PPOCRLabel generates:

- Label.txt — detection annotations (image path plus boxes and transcriptions)
- rec_gt.txt — recognition ground truth for the cropped text images
- crop_img/ — cropped text-line images used for recognition training
- fileState.txt — per-image labeling progress

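Each line of the detection label file pairs an image path with a JSON list of annotations, one entry per text box. A minimal parser for that format (pure Python; the sample line is a fabricated illustration):

```python
import json

def parse_label_line(line: str):
    """Split one PPOCRLabel detection line into (image_path, annotations)."""
    path, raw = line.rstrip("\n").split("\t", 1)
    return path, json.loads(raw)

sample = ('imgs/invoice_001.jpg\t'
          '[{"transcription": "Total: 42.00", '
          '"points": [[10, 10], [200, 10], [200, 40], [10, 40]]}]')

path, boxes = parse_label_line(sample)
print(path, boxes[0]["transcription"])
```

The `points` list holds the four corners of the quadrilateral box in clockwise order, which is why detection labels survive rotated or skewed text lines.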

Dataset Conversion

Detection and Recognition Dataset

python gen_ocr_train_val_test.py --datasetRootPath /path/to/dataset

The script splits data into training, validation, and test sets.
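
Conceptually, the script shuffles the labeled lines and writes them into separate label files. A simplified sketch of that split (the ratios, seed, and variable names here are illustrative, not the script's actual defaults):

```python
import random

def split_labels(lines, train=0.8, val=0.1, seed=42):
    """Shuffle annotation lines and split into train/val/test lists."""
    rng = random.Random(seed)
    shuffled = lines[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

lines = [f"img_{i}.jpg\t[...]" for i in range(100)]
tr, va, te = split_labels(lines)
print(len(tr), len(va), len(te))  # 80 10 10
```

Fixing the seed keeps the split reproducible, so detection and recognition models can be compared across runs on identical validation data.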

SER Dataset Preparation

For SER training, the exported annotations must be converted into the XFUND-style JSON format expected by the LayoutXLM configs, where each image entry lists its text lines with text, box, and label fields.

RE Dataset Notes

RE datasets require explicit key–value linking.
Each question must be linked to its corresponding answer using ID pairs.
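
In XFUND-style annotations, each entity carries an id and a linking list of [question_id, answer_id] pairs. A small validator for that convention (the field names follow the XFUND format; the sample data is invented):

```python
def check_links(entities):
    """Verify every question links to an existing answer entity."""
    by_id = {e["id"]: e for e in entities}
    for e in entities:
        if e["label"] != "question":
            continue
        for q_id, a_id in e.get("linking", []):
            assert q_id == e["id"], "link must start at the question"
            assert by_id[a_id]["label"] == "answer", "link target must be an answer"
    return True

entities = [
    {"id": 1, "label": "question", "text": "Name:", "linking": [[1, 2]]},
    {"id": 2, "label": "answer", "text": "Alice", "linking": [[1, 2]]},
]
print(check_links(entities))  # True
```

Running a check like this before training catches dangling ids early; broken links otherwise surface only as confusing RE loss behavior.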


Model Training Workflow

Download Pre-trained Models

Pre-trained detection, recognition, and KIE models are available from the official PaddleOCR model zoo. Download the training variants (not the inference ones) if you intend to fine-tune.


Detection Model Fine-tuning

Text detection models often merge adjacent text lines.
Fine-tuning improves layout separation for structured documents.

Start Training

python tools/train.py \
  -c configs/det/your_config.yml \
  -o Global.use_gpu=False

Evaluate Model

python tools/eval.py \
  -c configs/det/your_config.yml \
  -o Global.checkpoints=./output/det/best_accuracy

Export Inference Model

python tools/export_model.py -c configs/det/your_config.yml

Recognition Model Fine-tuning

The training workflow for recognition models mirrors that of detection models; only the configuration file differs.
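
One difference worth noting is evaluation: detection is scored with Hmean, while recognition output is usually judged by exact-match accuracy or character error rate (CER), the edit distance between prediction and ground truth divided by the ground-truth length. A dependency-free sketch of CER:

```python
def cer(pred: str, truth: str) -> float:
    """Character error rate: Levenshtein distance / len(truth)."""
    if not truth:
        return float(len(pred) > 0)
    prev = list(range(len(truth) + 1))
    for i, pc in enumerate(pred, 1):
        cur = [i]
        for j, tc in enumerate(truth, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (pc != tc)))   # substitution
        prev = cur
    return prev[-1] / len(truth)

print(cer("INV0ICE", "INVOICE"))  # one substitution -> 1/7
```

Lower is better: 0.0 means a perfect transcription, and values above 1.0 are possible when the prediction is much longer than the ground truth.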


SER Model Training

SER models classify text lines into semantic categories.

Start Training

python tools/train.py \
  -c configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh.yml \
  -o Architecture.Backbone.checkpoints=/path/to/pretrained/model

Inference Testing

python ppstructure/kie/predict_kie_token_ser.py \
  --kie_algorithm=LayoutXLM \
  --ser_model_dir=/path/to/ser/inference \
  --image_dir=/path/to/images \
  --ser_dict_path=/path/to/label_dict

Deployment Notes

- Export every fine-tuned model to the inference format with tools/export_model.py before deployment
- Inference models are smaller and faster but cannot be retrained, so keep the training checkpoints
- Use the prediction scripts (e.g. ppstructure/kie/predict_kie_token_ser.py) to validate the exported models end to end


Quick Summary

- PaddleOCR provides a complete pipeline: text detection, text recognition, and key information extraction (SER + RE)
- PPOCRLabel produces the detection, recognition, and KIE annotations in one tool
- Fine-tune from official pre-trained training models and evaluate with precision, recall, and Hmean
- Export trained models to the inference format for deployment, keeping checkpoints for future fine-tuning