Pre-trained OCR models work well in many scenarios.
However, when dealing with custom PDFs, scanned documents, or domain-specific images, generic OCR models often fail to achieve acceptable accuracy.
In such cases, training or fine-tuning your own OCR model becomes necessary.
This article provides a practical, end-to-end guide to training an OCR model from scratch using PaddleOCR, covering data preparation, labeling, training, evaluation, and deployment.
What Is PaddleOCR
PaddleOCR is an open-source OCR toolkit developed by Baidu.
It is built on the PaddlePaddle deep learning framework and provides a complete OCR pipeline.
PaddleOCR supports:
- Text detection
- Text recognition
- Key information extraction (KIE)
- More than 80 languages, including Chinese and English
It is widely used in document digitization, invoice processing, ID recognition, and industrial OCR systems.
OCR Task Overview
PaddleOCR divides OCR tasks into several core modules.
Text Detection (det)
Text detection locates text regions in an image.
The output is bounding boxes that define where text lines appear.
Text Recognition (rec)
Text recognition converts detected text regions into actual characters.
This step transforms image patches into readable text strings.
Key Information Extraction (kie)
Key information extraction focuses on structured documents such as forms and invoices.
It consists of two subtasks:
- SER (Semantic Entity Recognition): classifies each text segment into categories such as name, date, or address.
- RE (Relation Extraction): links related text segments into key–value pairs.
A typical KIE pipeline follows:
Text Detection → Text Recognition → SER → RE
Evaluation Metrics
PaddleOCR uses three standard metrics:
- Precision
- Recall
- Hmean (F1 score)
In practice, Hmean is the most important indicator.
A value closer to 1.0 indicates better overall performance.
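Hmean is the harmonic mean of precision and recall, so it only approaches 1.0 when both are high. A minimal sketch of the relationship:

def hmean(precision: float, recall: float) -> float:
    # Harmonic mean of precision and recall (equivalent to the F1 score)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(hmean(0.95, 0.90))  # ~0.924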
Training Architecture Overview
Training vs Inference Models
Inference models are lightweight and optimized for deployment.
However, they cannot be used for further training.
If fine-tuning is required, training models must be used instead.
Recommended Project Structure
OCRProject/
├── inference/ # Inference models
├── output/ # Training outputs
├── PaddleOCR/ # PaddleOCR source code
└── train_data/ # Training datasets
Environment Setup
Correct environment configuration is critical.
Even minor version mismatches can cause runtime errors.
Python Version Requirement
- Python 3.7 – 3.9 (strict requirement)
Clone PaddleOCR
git clone https://gitee.com/paddlepaddle/PaddleOCR.git
cd PaddleOCR
git checkout release/2.7
pip install -r requirements.txt
Install Core Dependencies (CPU)
pip install paddlepaddle==2.6.0
pip install paddlenlp==2.6.0
pip install paddleocr==2.6.1.3
pip install numpy==1.24.0
pip install xlrd setuptools_scm==7.0.4
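After installation, a quick sanity check confirms that PaddlePaddle imports and runs correctly (a minimal sketch; it assumes the CPU build installed above):

import paddle

# PaddlePaddle's built-in self-test; prints a success message if the install works
paddle.utils.run_check()
print(paddle.__version__)  # expected: 2.6.0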
Dataset Creation with PPOCRLabel
PPOCRLabel is a semi-automated labeling tool included in PaddleOCR.
It supports text detection, recognition, and KIE labeling.
Launch PPOCRLabel
cd PaddleOCR/PPOCRLabel
python PPOCRLabel.py --lang ch --kie True
Labeling Guidelines
- Label each text line separately
- Avoid overlapping bounding boxes
- Mark irrelevant text as "other"
- Around 50 labeled images are usually sufficient for stable results
Exported Files
After labeling, PPOCRLabel generates:
- Label.txt – detection labels
- rec_gt.txt – recognition labels
- crop_img/ – cropped text images
- fileState.txt – labeling status
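Each line of Label.txt pairs an image path with a JSON list of boxes and transcriptions. A minimal parsing sketch (the file name and annotation values below are illustrative):

import json

# One line of Label.txt: "<image path>\t<JSON list of annotations>"
line = 'imgs/001.jpg\t[{"transcription": "Invoice No. 12345", "points": [[10, 20], [200, 20], [200, 50], [10, 50]], "key_cls": "invoice_no"}]'

img_path, annotations = line.split('\t', 1)
for box in json.loads(annotations):
    print(img_path, box["key_cls"], box["transcription"], box["points"])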
Dataset Conversion
Detection and Recognition Dataset
python gen_ocr_train_val_test.py --datasetRootPath /path/to/dataset
The script splits data into training, validation, and test sets.
SER Dataset Preparation
For SER training:
- Use detection datasets
- Replace the field name key_cls with label in annotation files (see the sketch below)
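A minimal sketch of that rename (the paths are placeholders; it assumes the Label.txt format shown earlier, where each box carries a key_cls field):

import json

src, dst = "train_data/Label.txt", "train_data/ser_label.txt"  # placeholder paths

with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
    for line in fin:
        img_path, annotations = line.rstrip("\n").split("\t", 1)
        boxes = json.loads(annotations)
        for box in boxes:
            # SER training expects the category under "label" instead of "key_cls"
            if "key_cls" in box:
                box["label"] = box.pop("key_cls")
        fout.write(img_path + "\t" + json.dumps(boxes, ensure_ascii=False) + "\n")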
RE Dataset Notes
RE datasets require explicit key–value linking.
Each question must be linked to its corresponding answer using ID pairs.
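In the XFUND-style annotations used by the LayoutXLM RE configs, each entity carries an id and a linking list of [question_id, answer_id] pairs. An illustrative fragment (ids and values are made up):

# One question-answer pair expressed through matching ids in the "linking" field
entities = [
    {"id": 1, "transcription": "Date:", "label": "question", "linking": [[1, 2]]},
    {"id": 2, "transcription": "2024-05-01", "label": "answer", "linking": [[1, 2]]},
]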
Model Training Workflow
Download Pre-trained Models
Official model links:
- Detection & Recognition: https://gitee.com/paddlepaddle/PaddleOCR/blob/release/2.7/doc/doc_ch/models_list.md
- SER & RE: https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.7/doc/doc_ch/algorithm_kie_layoutxlm.md
Detection Model Fine-tuning
Pre-trained text detection models often merge adjacent text lines.
Fine-tuning on your own data improves line separation in structured documents.
Start Training
python tools/train.py \
-c configs/det/your_config.yml \
-o Global.use_gpu=False
Evaluate Model
python tools/eval.py \
-c configs/det/your_config.yml \
-o Global.checkpoints=./output/det/best_accuracy
Export Inference Model
python tools/export_model.py \
-c configs/det/your_config.yml \
-o Global.pretrained_model=./output/det/best_accuracy Global.save_inference_dir=./inference/det
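Once exported, the detection model can be loaded through the paddleocr Python API for a quick end-to-end check (a sketch; the model directory and test image are assumptions, and recognition falls back to the default model):

from paddleocr import PaddleOCR

# det_model_dir points at the exported detection model; lang="ch" loads the default Chinese recognizer
ocr = PaddleOCR(det_model_dir="./inference/det", use_angle_cls=False, lang="ch")
result = ocr.ocr("test_images/sample.jpg", cls=False)
for box, (text, score) in result[0]:
    print(box, text, round(score, 3))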
Recognition Model Fine-tuning
The training process for recognition models is the same as for detection; only the configuration file differs.
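Recognition labels in rec_gt.txt are simpler than detection labels: one cropped image path and its transcription per line, separated by a tab. A minimal parsing sketch (contents are illustrative):

# One line of rec_gt.txt: "<cropped image path>\t<transcription>"
line = "crop_img/001_crop_0.jpg\tInvoice No. 12345"
img_path, text = line.rstrip("\n").split("\t", 1)
print(img_path, text)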
SER Model Training
SER models classify text lines into semantic categories.
Start Training
python tools/train.py \
-c configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh.yml \
-o Architecture.Backbone.checkpoints=/path/to/pretrained/model
Inference Testing
python ppstructure/kie/predict_kie_token_ser.py \
--kie_algorithm=LayoutXLM \
--ser_model_dir=/path/to/ser/inference \
--image_dir=/path/to/images \
--ser_dict_path=/path/to/label_dict
Deployment Notes
- Use inference models for production
- Keep training models for iteration
- Monitor Hmean to avoid overfitting
Quick Summary
- This article explains how to train an OCR model from scratch using PaddleOCR.
- It covers environment setup, data labeling, dataset conversion, and fine-tuning.
- The workflow applies to detection, recognition, and key information extraction tasks.