Pre-trained OCR models work well in many scenarios.
However, when dealing with custom PDFs, scanned documents, or domain-specific images, generic OCR models often fail to achieve acceptable accuracy.
In such cases, training or fine-tuning your own OCR model becomes necessary.
This article provides a practical, end-to-end guide to training an OCR model from scratch using PaddleOCR, covering data preparation, labeling, training, evaluation, and deployment.
What Is PaddleOCR
PaddleOCR is an open-source OCR toolkit developed by Baidu.
It is built on the PaddlePaddle deep learning framework and provides a complete OCR pipeline.
PaddleOCR supports:
- Text detection
- Text recognition
- Key information extraction (KIE)
- More than 80 languages, including Chinese and English
It is widely used in document digitization, invoice processing, ID recognition, and industrial OCR systems.
OCR Task Overview
PaddleOCR divides OCR tasks into several core modules.
Text Detection (det)
Text detection locates text regions in an image.
The output is bounding boxes that define where text lines appear.
Text Recognition (rec)
Text recognition converts detected text regions into actual characters.
This step transforms image patches into readable text strings.
Key Information Extraction (kie)
Key information extraction focuses on structured documents such as forms and invoices.
It consists of two subtasks:
- SER (Semantic Entity Recognition): classifies each text segment into categories such as name, date, or address.
- RE (Relation Extraction): links related text segments into key–value pairs.
A typical KIE pipeline follows:
Text Detection → Text Recognition → SER → RE
Evaluation Metrics
PaddleOCR uses three standard metrics:
- Precision
- Recall
- Hmean (F1 score)
In practice, Hmean is the most important indicator.
A value closer to 1.0 indicates better overall performance.
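Hmean is the harmonic mean of precision and recall, so it only approaches 1.0 when both are high. A minimal sketch of the relationship:

def hmean(precision: float, recall: float) -> float:
    # Harmonic mean of precision and recall (equivalent to the F1 score)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(hmean(0.95, 0.90))  # ~0.924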
Training Architecture Overview
Training vs Inference Models
Inference models are lightweight and optimized for deployment.
However, they cannot be used for further training.
If fine-tuning is required, training models must be used instead.
Recommended Project Structure
OCRProject/
├── inference/ # Inference models
├── output/ # Training outputs
├── PaddleOCR/ # PaddleOCR source code
└── train_data/ # Training datasets
Environment Setup
Correct environment configuration is critical.
Even minor version mismatches can cause runtime errors.
Python Version Requirement
- Python 3.7 – 3.9 (strict requirement)
Clone PaddleOCR
git clone https://gitee.com/paddlepaddle/PaddleOCR.git
cd PaddleOCR
git checkout release/2.7
pip install -r requirements.txt
Install Core Dependencies (CPU)
pip install paddlepaddle==2.6.0
pip install paddlenlp==2.6.0
pip install paddleocr==2.6.1.3
pip install numpy==1.24.0
pip install xlrd setuptools_scm==7.0.4
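After installation, a quick sanity check confirms that PaddlePaddle imports and runs correctly (a minimal sketch; it assumes the CPU build installed above):

import paddle

# PaddlePaddle's built-in self-test; prints a success message if the install works
paddle.utils.run_check()
print(paddle.__version__)  # expected: 2.6.0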
Dataset Creation with PPOCRLabel
PPOCRLabel is a semi-automated labeling tool included in PaddleOCR.
It supports text detection, recognition, and KIE labeling.
Launch PPOCRLabel
cd PaddleOCR/PPOCRLabel
python PPOCRLabel.py --lang ch --kie True
Labeling Guidelines
- Label each text line separately
- Avoid overlapping bounding boxes
- Mark irrelevant text as "other"
- Around 50 labeled images are usually sufficient for stable results
Exported Files
After labeling, PPOCRLabel generates:
- Label.txt – detection labels
- rec_gt.txt – recognition labels
- crop_img/ – cropped text images
- fileState.txt – labeling status
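Each line of Label.txt pairs an image path with a JSON list of boxes and transcriptions. A minimal parsing sketch (the file name and annotation values below are illustrative):

import json

# One line of Label.txt: "<image path>\t<JSON list of annotations>"
line = 'imgs/001.jpg\t[{"transcription": "Invoice No. 12345", "points": [[10, 20], [200, 20], [200, 50], [10, 50]], "key_cls": "invoice_no"}]'

img_path, annotations = line.split('\t', 1)
for box in json.loads(annotations):
    print(img_path, box["key_cls"], box["transcription"], box["points"])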
Dataset Conversion
Detection and Recognition Dataset
python gen_ocr_train_val_test.py --datasetRootPath /path/to/dataset
The script splits data into training, validation, and test sets.
SER Dataset Preparation
For SER training:
- Use detection datasets
- Replace the field name key_cls with label in annotation files (see the sketch below)
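A minimal sketch of that rename (the paths are placeholders; it assumes the Label.txt format shown earlier, where each box carries a key_cls field):

import json

src, dst = "train_data/Label.txt", "train_data/ser_label.txt"  # placeholder paths

with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
    for line in fin:
        img_path, annotations = line.rstrip("\n").split("\t", 1)
        boxes = json.loads(annotations)
        for box in boxes:
            # SER training expects the category under "label" instead of "key_cls"
            if "key_cls" in box:
                box["label"] = box.pop("key_cls")
        fout.write(img_path + "\t" + json.dumps(boxes, ensure_ascii=False) + "\n")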
RE Dataset Notes
RE datasets require explicit key–value linking.
Each question must be linked to its corresponding answer using ID pairs.
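In the XFUND-style annotations used by the LayoutXLM RE configs, each entity carries an id and a linking list of [question_id, answer_id] pairs. An illustrative fragment (ids and values are made up):

# One question-answer pair expressed through matching ids in the "linking" field
entities = [
    {"id": 1, "transcription": "Date:", "label": "question", "linking": [[1, 2]]},
    {"id": 2, "transcription": "2024-05-01", "label": "answer", "linking": [[1, 2]]},
]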
Model Training Workflow
Download Pre-trained Models
Official model links:
- Detection & Recognition: https://gitee.com/paddlepaddle/PaddleOCR/blob/release/2.7/doc/doc_ch/models_list.md
- SER & RE: https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.7/doc/doc_ch/algorithm_kie_layoutxlm.md
Detection Model Fine-tuning
Pre-trained text detection models often merge adjacent text lines.
Fine-tuning on your own data improves line separation in structured documents.
Start Training
python tools/train.py \
-c configs/det/your_config.yml \
-o Global.use_gpu=False
Evaluate Model
python tools/eval.py \
-c configs/det/your_config.yml \
-o Global.checkpoints=./output/det/best_accuracy
Export Inference Model
python tools/export_model.py \
-c configs/det/your_config.yml \
-o Global.pretrained_model=./output/det/best_accuracy Global.save_inference_dir=./inference/det
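Once exported, the detection model can be loaded through the paddleocr Python API for a quick end-to-end check (a sketch; the model directory and test image are assumptions, and recognition falls back to the default model):

from paddleocr import PaddleOCR

# det_model_dir points at the exported detection model; lang="ch" loads the default Chinese recognizer
ocr = PaddleOCR(det_model_dir="./inference/det", use_angle_cls=False, lang="ch")
result = ocr.ocr("test_images/sample.jpg", cls=False)
for box, (text, score) in result[0]:
    print(box, text, round(score, 3))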
Recognition Model Fine-tuning
The training process for recognition models is the same as for detection; only the configuration file differs.
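Recognition labels in rec_gt.txt are simpler than detection labels: one cropped image path and its transcription per line, separated by a tab. A minimal parsing sketch (contents are illustrative):

# One line of rec_gt.txt: "<cropped image path>\t<transcription>"
line = "crop_img/001_crop_0.jpg\tInvoice No. 12345"
img_path, text = line.rstrip("\n").split("\t", 1)
print(img_path, text)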
SER Model Training
SER models classify text lines into semantic categories.
Start Training
python tools/train.py \
-c configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh.yml \
-o Architecture.Backbone.checkpoints=/path/to/pretrained/model
Inference Testing
python ppstructure/kie/predict_kie_token_ser.py \
--kie_algorithm=LayoutXLM \
--ser_model_dir=/path/to/ser/inference \
--image_dir=/path/to/images \
--ser_dict_path=/path/to/label_dict
Deployment Notes
- Use inference models for production
- Keep training models for iteration
- Monitor Hmean to avoid overfitting
Quick Summary
- This article explains how to train an OCR model from scratch using PaddleOCR.
- It covers environment setup, data labeling, dataset conversion, and fine-tuning.
- The workflow applies to detection, recognition, and key information extraction tasks.