Arabic Handwritten Text Recognition using TrOCR

Fine-tuning a pre-trained Transformer-based OCR model for Arabic handwritten text-line recognition on KHATT v1.0.

Abstract

Arabic handwritten text recognition is a challenging OCR problem because Arabic script is cursive, characters change shape depending on their position in a word, and handwriting styles vary significantly between writers.

This project fine-tunes TrOCR, a Transformer-based encoder-decoder OCR model, on the KHATT v1.0 Arabic handwritten text-line dataset.

Our system takes an image of a handwritten Arabic text line as input and outputs the corresponding digital Arabic text. The fine-tuned TrOCR model achieved a CER of 8.10% and a WER of 12.50%, compared with a CER of 18.45% for the cross-attention Arabic handwritten recognition baseline.

Task: Arabic handwritten text-line recognition
Dataset: KHATT v1.0
Main Model: Fine-tuned TrOCR
Main Metric: Character Error Rate (CER)

Teaser Figure

A handwritten Arabic text-line image is passed through a fine-tuned TrOCR encoder-decoder model, which generates the recognized Arabic text. The system is trained and evaluated on KHATT v1.0 and compared against a cross-attention Arabic handwritten recognition baseline.

Figure 1: Overview of the proposed Arabic handwritten OCR pipeline.

Introduction

Optical Character Recognition (OCR) converts images of text into machine-readable text. While OCR systems have achieved strong performance on printed Latin text, Arabic handwritten OCR remains difficult because Arabic script is cursive and handwriting varies widely between writers.

Traditional OCR systems usually combine CNNs for visual feature extraction, RNNs for sequence modeling, and CTC for alignment. These pipelines are effective but complex.

TrOCR simplifies the pipeline by using a Vision Transformer encoder and a Transformer decoder in a fully end-to-end architecture.

In this project, we investigate whether a large pre-trained Transformer OCR model can transfer effectively to Arabic handwritten text recognition.

Approach

System Input and Output

The input to the system is a single handwritten Arabic text-line image. The output is the corresponding digital Arabic text transcription.

الذكاء الاصطناعي يغير العالم
(Example output; in English: "Artificial intelligence is changing the world.")

TrOCR Architecture

TrOCR is an encoder-decoder Transformer OCR model. The encoder processes the image and extracts visual features, while the decoder generates the output text token by token.

1. Image Preprocessing: Input handwritten images are resized and normalized before training.
2. Vision Encoder: The image is divided into patches and processed using a Vision Transformer encoder.
3. Cross Attention: The decoder attends to visual features extracted by the encoder while generating text.
4. Text Decoder: The decoder generates Arabic text autoregressively, one token at a time.
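
The steps above can be illustrated with a small, dependency-free sketch. It is not the project's code: the normalization constants, the patch size, the special tokens, and the toy_next_token function are all assumptions made for illustration. In particular, toy_next_token stands in for the trained decoder and its cross-attention over encoder features, and simply replays a fixed transcription.

```python
from typing import Callable, List, Sequence

Image = List[List[float]]  # grayscale image as rows of pixel values in [0, 255]


def normalize(image: Image, mean: float = 0.5, std: float = 0.5) -> Image:
    """Scale pixels to [0, 1], then standardize with an assumed mean and std."""
    return [[(px / 255.0 - mean) / std for px in row] for row in image]


def split_into_patches(image: Image, patch: int) -> List[Image]:
    """Cut an H x W image into non-overlapping patch x patch tiles, row-major."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    return [
        [row[x:x + patch] for row in image[y:y + patch]]
        for y in range(0, height, patch)
        for x in range(0, width, patch)
    ]


def greedy_decode(
    next_token: Callable[[Sequence[Image], Sequence[str]], str],
    patches: Sequence[Image],
    bos: str = "<s>",
    eos: str = "</s>",
    max_len: int = 32,
) -> List[str]:
    """Generate tokens one at a time; each step sees the patches and the prefix."""
    tokens = [bos]
    for _ in range(max_len):
        token = next_token(patches, tokens)
        if token == eos:
            break
        tokens.append(token)
    return tokens[1:]  # drop the start token


def toy_next_token(patches, prefix, target=("مرحبا", "بالعالم")):
    """Hypothetical stand-in for the decoder: replays a fixed transcription."""
    step = len(prefix) - 1  # number of tokens generated so far
    return target[step] if step < len(target) else "</s>"


image = [[255.0] * 8 for _ in range(4)]  # blank 4 x 8 "text line"
patches = split_into_patches(normalize(image), patch=4)
print(len(patches))  # number of 4 x 4 patches
print(greedy_decode(toy_next_token, patches))
```

In a real model the patches are flattened and embedded before entering the encoder, and the next token is chosen from decoder scores conditioned on the encoder output rather than from a fixed list.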

Arabic Decoder Modification

Since the original TrOCR decoder was pre-trained mainly for English language modeling, we modified the decoder to better support Arabic text generation by initializing it with bert-base-arabic.

This modification helps the model better understand Arabic vocabulary, linguistic structure, and contextual relationships during autoregressive text generation. The goal of this change is to improve recognition quality for Arabic handwritten text and reduce character-level prediction errors.

Implementation

We used Python, PyTorch, and the Hugging Face Transformers library. Existing TrOCR model weights were used as initialization, while preprocessing, fine-tuning, evaluation, and metric computation were implemented for this project.

Challenges

Experiments and Results

Dataset

We use KHATT v1.0, an Arabic offline handwritten text-line dataset containing samples from many writers.

Evaluation Metrics

Metric | Description | Direction
CER | Character-level edit distance divided by reference length | Lower is better
WER | Word-level edit distance divided by reference length | Lower is better
BLEU | N-gram overlap between prediction and reference | Higher is better
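
As a minimal sketch of how CER and WER are commonly computed, the following uses a plain Levenshtein distance normalized by reference length. The function names are illustrative, and this is not necessarily the exact evaluation code used in this project. Any Arabic-specific text normalization, such as diacritic handling, is left out.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference item
                curr[j - 1] + 1,     # insert a hypothesis item
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


reference = "يغير العالم"
hypothesis = "يغير العلم"   # one character missing in the second word
print(f"CER = {cer(reference, hypothesis):.4f}")
print(f"WER = {wer(reference, hypothesis):.4f}")
```

A single missing character costs one character edit but a whole word error, which is why WER is typically higher than CER on the same predictions.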

Quantitative Results

Baseline CER: 18.45% (cross-attention Arabic HTR baseline)
Our CER: 8.10% (fine-tuned TrOCR)
Our WER: 12.50% (fine-tuned TrOCR)

The fine-tuned TrOCR model reduced CER from 18.45% to 8.10%, corresponding to approximately a 56% relative reduction in character error rate.

Model | Dataset | CER ↓ | WER ↓
Cross-Attention Arabic HTR Baseline | KHATT | 18.45% | -
Fine-tuned TrOCR + Arabic Decoder | KHATT v1.0 | 8.10% | 12.50%
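
The relative reduction quoted above follows directly from the two CER values. A short check, with an illustrative helper name:

```python
def relative_reduction(baseline: float, ours: float) -> float:
    """Fraction of the baseline error that was removed."""
    return (baseline - ours) / baseline


baseline_cer = 18.45  # cross-attention Arabic HTR baseline, in percent
our_cer = 8.10        # fine-tuned TrOCR, in percent

print(f"absolute reduction: {baseline_cer - our_cer:.2f} points")
print(f"relative reduction: {relative_reduction(baseline_cer, our_cer):.1%}")  # 56.1%
```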

Training and Evaluation Curves

Figure 2: Training loss curve.
Figure 3: Evaluation loss curve.

Qualitative Results

The following examples demonstrate predictions generated by the fine-tuned TrOCR model on handwritten Arabic text-line images.

Figure 4: Example predictions produced by the fine-tuned TrOCR model.

Common OCR failure cases include confusing visually similar Arabic letters, missing dots, connected words, and noisy handwriting.
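
One way to inspect failure cases like these is to tally which character segments differ between a reference and a prediction. The sketch below uses difflib from the Python standard library. It is an illustrative analysis helper, not part of the reported evaluation, and the alignment difflib produces is a heuristic rather than a guaranteed minimum-edit alignment.

```python
import difflib
from collections import Counter


def char_confusions(reference: str, hypothesis: str) -> Counter:
    """Count (reference_segment, hypothesis_segment) pairs where the strings differ.

    An empty first element means an insertion, an empty second element a deletion.
    """
    counts: Counter = Counter()
    matcher = difflib.SequenceMatcher(a=reference, b=hypothesis, autojunk=False)
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        if tag != "equal":
            counts[(reference[a0:a1], hypothesis[b0:b1])] += 1
    return counts


# The two middle letters share a base shape and differ only in their dots.
for pair, count in char_confusions("بيت", "بنت").items():
    print(pair, count)
```

Summing these counters over a validation set gives a rough picture of which letter pairs the model confuses most often.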

Conclusion and Future Work

This project shows that fine-tuning TrOCR is effective for Arabic handwritten text-line recognition on KHATT v1.0. The model achieved a CER of 8.10% and a WER of 12.50%, compared with a CER of 18.45% for the selected cross-attention Arabic handwritten recognition baseline.

Future work includes testing larger TrOCR variants, exploring stronger augmentation strategies, evaluating more Arabic handwriting datasets, and improving Arabic-specific language modeling.

References

  1. Minghao Li et al., TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models, 2022.
  2. Saleh Momeni and Bagher BabaAli, A Transformer-based Approach for Arabic Offline Handwritten Text Recognition, 2023.
  3. KHATT v1.0 Dataset.
  4. Hugging Face Transformers Library.
  5. PyTorch Deep Learning Framework.