Arabic Handwritten Text Recognition using TrOCR

Abstract

Arabic handwritten text recognition is a challenging OCR problem because Arabic script is cursive, characters change shape depending on their position in a word, and handwriting styles vary significantly between writers.

This project fine-tunes TrOCR, a Transformer-based encoder-decoder OCR model, on the KHATT v1.0 Arabic handwritten text-line dataset.

Our system takes an image of a handwritten Arabic text line as input and outputs the corresponding digital Arabic text. The fine-tuned TrOCR model achieved a CER of 8.10% and a WER of 12.50%, showing a strong improvement compared with the cross-attention Arabic handwritten recognition baseline.

Task Arabic handwritten text-line recognition

Dataset KHATT v1.0

Main Model Fine-tuned TrOCR

Main Metric Character Error Rate (CER)

Teaser Figure

A handwritten Arabic text-line image is passed through a fine-tuned TrOCR encoder-decoder model, which generates the recognized Arabic text. The system is trained and evaluated on KHATT v1.0 and compared against a cross-attention Arabic handwritten recognition baseline.

Figure 1: Overview of the proposed Arabic handwritten OCR pipeline.

Introduction

Optical Character Recognition (OCR) converts images of text into machine-readable text. While OCR systems have achieved strong performance for printed Latin text, Arabic handwritten OCR remains difficult because Arabic is cursive and highly variable.

Traditional OCR systems usually combine CNNs for visual feature extraction, RNNs for sequence modeling, and CTC for alignment. These pipelines are effective but complex.

TrOCR simplifies the pipeline by using a Vision Transformer encoder and a Transformer decoder in a fully end-to-end architecture.

In this project, we investigate whether a large pre-trained Transformer OCR model can transfer effectively to Arabic handwritten text recognition.

Approach

System Input and Output

The input to the system is a single handwritten Arabic text-line image. The output is the corresponding digital Arabic text transcription.

الذكاء الاصطناعي يغير العالم

TrOCR Architecture

TrOCR is an encoder-decoder Transformer OCR model. The encoder processes the image and extracts visual features, while the decoder generates the output text token by token.

1. Image Preprocessing Input handwritten images are resized and normalized before training.

2. Vision Encoder The image is divided into patches and processed using a Vision Transformer encoder.

3. Cross Attention The decoder attends to visual features extracted by the encoder while generating text.

4. Text Decoder The decoder generates Arabic text autoregressively, one token at a time.

Arabic Decoder Modification

Since the original TrOCR decoder is mainly designed for English language modeling, we modified the decoder to better support Arabic text generation by initializing it with bert-base-arabic.

This modification helps the model better understand Arabic vocabulary, linguistic structure, and contextual relationships during autoregressive text generation. The goal of this change is to improve recognition quality for Arabic handwritten text and reduce character-level prediction errors.

Implementation

We used Python, PyTorch, and the Hugging Face Transformers library. Existing TrOCR model weights were used as initialization, while preprocessing, fine-tuning, evaluation, and metric computation were implemented for this project.

Challenges

Arabic writing direction: Arabic is written from right to left.
Handwriting variability: KHATT contains many different writing styles.
Character ambiguity: Many Arabic letters differ only by dots.
Evaluation fairness: CER is used as the primary evaluation metric.

Experiments and Results

Dataset

We use KHATT v1.0, an Arabic offline handwritten text-line dataset containing samples from many writers.

Evaluation Metrics

Metric	Description	Direction
CER	Character-level edit distance	Lower is better
WER	Word-level edit distance	Lower is better
BLEU	Sequence similarity metric	Higher is better

Quantitative Results

Baseline CER

18.45%

Cross-attention Arabic HTR baseline

Our CER

8.10%

Fine-tuned TrOCR

Our WER

12.50%

Fine-tuned TrOCR

        The fine-tuned TrOCR model reduced CER from 18.45% to 8.10%, corresponding to approximately a 56% relative reduction in character error rate.
      

Model	Dataset	CER ↓	WER ↓
Cross-Attention Arabic HTR Baseline	KHATT	18.45%	—
Fine-tuned TrOCR + Arabic Decoder	KHATT v1.0	8.10%	12.50%

Training and Evaluation Curves

Figure 2: Training loss curve.

Figure 3: Evaluation loss curve.

Qualitative Results

The following examples demonstrate predictions generated by the fine-tuned TrOCR model on handwritten Arabic text-line images.

Figure 4: Example predictions produced by the fine-tuned TrOCR model.

Common OCR failure cases include confusing visually similar Arabic letters, missing dots, connected words, and noisy handwriting.

Conclusion and Future Work

This project demonstrates that fine-tuning TrOCR is highly effective for Arabic handwritten text-line recognition. The model achieved a CER of 8.10% and significantly outperformed the selected cross-attention Arabic handwritten recognition baseline.

Future work includes testing larger TrOCR variants, exploring stronger augmentation strategies, evaluating more Arabic handwriting datasets, and improving Arabic-specific language modeling.

References

Minghao Li et al., TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models, 2022.
Saleh Momeni and Bagher BabaAli, A Transformer-based Approach for Arabic Offline Handwritten Text Recognition, 2023.
KHATT v1.0 Dataset.
Hugging Face Transformers Library.
PyTorch Deep Learning Framework.