CamelParser-Dialects is a state-of-the-art dependency parsing model for dialectal Arabic and Modern Standard Arabic (MSA), designed under the CATiB dependency formalism.
It is based on the biaffine attention parser architecture introduced by Dozat and Manning (2017), implemented using SuPar. The model leverages CamelBERT-MIX, a pretrained language model trained on a large and diverse Arabic corpus.
Full details are available in our paper: "Parsing Arabic Dialects Revisited: New Benchmarks, Models, and Insights".
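As a rough illustration of the biaffine arc scoring idea behind this architecture, the sketch below scores every candidate head for each token and greedily picks the highest-scoring one. It is a simplified pure-Python toy, not SuPar's implementation: the toy vectors, the weight matrix U, the bias vector u, and the helper names arc_score and predict_heads are all hypothetical, and an actual parser uses learned representations and may apply tree-constrained decoding instead of greedy selection.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def arc_score(dep: Vector, head: Vector, U: Matrix, u: Vector) -> float:
    """Biaffine score for `head` governing `dep`: head^T U dep + head^T u."""
    bilinear = sum(
        head[a] * U[a][b] * dep[b]
        for a in range(len(head))
        for b in range(len(dep))
    )
    bias = sum(h * w for h, w in zip(head, u))
    return bilinear + bias


def predict_heads(deps: List[Vector], heads: List[Vector],
                  U: Matrix, u: Vector) -> List[int]:
    """Greedily pick a head index for each token.

    `heads[0]` is the ROOT representation and `heads[k + 1]` belongs to
    token k, so the returned indices are 1-based with 0 meaning ROOT.
    """
    predicted = []
    for i, dep in enumerate(deps):
        # A token may not be its own head, so skip position i + 1.
        candidates = [j for j in range(len(heads)) if j != i + 1]
        best = max(candidates, key=lambda j: arc_score(dep, heads[j], U, u))
        predicted.append(best)
    return predicted


# Toy two-token example with an identity weight matrix and zero bias.
identity = [[1.0, 0.0], [0.0, 1.0]]
toy_deps = [[1.0, 0.0], [0.0, 1.0]]
toy_heads = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # ROOT, token 1, token 2
toy_prediction = predict_heads(toy_deps, toy_heads, identity, [0.0, 0.0])
# toy_prediction == [0, 1]: token 1 attaches to ROOT, token 2 to token 1.
```

Label prediction works along the same lines, with one score per dependency label for each chosen arc.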
| Checkpoint | Training Data | MSA | EGY | GLF | AVG |
|---|---|---|---|---|---|
| CAMeL-Lab/camelparser-dialects-MSA | CamelTB, PATB | 87.3 | 73.0 | 73.3 | 77.9 |
| CAMeL-Lab/camelparser-dialects-EGY | ARZTB | 79.2 | 83.9 | 68.7 | 77.3 |
| CAMeL-Lab/camelparser-dialects-GLF | CamelTB-Gumar | 65.4 | 58.7 | 73.8 | 66.0 |
| CAMeL-Lab/camelparser-dialects-MSA-EGY | CamelTB, PATB, ARZTB | 87.1 | 84.4 | 70.1 | 79.8 |
| CAMeL-Lab/camelparser-dialects-MSA-GLF | CamelTB, PATB, CamelTB-Gumar | 87.2 | 74.4 | 81.0 | 80.9 |
| CAMeL-Lab/camelparser-dialects-EGY-GLF | ARZTB, CamelTB-Gumar | 80.0 | 83.8 | 79.4 | 81.1 |
| CAMeL-Lab/camelparser-dialects-MSA-EGY-GLF | CamelTB, PATB, ARZTB, CamelTB-Gumar | 87.2 | 84.2 | 80.3 | 83.9 |
- All scores are LAS (Labeled Attachment Score) on TEST.
- The recommended checkpoint is the all-variety model (MSA-EGY-GLF), which provides the best overall cross-dialect performance.
- Model weights are compatible with CamelParser2.0 and SuPar. This repository includes a CamelParser submodule and CLI wrapper for direct inference.
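For readers unfamiliar with the metric, LAS is the percentage of tokens whose predicted head and dependency label both match the gold annotation. The following is a minimal sketch of that definition, not the evaluation script used for the table above: the function name las and the toy arcs are hypothetical, and it counts every token without any special handling (for example of punctuation).

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head index, dependency label) for one token


def las(gold: List[Arc], pred: List[Arc]) -> float:
    """Labeled Attachment Score as a percentage over all tokens."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty sentence")
    # A token counts as correct only if head AND label both match.
    correct = sum(1 for g, p in zip(gold, pred) if g == p)
    return 100.0 * correct / len(gold)


# Toy four-token sentence; the labels are illustrative only.
gold_arcs = [(2, "SBJ"), (0, "---"), (2, "OBJ"), (3, "MOD")]
pred_arcs = [(2, "SBJ"), (0, "---"), (2, "MOD"), (2, "MOD")]
toy_las = las(gold_arcs, pred_arcs)
# toy_las == 50.0: tokens 1 and 2 are fully correct, token 3 has the
# wrong label, and token 4 has the wrong head.
```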
Clone this repository with its CamelParser submodule:
git clone --recurse-submodules https://github.com/CAMeL-Lab/camel_parser_dialects.git
cd camel_parser_dialects
If you already cloned the repository, initialize the submodule:
git submodule update --init --recursive
Create an environment and install CamelParser dependencies:
conda create -n camel-parser-dialects python=3.11.13
conda activate camel-parser-dialects
pip install -r camel_parser/requirements.txt
pip install conllu
For raw text (-f text) and cleaned whitespace-tokenized text (-f preprocessed_text), install the default CAMeL Tools morphology and disambiguation data:
camel_data -i morphology-db-msa-r13
camel_data -i disambig-bert-unfactored-msa
These CAMeL Tools data packages are not needed for -f conll, -f tokenized, or -f tokenized_tagged.
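The rule above about which input formats need the CAMeL Tools data can be captured in a small lookup. This is only a convenience sketch: the helper name required_camel_data is hypothetical and is not part of this repository or of CAMeL Tools. The format names and package names are the ones listed above.

```python
from typing import List

# Formats that run morphological analysis and disambiguation first.
FORMATS_NEEDING_DATA = {"text", "preprocessed_text"}
# Formats that arrive already tokenized (and possibly tagged).
FORMATS_WITHOUT_DATA = {"conll", "tokenized", "tokenized_tagged"}

CAMEL_DATA_PACKAGES = [
    "morphology-db-msa-r13",
    "disambig-bert-unfactored-msa",
]


def required_camel_data(input_format: str) -> List[str]:
    """Return the camel_data packages to install for a given -f value."""
    if input_format in FORMATS_NEEDING_DATA:
        return list(CAMEL_DATA_PACKAGES)
    if input_format in FORMATS_WITHOUT_DATA:
        return []
    raise ValueError(f"unknown input format: {input_format!r}")


for fmt in ("text", "tokenized"):
    packages = required_camel_data(fmt)
    print(fmt, "->", packages if packages else "no CAMeL Tools data needed")
```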
List available model aliases:
python download_models.py --list-models
Download all dialect parser models:
python download_models.py
Download one model:
python download_models.py --model msa-egy-glf
The available aliases are:
| Alias | Hugging Face repo |
|---|---|
| msa | CAMeL-Lab/camelparser-dialects-MSA |
| egy | CAMeL-Lab/camelparser-dialects-EGY |
| glf | CAMeL-Lab/camelparser-dialects-GLF |
| msa-egy | CAMeL-Lab/camelparser-dialects-MSA-EGY |
| msa-glf | CAMeL-Lab/camelparser-dialects-MSA-GLF |
| egy-glf | CAMeL-Lab/camelparser-dialects-EGY-GLF |
| msa-egy-glf | CAMeL-Lab/camelparser-dialects-MSA-EGY-GLF |
The CLI also accepts catib- prefixed aliases, e.g., catib-msa-egy-glf.
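If you need to map an alias to its Hugging Face repo in your own scripts, a lookup along these lines may help. It is a sketch rather than the logic in download_models.py: the function name resolve_alias is hypothetical, and it assumes that a catib- prefixed alias points to the same repo as the unprefixed one.

```python
MODEL_ALIASES = {
    "msa": "CAMeL-Lab/camelparser-dialects-MSA",
    "egy": "CAMeL-Lab/camelparser-dialects-EGY",
    "glf": "CAMeL-Lab/camelparser-dialects-GLF",
    "msa-egy": "CAMeL-Lab/camelparser-dialects-MSA-EGY",
    "msa-glf": "CAMeL-Lab/camelparser-dialects-MSA-GLF",
    "egy-glf": "CAMeL-Lab/camelparser-dialects-EGY-GLF",
    "msa-egy-glf": "CAMeL-Lab/camelparser-dialects-MSA-EGY-GLF",
}

CATIB_PREFIX = "catib-"


def resolve_alias(alias: str) -> str:
    """Map a model alias (optionally catib- prefixed) to a repo id."""
    key = alias
    # Assumption: the prefix only names the formalism and can be dropped.
    if key.startswith(CATIB_PREFIX):
        key = key[len(CATIB_PREFIX):]
    if key not in MODEL_ALIASES:
        known = ", ".join(sorted(MODEL_ALIASES))
        raise ValueError(f"unknown alias {alias!r}; expected one of: {known}")
    return MODEL_ALIASES[key]


print(resolve_alias("catib-msa-egy-glf"))
```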
Parse a raw text file:
python dialect_parse_cli.py -i input.txt -f text -m msa-egy-glf > output.conllx
Parse a string:
python dialect_parse_cli.py -s "جامعة نيويورك تنشر أطلس." -f text -m msa-egy-glf
Parse cleaned, whitespace-tokenized text:
python dialect_parse_cli.py -i input_tokenized.txt -f preprocessed_text -m glf > output.conllx
Parse already-tokenized text without POS tagging or feature generation:
python dialect_parse_cli.py -s "جامعة نيويورك تنشر أطلس ." -f tokenized -m egy
Parse an existing CoNLL/CoNLL-X file:
python dialect_parse_cli.py -i input.conllx -f conll -m msa-egy > output.conllx
The CLI downloads the selected model from Hugging Face if it is not already present under models/.
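To call the CLI from Python instead of the shell, one option is to build the same command line and run it with subprocess. The sketch below uses only the flags shown above (-i, -s, -f, -m). The helper names build_parse_command and parse_to_conllx are hypothetical, and the sketch assumes it is run from the repository root and that the parse is written to standard output, as the shell redirections above suggest.

```python
import subprocess
import sys
from typing import List, Optional


def build_parse_command(model: str, input_format: str,
                        input_file: Optional[str] = None,
                        sentence: Optional[str] = None) -> List[str]:
    """Build the argument list for dialect_parse_cli.py."""
    if (input_file is None) == (sentence is None):
        raise ValueError("pass exactly one of input_file or sentence")
    cmd = [sys.executable, "dialect_parse_cli.py",
           "-f", input_format, "-m", model]
    if input_file is not None:
        cmd += ["-i", input_file]
    else:
        cmd += ["-s", sentence]
    return cmd


def parse_to_conllx(model: str, input_format: str,
                    input_file: Optional[str] = None,
                    sentence: Optional[str] = None) -> str:
    """Run the CLI and return whatever it prints to standard output."""
    cmd = build_parse_command(model, input_format, input_file, sentence)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


# Building the command does not run the parser.
example_cmd = build_parse_command("egy", "tokenized",
                                  sentence="جامعة نيويورك تنشر أطلس .")
print(example_cmd[1:])
```

Passing the arguments as a list avoids shell quoting issues with Arabic text and spaces.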
The models are trained on combinations of the following treebanks:
- CamelTB (MSA): camel_treebank_1.1.zip
- PATB (Penn Arabic Treebank): LDC2010T13, LDC2011T09, LDC2010T08
- ARZTB (Egyptian Arabic Treebank): LDC2018T23
- CamelTB-Gumar (Gulf Arabic): CamelTB-Gumar.1.0.zip
The preprocessed data can be extracted using muddler.
Once it is installed with pip install muddler, extract the muddled files provided under the data/ directory by following the steps below for each treebank.
- CamelTB (MSA):
  - Download camel_treebank_1.1.zip.
  - Run the following command to unlock the muddled file:
    muddler unmuddle -s camel_treebank_1.1.zip -m data/CamelTB.zip.muddle data/CamelTB.zip
  - Unzip the file with:
    unzip data/CamelTB.zip -d data
- PATB (Penn Arabic Treebank):
  - Download the following files from the following LDC releases:
    - atb1_v4_1_LDC2010T13.tgz: https://catalog.ldc.upenn.edu/LDC2010T13
    - atb_2_3.1_LDC2011T09.tgz: https://catalog.ldc.upenn.edu/LDC2011T09
    - atb3_v3_2_LDC2010T08.tgz: https://catalog.ldc.upenn.edu/LDC2010T08
  - Place them in a directory, e.g., ldc_files/
  - Run the following command to unlock the muddled file:
    muddler unmuddle -s ldc_files -m data/PATB.zip.muddle data/PATB.zip
  - Unzip the file with:
    unzip data/PATB.zip -d data
- ARZTB (Egyptian Arabic Treebank):
  - Download bolt_arz-df_LDC2018T23.tgz.
  - Run the following command to unlock the muddled file:
    muddler unmuddle -s bolt_arz-df_LDC2018T23.tgz -m data/arz_data.zip.muddle data/arz_data.zip
  - Unzip the file with:
    unzip data/arz_data.zip -d data
- CamelTB-Gumar (Gulf Arabic):
  - Download CamelTB-Gumar.1.0.zip.
  - Run the following command to unlock the muddled file:
    muddler unmuddle -s CamelTB-Gumar.1.0.zip -m data/CamelTB-Gumar_data.zip.muddle data/CamelTB-Gumar_data.zip
  - Unzip the file with:
    unzip data/CamelTB-Gumar_data.zip -d data
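The four treebanks follow the same unmuddle-then-unzip pattern, so the steps can be scripted. The sketch below is one possible way to do that and is not part of this repository: the names Treebank, TREEBANKS, unmuddle_command, and extract_treebank are hypothetical. It assumes the source files were already downloaded into the working directory (or ldc_files/ for PATB), that muddler is on the PATH, and it uses Python's zipfile module in place of the unzip command.

```python
import subprocess
import zipfile
from typing import Dict, List, NamedTuple


class Treebank(NamedTuple):
    source: str   # downloaded file or directory passed to -s
    muddle: str   # muddled file shipped under data/
    archive: str  # zip archive produced by muddler


TREEBANKS: Dict[str, Treebank] = {
    "CamelTB": Treebank("camel_treebank_1.1.zip",
                        "data/CamelTB.zip.muddle", "data/CamelTB.zip"),
    "PATB": Treebank("ldc_files",
                     "data/PATB.zip.muddle", "data/PATB.zip"),
    "ARZTB": Treebank("bolt_arz-df_LDC2018T23.tgz",
                      "data/arz_data.zip.muddle", "data/arz_data.zip"),
    "CamelTB-Gumar": Treebank("CamelTB-Gumar.1.0.zip",
                              "data/CamelTB-Gumar_data.zip.muddle",
                              "data/CamelTB-Gumar_data.zip"),
}


def unmuddle_command(tb: Treebank) -> List[str]:
    """Build the muddler call that unlocks one muddled archive."""
    return ["muddler", "unmuddle", "-s", tb.source, "-m", tb.muddle, tb.archive]


def extract_treebank(name: str, out_dir: str = "data") -> None:
    """Unlock one treebank and unpack it into out_dir."""
    tb = TREEBANKS[name]
    subprocess.run(unmuddle_command(tb), check=True)
    # Equivalent to: unzip <archive> -d <out_dir>
    with zipfile.ZipFile(tb.archive) as zf:
        zf.extractall(out_dir)


# Print the commands without running them.
for tb_name, tb in TREEBANKS.items():
    print(tb_name, ":", " ".join(unmuddle_command(tb)))
```

Calling extract_treebank("PATB") would then run both steps for the Penn Arabic Treebank files.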
If you use this model, please cite:
@inproceedings{Elshabrawy:2026:camelparser-dialects,
title = "{Parsing Arabic Dialects Revisited: New Benchmarks, Models, and Insights}",
author = {Ahmed Elshabrawy and
Go Inoue and
Muhammed AbuOdeh and
Nizar Habash} ,
booktitle = {Proceedings of The 7th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT)},
year = "2026",
address = "Palma, Spain"
}