camel_parser_dialects

CamelParser-Dialects is a state-of-the-art dependency parsing model for dialectal Arabic and Modern Standard Arabic (MSA), designed under the CATiB dependency formalism.

It is based on the biaffine attention parser architecture introduced by Dozat and Manning (2017), implemented using SuPar. The model leverages CamelBERT-MIX, a pretrained language model trained on a large and diverse Arabic corpus.

Full details are available in our paper: "Parsing Arabic Dialects Revisited: New Benchmarks, Models, and Insights"

📊 Model Variants

| Checkpoint | Training Data | MSA | EGY | GLF | AVG |
|---|---|---|---|---|---|
| `CAMeL-Lab/camelparser-dialects-MSA` | CamelTB, PATB | 87.3 | 73.0 | 73.3 | 77.9 |
| `CAMeL-Lab/camelparser-dialects-EGY` | ARZTB | 79.2 | 83.9 | 68.7 | 77.3 |
| `CAMeL-Lab/camelparser-dialects-GLF` | CamelTB-Gumar | 65.4 | 58.7 | 73.8 | 66.0 |
| `CAMeL-Lab/camelparser-dialects-MSA-EGY` | CamelTB, PATB, ARZTB | 87.1 | 84.4 | 70.1 | 79.8 |
| `CAMeL-Lab/camelparser-dialects-MSA-GLF` | CamelTB, PATB, CamelTB-Gumar | 87.2 | 74.4 | 81.0 | 80.9 |
| `CAMeL-Lab/camelparser-dialects-EGY-GLF` | ARZTB, CamelTB-Gumar | 80.0 | 83.8 | 79.4 | 81.1 |
| `CAMeL-Lab/camelparser-dialects-MSA-EGY-GLF` | CamelTB, PATB, ARZTB, CamelTB-Gumar | 87.2 | 84.2 | 80.3 | 83.9 |

All scores are LAS (Labeled Attachment Score) on the test sets.
The recommended checkpoint is the all-variety model (MSA-EGY-GLF), which provides the best overall cross-dialect performance.
Model weights are compatible with CamelParser2.0 and SuPar. This repository includes a CamelParser submodule and CLI wrapper for direct inference.
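For readers unfamiliar with the metric, the sketch below shows how LAS is conventionally computed: a token counts as correct only when both its predicted head and its dependency label match the gold annotation (UAS checks the head alone). The function name `attachment_scores` and the toy arcs are illustrative assumptions, and the exact evaluation settings behind the table (for example, how punctuation is treated) are not stated here.

```python
from typing import List, Tuple

# One arc per token: (index of the head token, dependency label).
Arc = Tuple[int, str]


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages over two aligned token lists."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted token counts differ")
    if not gold:
        raise ValueError("no tokens to score")
    # UAS: the head matches. LAS: the head and the label both match.
    head_ok = sum(g[0] == p[0] for g, p in zip(gold, pred))
    both_ok = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return 100.0 * head_ok / n, 100.0 * both_ok / n


# Toy arcs with CATiB-style labels (hand-written, not real parser output).
gold = [(2, "SBJ"), (0, "---"), (2, "OBJ"), (3, "MOD")]
pred = [(2, "SBJ"), (0, "---"), (2, "MOD"), (2, "MOD")]
uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.1f} LAS={las:.1f}")  # UAS=75.0 LAS=50.0
```

In the toy example, the third token has the right head but the wrong label, so it counts toward UAS only; the fourth token has the wrong head, so it counts toward neither.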

🚀 CLI Inference

Clone this repository with its CamelParser submodule:

git clone --recurse-submodules https://github.com/CAMeL-Lab/camel_parser_dialects.git
cd camel_parser_dialects

If you already cloned the repository, initialize the submodule:

git submodule update --init --recursive

Create an environment and install CamelParser dependencies:

conda create -n camel-parser-dialects python=3.11.13
conda activate camel-parser-dialects
pip install -r camel_parser/requirements.txt
pip install conllu

For raw text (-f text) and cleaned whitespace-tokenized text (-f preprocessed_text), install the default CAMeL Tools morphology and disambiguation data:

camel_data -i morphology-db-msa-r13
camel_data -i disambig-bert-unfactored-msa

These CAMeL Tools data packages are not needed for -f conll, -f tokenized, or -f tokenized_tagged.
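The rule above can be written down as a small lookup, which may help when scripting setup for several input formats. This is only a sketch: the helper name `required_packages` is hypothetical and not part of this repository; the format names and package names are the ones listed above.

```python
# Input formats that need the CAMeL Tools data packages.
NEEDS_CAMEL_DATA = {"text", "preprocessed_text"}
# Input formats that do not need them.
NO_CAMEL_DATA = {"conll", "tokenized", "tokenized_tagged"}
# Packages installed with `camel_data -i <name>`.
CAMEL_DATA_PACKAGES = ["morphology-db-msa-r13", "disambig-bert-unfactored-msa"]


def required_packages(file_format: str) -> list:
    """Return the camel_data packages to install for a given -f value."""
    if file_format in NEEDS_CAMEL_DATA:
        return list(CAMEL_DATA_PACKAGES)
    if file_format in NO_CAMEL_DATA:
        return []
    raise ValueError(f"unknown input format: {file_format!r}")


for fmt in ("text", "conll"):
    print(fmt, "->", required_packages(fmt))
```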

List available model aliases:

python download_models.py --list-models

Download all dialect parser models:

python download_models.py

Download one model:

python download_models.py --model msa-egy-glf

The available aliases are:

Alias	Hugging Face repo
`msa`	`CAMeL-Lab/camelparser-dialects-MSA`
`egy`	`CAMeL-Lab/camelparser-dialects-EGY`
`glf`	`CAMeL-Lab/camelparser-dialects-GLF`
`msa-egy`	`CAMeL-Lab/camelparser-dialects-MSA-EGY`
`msa-glf`	`CAMeL-Lab/camelparser-dialects-MSA-GLF`
`egy-glf`	`CAMeL-Lab/camelparser-dialects-EGY-GLF`
`msa-egy-glf`	`CAMeL-Lab/camelparser-dialects-MSA-EGY-GLF`

The CLI also accepts catib- prefixed aliases, e.g., catib-msa-egy-glf.
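The alias scheme is regular enough to express in a few lines. The sketch below is an illustration of the mapping in the table, not the code in `dialect_models.py`; the function name `resolve_alias` is an assumption, as is the choice to handle the `catib-` prefix by simply stripping it.

```python
ALIASES = ["msa", "egy", "glf", "msa-egy", "msa-glf", "egy-glf", "msa-egy-glf"]

# Each repo name is the upper-cased alias appended to a fixed prefix.
REPOS = {a: f"CAMeL-Lab/camelparser-dialects-{a.upper()}" for a in ALIASES}


def resolve_alias(alias: str) -> str:
    """Map an alias (optionally prefixed with 'catib-') to its Hugging Face repo."""
    key = alias.removeprefix("catib-")  # requires Python 3.9+
    if key not in REPOS:
        raise KeyError(f"unknown model alias: {alias!r}")
    return REPOS[key]


print(resolve_alias("catib-msa-egy-glf"))
```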

Parse a raw text file:

python dialect_parse_cli.py -i input.txt -f text -m msa-egy-glf > output.conllx

Parse a string:

python dialect_parse_cli.py -s "جامعة نيويورك تنشر أطلس." -f text -m msa-egy-glf

Parse cleaned, whitespace-tokenized text:

python dialect_parse_cli.py -i input_tokenized.txt -f preprocessed_text -m glf > output.conllx

Parse already-tokenized text without POS tagging or feature generation:

python dialect_parse_cli.py -s "جامعة نيويورك تنشر أطلس ." -f tokenized -m egy

Parse an existing CoNLL/CoNLL-X file:

python dialect_parse_cli.py -i input.conllx -f conll -m msa-egy > output.conllx

The CLI downloads the selected model from Hugging Face if it is not already present under models/.
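To work with the parser output downstream, a CoNLL-X file can be read with a few lines of standard-library Python. The sketch below assumes the standard ten tab-separated CoNLL-X columns (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) and blank lines between sentences. The sample analysis is hand-written for illustration and is not actual output from any of the models; `read_conllx` and `Token` are hypothetical names.

```python
from typing import Iterator, List, NamedTuple


class Token(NamedTuple):
    idx: int
    form: str
    pos: str
    head: int
    deprel: str


def read_conllx(text: str) -> Iterator[List[Token]]:
    """Yield one list of tokens per sentence from CoNLL-X formatted text."""
    sentence: List[Token] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # skip comment lines, if any are present
        cols = line.split("\t")
        sentence.append(Token(int(cols[0]), cols[1], cols[3], int(cols[6]), cols[7]))
    if sentence:
        yield sentence  # file did not end with a blank line


# Hand-written toy rows: (ID, FORM, POS, HEAD, DEPREL). Illustrative only.
ROWS = [
    ("1", "جامعة", "NOM", "3", "SBJ"),
    ("2", "نيويورك", "PROP", "1", "IDF"),
    ("3", "تنشر", "VRB", "0", "---"),
    ("4", "أطلس", "NOM", "3", "OBJ"),
    ("5", ".", "PNX", "3", "MOD"),
]
SAMPLE = "\n".join(
    "\t".join([i, form, "_", pos, pos, "_", head, rel, "_", "_"])
    for i, form, pos, head, rel in ROWS
) + "\n\n"

sentences = list(read_conllx(SAMPLE))
root = [t.form for t in sentences[0] if t.head == 0]
print(len(sentences), "sentence;", "root:", root)
```

The `conllu` package installed earlier is another option for reading these files; the plain-Python version is shown only to keep the example self-contained.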

📚 Data

The models are trained on combinations of the following treebanks:

CamelTB (MSA): camel_treebank_1.1.zip
PATB (Penn Arabic Treebank): LDC2010T13, LDC2011T09, LDC2010T08
ARZTB (Egyptian Arabic Treebank): LDC2018T23
CamelTB-Gumar (Gulf Arabic): CamelTB-Gumar.1.0.zip

The preprocessed data can be extracted using muddler. After installing it with pip install muddler, unlock the muddled files provided under the data/ directory by following the steps below for each treebank.

CamelTB (MSA):
1. Download camel_treebank_1.1.zip from:
  - https://sites.google.com/nyu.edu/camel-treebank/resources
2. Run the following command to unlock the muddled file.
```
muddler unmuddle -s camel_treebank_1.1.zip -m data/CamelTB.zip.muddle data/CamelTB.zip
```
3. Unzip the file with `unzip data/CamelTB.zip -d data`

PATB (Penn Arabic Treebank):
1. Download the following archives from their LDC catalog pages:
  - atb1_v4_1_LDC2010T13.tgz: https://catalog.ldc.upenn.edu/LDC2010T13
  - atb_2_3.1_LDC2011T09.tgz: https://catalog.ldc.upenn.edu/LDC2011T09
  - atb3_v3_2_LDC2010T08.tgz: https://catalog.ldc.upenn.edu/LDC2010T08
2. Place them in a directory, e.g., ldc_files/
3. Run the following command to unlock the muddled file.
```
muddler unmuddle -s ldc_files -m data/PATB.zip.muddle data/PATB.zip
```
4. Unzip the file with `unzip data/PATB.zip -d data`

ARZTB (Egyptian Arabic Treebank):
1. Download bolt_arz-df_LDC2018T23.tgz from:
  - https://catalog.ldc.upenn.edu/LDC2018T23
2. Run the following command to unlock the muddled file.
```
muddler unmuddle -s bolt_arz-df_LDC2018T23.tgz -m data/arz_data.zip.muddle data/arz_data.zip
```
3. Unzip the file with `unzip data/arz_data.zip -d data`

CamelTB-Gumar (Gulf Arabic):
1. Download CamelTB-Gumar.1.0.zip from:
  - https://forms.gle/54WSUt7Z9m9vk6p69
2. Run the following command to unlock the muddled file.
```
muddler unmuddle -s CamelTB-Gumar.1.0.zip -m data/CamelTB-Gumar_data.zip.muddle data/CamelTB-Gumar_data.zip
```
3. Unzip the file with `unzip data/CamelTB-Gumar_data.zip -d data`
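The four unlock procedures above share one pattern: an unmuddle command followed by an unzip into data/. As a convenience, the sketch below generates those commands from a table of paths. It only prints them; the helper name `commands` is hypothetical, and the source archives are assumed to sit in the repository root (or in ldc_files/ for PATB) as in the steps above.

```python
import shlex

# (source archive or directory, muddled file, output zip), as given in the steps above.
TREEBANKS = [
    ("camel_treebank_1.1.zip", "data/CamelTB.zip.muddle", "data/CamelTB.zip"),
    ("ldc_files", "data/PATB.zip.muddle", "data/PATB.zip"),
    ("bolt_arz-df_LDC2018T23.tgz", "data/arz_data.zip.muddle", "data/arz_data.zip"),
    ("CamelTB-Gumar.1.0.zip", "data/CamelTB-Gumar_data.zip.muddle",
     "data/CamelTB-Gumar_data.zip"),
]


def commands(source: str, muddle: str, out_zip: str, dest: str = "data") -> list:
    """Return the unmuddle and unzip commands for one treebank as argument lists."""
    return [
        ["muddler", "unmuddle", "-s", source, "-m", muddle, out_zip],
        ["unzip", out_zip, "-d", dest],
    ]


for entry in TREEBANKS:
    for cmd in commands(*entry):
        print(shlex.join(cmd))
```

To execute the commands instead of printing them, each argument list could be passed to `subprocess.run(cmd, check=True)`.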

📖 Citation

If you use this model, please cite:

@inproceedings{Elshabrawy:2026:camelparser-dialects,
    title = "{Parsing Arabic Dialects Revisited: New Benchmarks, Models, and Insights}",
    author = {Ahmed Elshabrawy and
              Go Inoue and
              Muhammed AbuOdeh and
              Nizar Habash} ,
    booktitle = {Proceedings of The 7th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT)},
    year = "2026",
    address = "Palma, Spain"
}
