camel_parser_dialects

CamelParser-Dialects is a state-of-the-art dependency parsing model for dialectal Arabic and Modern Standard Arabic (MSA), designed under the CATiB dependency formalism.

It is based on the biaffine attention parser architecture introduced by Dozat and Manning (2017), implemented using SuPar. The model leverages CamelBERT-MIX, a pretrained language model trained on a large and diverse Arabic corpus.
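
To make the architecture concrete, here is a minimal pure-Python sketch of biaffine arc scoring in the spirit of Dozat and Manning (2017). The vectors, weights, and function names are toy values chosen for illustration; they are assumptions, not SuPar's API or the released checkpoints, and real parsers typically decode a well-formed tree rather than taking a per-token argmax.

```python
# Illustrative sketch of biaffine arc scoring (Dozat and Manning, 2017).
# Toy vectors and weights; not taken from SuPar or the released checkpoints.

def biaffine_arc_scores(dep, head, U, u):
    """Return scores[i][j]: the score of token j being the head of token i.

    dep[i]  : arc-dependent representation of token i (length d)
    head[j] : arc-head representation of token j (length d)
    U       : d x d weight matrix (bilinear term)
    u       : length-d vector (head-only bias term)
    """
    n, d = len(dep), len(U)
    scores = []
    for i in range(n):
        row = []
        for j in range(n):
            bilinear = sum(head[j][a] * U[a][b] * dep[i][b]
                           for a in range(d) for b in range(d))
            bias = sum(head[j][a] * u[a] for a in range(d))
            row.append(bilinear + bias)
        scores.append(row)
    return scores


def greedy_heads(scores):
    """Pick the highest-scoring head for each token (no tree constraint)."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


# Toy example: index 0 plays the role of an artificial ROOT token.
dep = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
head = [[1.0, 1.0], [0.0, 2.0], [2.0, 0.0]]
U = [[1.0, 0.0], [0.0, 1.0]]  # identity, so the bilinear term is a dot product
u = [0.0, 0.0]

scores = biaffine_arc_scores(dep, head, U, u)
print(greedy_heads(scores)[1:])  # [2, 1]: predicted heads for tokens 1 and 2
```

In the actual model, the dependent and head representations would come from the CamelBERT-MIX encoder followed by learned projections; here they are hard-coded so the scoring step can be read in isolation.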

Full details are available in our paper: "Parsing Arabic Dialects Revisited: New Benchmarks, Models, and Insights"

📊 Model Variants

| Checkpoint | Training Data | MSA | EGY | GLF | AVG |
| --- | --- | --- | --- | --- | --- |
| CAMeL-Lab/camelparser-dialects-MSA | CamelTB, PATB | 87.3 | 73.0 | 73.3 | 77.9 |
| CAMeL-Lab/camelparser-dialects-EGY | ARZTB | 79.2 | 83.9 | 68.7 | 77.3 |
| CAMeL-Lab/camelparser-dialects-GLF | CamelTB-Gumar | 65.4 | 58.7 | 73.8 | 66.0 |
| CAMeL-Lab/camelparser-dialects-MSA-EGY | CamelTB, PATB, ARZTB | 87.1 | 84.4 | 70.1 | 79.8 |
| CAMeL-Lab/camelparser-dialects-MSA-GLF | CamelTB, PATB, CamelTB-Gumar | 87.2 | 74.4 | 81.0 | 80.9 |
| CAMeL-Lab/camelparser-dialects-EGY-GLF | ARZTB, CamelTB-Gumar | 80.0 | 83.8 | 79.4 | 81.1 |
| CAMeL-Lab/camelparser-dialects-MSA-EGY-GLF | CamelTB, PATB, ARZTB, CamelTB-Gumar | 87.2 | 84.2 | 80.3 | 83.9 |
  • All scores are LAS (Labeled Attachment Score) on the test sets.
  • The recommended checkpoint is the all-variety model (MSA-EGY-GLF), which provides the best overall cross-dialect performance.
  • Model weights are compatible with CamelParser2.0 and SuPar. This repository includes a CamelParser submodule and CLI wrapper for direct inference.
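
For readers unfamiliar with the metric, the following sketch shows how LAS (and the unlabeled variant, UAS) can be computed from aligned gold and predicted (head, label) pairs. The function name and the toy annotations are illustrative assumptions; the paper's exact evaluation protocol (for example, how punctuation is handled) is not reproduced here.

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) in percent for aligned lists of (head, label) pairs."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted token counts differ")
    if not gold:
        raise ValueError("no tokens to score")
    # UAS counts tokens whose head is correct; LAS also requires the label.
    head_ok = sum(g[0] == p[0] for g, p in zip(gold, pred))
    both_ok = sum(g == p for g, p in zip(gold, pred))
    total = len(gold)
    return 100.0 * head_ok / total, 100.0 * both_ok / total


# Hand-written toy annotations, not actual parser output.
gold = [(2, "SBJ"), (0, "---"), (2, "OBJ"), (3, "MOD")]
pred = [(2, "SBJ"), (0, "---"), (2, "MOD"), (2, "MOD")]

uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.1f} LAS={las:.1f}")  # UAS=75.0 LAS=50.0
```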

🚀 CLI Inference

Clone this repository with its CamelParser submodule:

git clone --recurse-submodules https://github.com/CAMeL-Lab/camel_parser_dialects.git
cd camel_parser_dialects

If you already cloned the repository, initialize the submodule:

git submodule update --init --recursive

Create an environment and install CamelParser dependencies:

conda create -n camel-parser-dialects python=3.11.13
conda activate camel-parser-dialects
pip install -r camel_parser/requirements.txt
pip install conllu

For raw text (-f text) and cleaned whitespace-tokenized text (-f preprocessed_text), install the default CAMeL Tools morphology and disambiguation data:

camel_data -i morphology-db-msa-r13
camel_data -i disambig-bert-unfactored-msa

These CAMeL Tools data packages are not needed for -f conll, -f tokenized, or -f tokenized_tagged.

List available model aliases:

python download_models.py --list-models

Download all dialect parser models:

python download_models.py

Download one model:

python download_models.py --model msa-egy-glf

The available aliases are:

Alias Hugging Face repo
msa CAMeL-Lab/camelparser-dialects-MSA
egy CAMeL-Lab/camelparser-dialects-EGY
glf CAMeL-Lab/camelparser-dialects-GLF
msa-egy CAMeL-Lab/camelparser-dialects-MSA-EGY
msa-glf CAMeL-Lab/camelparser-dialects-MSA-GLF
egy-glf CAMeL-Lab/camelparser-dialects-EGY-GLF
msa-egy-glf CAMeL-Lab/camelparser-dialects-MSA-EGY-GLF

The CLI also accepts catib- prefixed aliases, e.g., catib-msa-egy-glf.
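
If you need the same alias-to-repository mapping in your own scripts, a small lookup like the one below may be enough. This is a hypothetical helper that mirrors the alias table above; it is not the repository's own code, and the way it strips the catib- prefix is an assumption about how such aliases could be normalized.

```python
# Hypothetical helper mirroring the alias table; not the repository's own code.
MODEL_REPOS = {
    "msa": "CAMeL-Lab/camelparser-dialects-MSA",
    "egy": "CAMeL-Lab/camelparser-dialects-EGY",
    "glf": "CAMeL-Lab/camelparser-dialects-GLF",
    "msa-egy": "CAMeL-Lab/camelparser-dialects-MSA-EGY",
    "msa-glf": "CAMeL-Lab/camelparser-dialects-MSA-GLF",
    "egy-glf": "CAMeL-Lab/camelparser-dialects-EGY-GLF",
    "msa-egy-glf": "CAMeL-Lab/camelparser-dialects-MSA-EGY-GLF",
}

PREFIX = "catib-"


def resolve_alias(alias):
    """Map a model alias (with or without the catib- prefix) to its Hugging Face repo."""
    name = alias.strip()
    if name.startswith(PREFIX):
        name = name[len(PREFIX):]
    if name not in MODEL_REPOS:
        raise ValueError(f"unknown model alias: {alias!r}")
    return MODEL_REPOS[name]


print(resolve_alias("catib-msa-egy-glf"))
```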

Parse a raw text file:

python dialect_parse_cli.py -i input.txt -f text -m msa-egy-glf > output.conllx

Parse a string:

python dialect_parse_cli.py -s "جامعة نيويورك تنشر أطلس." -f text -m msa-egy-glf

Parse cleaned, whitespace-tokenized text:

python dialect_parse_cli.py -i input_tokenized.txt -f preprocessed_text -m glf > output.conllx

Parse already-tokenized text without POS tagging or feature generation:

python dialect_parse_cli.py -s "جامعة نيويورك تنشر أطلس ." -f tokenized -m egy

Parse an existing CoNLL/CoNLL-X file:

python dialect_parse_cli.py -i input.conllx -f conll -m msa-egy > output.conllx
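
To post-process the parser output, a small reader along these lines can pull out the dependency columns. It assumes the output is standard tab-separated CoNLL-X (ID, FORM, ..., HEAD, DEPREL in columns 1, 2, 7, and 8) with blank lines between sentences and optional comment lines starting with #; the function name and the sample rows are illustrative, not actual parser output.

```python
def read_conllx(text):
    """Yield sentences as lists of (id, form, head, deprel) tuples."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # skip comment lines, if the file has any
        cols = line.split("\t")
        if len(cols) < 8:
            raise ValueError(f"expected at least 8 tab-separated columns: {line!r}")
        sentence.append((int(cols[0]), cols[1], int(cols[6]), cols[7]))
    if sentence:
        yield sentence  # the last sentence may lack a trailing blank line


# Hand-written sample rows with placeholder tokens.
sample = (
    "1\tw1\t_\tNOM\tNOM\t_\t2\tSBJ\t_\t_\n"
    "2\tw2\t_\tVRB\tVRB\t_\t0\t---\t_\t_\n"
    "\n"
)

for sent in read_conllx(sample):
    for tok_id, form, head, deprel in sent:
        print(tok_id, form, head, deprel)
```

To read a real file, pass its contents instead of the sample, e.g. `read_conllx(open("output.conllx", encoding="utf-8").read())`.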

The CLI downloads the selected model from Hugging Face if it is not already present under models/.
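
The download-if-missing behaviour can be approximated in your own code as sketched below. This is not the CLI's implementation: the directory layout (one folder per repository name under models/) and the function name are assumptions, and the default downloader relies on `snapshot_download` from the huggingface_hub package.

```python
from pathlib import Path


def ensure_model(repo_id, models_dir="models", download=None):
    """Return a local directory for repo_id, downloading only if it is missing or empty.

    The models/<repo name> layout is an assumption made for this sketch.
    """
    target = Path(models_dir) / repo_id.split("/")[-1]
    if target.is_dir() and any(target.iterdir()):
        return target  # already present, so skip the download
    if download is None:
        # Imported lazily so the cached path works without huggingface_hub installed.
        from huggingface_hub import snapshot_download
        download = snapshot_download
    download(repo_id=repo_id, local_dir=str(target))
    return target
```

Calling `ensure_model("CAMeL-Lab/camelparser-dialects-MSA-EGY-GLF")` would fetch the recommended checkpoint on first use and reuse the local copy afterwards. The `download` parameter exists so the function can be exercised without network access.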

📚 Data

The models are trained on combinations of the treebanks listed below.

The preprocessed data can be extracted using muddler. Once it is installed with pip install muddler, extract the muddled files provided under the data/ directory by following the steps for each treebank:

  • CamelTB (MSA):

    1. Download camel_treebank_1.1.zip from:

    2. Run the following command to unlock the muddled file.

      muddler unmuddle -s camel_treebank_1.1.zip -m data/CamelTB.zip.muddle data/CamelTB.zip
    3. Unzip the file with unzip data/CamelTB.zip -d data

  • PATB (Penn Arabic Treebank):

    1. Download the following files from the following LDC releases:

    2. Place them in a directory, e.g., ldc_files/

    3. Run the following command to unlock the muddled file.

      muddler unmuddle -s ldc_files -m data/PATB.zip.muddle data/PATB.zip
    4. Unzip the file with unzip data/PATB.zip -d data

  • ARZTB (Egyptian Arabic Treebank):

    1. Download bolt_arz-df_LDC2018T23.tgz from:

    2. Run the following command to unlock the muddled file.

      muddler unmuddle -s bolt_arz-df_LDC2018T23.tgz -m data/arz_data.zip.muddle data/arz_data.zip
    3. Unzip the file with unzip data/arz_data.zip -d data

  • CamelTB-Gumar (Gulf Arabic):

    1. Download CamelTB-Gumar.1.0.zip from:

    2. Run the following command to unlock the muddled file.

      muddler unmuddle -s CamelTB-Gumar.1.0.zip -m data/CamelTB-Gumar_data.zip.muddle data/CamelTB-Gumar_data.zip
    3. Unzip the file with unzip data/CamelTB-Gumar_data.zip -d data
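
The per-treebank steps follow one pattern, so they can be scripted once the source files have been downloaded. The sketch below only assembles the muddler and unzip commands from the file names given above; the `TREEBANKS` table and function names are illustrative, and it assumes both tools are on your PATH and that you run it from the repository root.

```python
import subprocess

# (source archive or directory, muddled file, output zip), copied from the steps above.
TREEBANKS = {
    "CamelTB": ("camel_treebank_1.1.zip",
                "data/CamelTB.zip.muddle", "data/CamelTB.zip"),
    "PATB": ("ldc_files",
             "data/PATB.zip.muddle", "data/PATB.zip"),
    "ARZTB": ("bolt_arz-df_LDC2018T23.tgz",
              "data/arz_data.zip.muddle", "data/arz_data.zip"),
    "CamelTB-Gumar": ("CamelTB-Gumar.1.0.zip",
                      "data/CamelTB-Gumar_data.zip.muddle", "data/CamelTB-Gumar_data.zip"),
}


def extraction_commands(name):
    """Return the unmuddle and unzip commands for one treebank as argument lists."""
    source, muddled, out_zip = TREEBANKS[name]
    return [
        ["muddler", "unmuddle", "-s", source, "-m", muddled, out_zip],
        ["unzip", out_zip, "-d", "data"],
    ]


def extract(name, run=subprocess.run):
    """Run both steps for one treebank, stopping at the first failure."""
    for cmd in extraction_commands(name):
        run(cmd, check=True)


# Print the commands instead of running them, as a dry run.
for name in TREEBANKS:
    for cmd in extraction_commands(name):
        print(" ".join(cmd))
```

Calling `extract("ARZTB")` would run the two commands for the Egyptian treebank; the other names work the same way.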

📖 Citation

If you use this model, please cite:

@inproceedings{Elshabrawy:2026:camelparser-dialects,
    title = "{Parsing Arabic Dialects Revisited: New Benchmarks, Models, and Insights}",
    author = {Ahmed Elshabrawy and
              Go Inoue and
              Muhammed AbuOdeh and
              Nizar Habash},
    booktitle = {Proceedings of The 7th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT)},
    year = "2026",
    address = "Palma, Spain"
}
