Arabic weblogs wiki

 

Arabic dialects identification in Blogs


Arabic dialects identification in Arabic blogs

 


 

Introduction:

 

The Arabic blogs contain posts in several languages. As shown in section 3.5.3, most posts in the blogs dataset are in Arabic, with the remainder in English and French. We also noticed that the Arabic posts themselves are written in several varieties of Arabic. Most are in Modern Standard Arabic; the rest are in dialects such as Egyptian Arabic, Gulf Arabic, and Levantine Arabic.

 

  • Modern Standard Arabic (MSA) is the formal Arabic that is written and spoken throughout the contemporary Arab world. It is the language of the news media, intellectual life, and literature. MSA is the official language of all Arab countries. It is the only form of Arabic taught in schools at all stages.

 

  • Arabic dialects are national or regional varieties of Arabic that are spoken daily across the Arab region. Each dialect is learned as a first language. Dialects sometimes differ enough from one another to be mutually incomprehensible. They are not typically written, although a certain amount of literature (particularly plays and poetry) exists in many of them.

  

After looking at samples from the blogs dataset, we noticed that the number of posts in Egyptian Arabic (EGY) is considerably higher than the number of posts in other dialects. Arabic dialects are thus used regularly in blog posts, and Egyptian Arabic is one of the major dialects in the samples we examined.

 

To quantify this observation, we counted how many posts in a sample of 500 blog posts are written in EGY. We considered only posts written in MSA or EGY and ignored posts in other Arabic dialects. After labeling and counting the posts, we found that 13% of the 500 posts are in Egyptian Arabic (Figure 12).

 

 

Figure 12: The percentage of posts in EGY and MSA found in a selection of 500 blog posts

 



The presence of posts in Arabic dialects, especially Egyptian Arabic, and the difficulty of manually labeling a large number of blog posts written in a variety of dialects motivated the following question:

  • How can Arabic dialects in blog posts be identified automatically?

To answer this question, we used machine learning methods to identify posts in Egyptian Arabic. We first built two corpora of blog posts: the first consists of posts in Egyptian Arabic, and the second contains posts in Modern Standard Arabic. We then ran a classification experiment to distinguish between posts in Egyptian Arabic and posts in Modern Standard Arabic.

 

 

Experimental settings and results

 

 

Preparing the dataset

The data for the experiment was collected from the blogs dataset. The corpus of posts was constructed in the following steps:

 

  • Parsing the blogs dataset

 

The blogs dataset is stored in XML, so the posts must be extracted by parsing. We chose the JDOM parser [59] for this task because it is an easy-to-use, Java-based solution for accessing data in XML.
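As a rough illustration of this extraction step, the sketch below parses a small XML document and collects the post texts. It uses Python's standard `xml.etree.ElementTree` module rather than JDOM, and the element names (`blog`, `post`, `title`, `content`) are hypothetical, since the actual schema of the blogs dataset is not described here.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real dataset's schema may differ.
SAMPLE_XML = """<blog url="http://example.org/blog">
  <post id="1"><title>First</title><content>مش عارف اعمل ايه</content></post>
  <post id="2"><title>Second</title><content>ذهبت إلى المدرسة صباحا</content></post>
  <post id="3"><title>Empty</title><content>   </content></post>
</blog>"""


def extract_posts(xml_text):
    """Return (post id, text) pairs for every non-empty <post> element."""
    root = ET.fromstring(xml_text)
    posts = []
    for post in root.iter("post"):
        text = (post.findtext("content") or "").strip()
        if text:  # skip posts without usable text
            posts.append((post.get("id"), text))
    return posts


posts = extract_posts(SAMPLE_XML)
```

Skipping empty posts at this stage keeps them out of the later language identification and labeling steps.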



  • Building Arabic posts corpus

The blogs dataset contains blogs in different languages (see section 3.5.3). We therefore used the TextCat language categorizer [48] to identify posts in Arabic (including Arabic dialects) and to build the Arabic posts corpus.

Both the MSA posts corpus and the EGY posts corpus were built manually, because the TextCat language categorizer cannot distinguish between posts in EGY and posts in MSA. Moreover, labeling posts as EGY or MSA requires knowledge of both Standard Arabic and the Egyptian dialect. The table below shows the number of posts in each corpus:
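TextCat is generally described as an implementation of character n-gram profiles compared with an "out-of-place" rank distance. The sketch below illustrates that idea under stated assumptions: the training snippets are toy examples, the n-gram length and profile size are arbitrary choices, and none of the function names come from TextCat itself.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top_k=300):
    """Rank the most frequent character n-grams (length 1..max_n) in text."""
    counts = Counter()
    for token in text.split():
        padded = "_" + token + "_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    ranked = [gram for gram, _ in counts.most_common(top_k)]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile, max_penalty=300):
    """Sum of rank differences; n-grams missing from the language profile
    receive a fixed maximum penalty."""
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )


def guess_language(text, profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


# Toy training snippets (assumptions); real profiles need far more text.
TRAINING = {
    "arabic": "هذا نص عربي قصير عن المدونات واللغة العربية في العالم العربي",
    "english": "this is a short english text about blogs and the english language",
    "french": "ceci est un court texte en français sur les blogs et la langue française",
}
PROFILES = {lang: ngram_profile(text) for lang, text in TRAINING.items()}

label = guess_language("كتبت اليوم عن المدونات العربية", PROFILES)
```

Because MSA and EGY share the same script and most of their character n-grams, a profile of this kind separates Arabic from English or French easily but is much less suited to telling the two Arabic varieties apart.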

 

Table 8: The number of posts in the MSA and EGY posts corpora

 

Corpus              Number of posts
EGY blogs corpus    500
MSA blogs corpus    700



Preparing the feature set

 

The EGY blogs corpus was used to extract the classification features. To do so, we built language models from the EGY blogs corpus and used Egyptian Arabic word unigrams and bigrams as features.

These features were selected manually, because only words and word pairs that do not occur in MSA qualify. In this way, we identified 500 features that distinguish Egyptian Arabic from Modern Standard Arabic. The following table shows some of these features:
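A minimal sketch of how unigram and bigram candidates could be counted is given below. The posts and the MSA word list are toy assumptions, and the automatic filter against an MSA vocabulary is only an illustration of how candidates might be shortlisted for review; it is not the selection procedure used in this experiment.

```python
from collections import Counter


def word_ngrams(tokens, n):
    """Return the word n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def candidate_features(posts, msa_vocabulary, top_k=10):
    """Count unigrams and bigrams, dropping candidates found in an MSA word list."""
    counts = Counter()
    for post in posts:
        tokens = post.split()
        counts.update(word_ngrams(tokens, 1))
        counts.update(word_ngrams(tokens, 2))
    filtered = {g: c for g, c in counts.items() if g not in msa_vocabulary}
    return Counter(filtered).most_common(top_k)


# Toy inputs (assumptions), not taken from the actual corpora.
egy_posts = ["انا مش عارف", "مش عارف ليه", "هو مش هنا"]
msa_words = {"انا", "هو", "هنا"}

candidates = candidate_features(egy_posts, msa_words)
```

Ranking by frequency puts the most common dialect markers first, which matters when only a small number of features is kept.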

 

 

Table 9: Some features representing the EGY posts corpus

 

مش - بس - اللى - ده - اللي - دى - علشان - دا - كده - زى - إللي - شوية - برضه - دي - مش عارف - عشان - كدة - انا مش - عايز - زي - لسه - بتاع - محدش - شويه - كويس - النهاردة - دلوقتى - عاوز - ومش - ازاى - بس مش - بره - معاك - معاه - واحنا - معاهم - بتوع - دة - اشوف - مافيش - هوه - لياصاحى - ابويا - معلش - واللى - رايح - مفيش - زى كل - ليهم - ورا - بيعمل - جت - جه - تشوف - بيتكلم - بيقول - إللى - دلوقتي - تيته - وجيت - وده - شايف - أنا مش - امبارح - بتاعت - عايشه - معرفش - بعدين - مره - اخويا - عايش - بص - عايزة - الكلام ده - مش بس - عاوزين - بيحب - اوى - إزاى - جايين - عايزين - طبعن - شوف - بلاش - زى ما - يعنى مش - معلش يا - مش قادر - انت مش - كده و - جنبنا - ولسه - لسة أشوف - شوفت - لابس - وإللي - وبس - تبص

 

 

Table 9 shows some of the features used for classification; each is a word or word pair in Egyptian Arabic. Among these features, we can distinguish specific verbal forms (عاوزين, تبص, وجيت), nominal forms (دلوقتى, مافيش, بلاش, بتاع), and function words such as the negation particle, demonstratives, and pronouns (مش, دي, هوه).
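To make the role of these features concrete, the sketch below turns a post into a vector with one entry per feature. It assumes whitespace tokenization and raw occurrence counts; the text does not state whether counts or binary presence values were used, so this is only one plausible encoding.

```python
def count_features(post, features):
    """Count how often each unigram or bigram feature occurs in a post."""
    tokens = post.split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    observed = tokens + bigrams
    return [observed.count(feature) for feature in features]


# A few features from Table 9; a full run would use all 500.
FEATURES = ["مش", "ده", "عشان", "مش عارف", "الكلام ده"]

vector = count_features("انا مش عارف الكلام ده عشان مش فاهم", FEATURES)
```

A post written entirely in MSA would typically map to a vector of zeros, which is what makes these features discriminative.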

 

 

Experimental settings and results

 

We began the classification experiment by assembling the dataset. The MSA corpus contributed 70% of the dataset and the EGY corpus 30%. The EGY share is smaller because part of the EGY corpus was used to build the feature set. We then chose the classification tools for the experiments, which were provided by the Weka toolkit [60, 61].

 

For the classification, we used a dataset of 1000 labeled blog posts. We used the Weka implementation of the Sequential Minimal Optimization (SMO) algorithm to train a support vector classifier with an RBF kernel [62, 63]. We chose this method because SVMs are among the best-performing methods for text classification tasks [64].
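For readers unfamiliar with the RBF kernel, the sketch below computes it for two feature vectors as exp(-gamma * ||x - y||^2). The value of `gamma` and the example vectors are assumptions; the kernel parameters used in the experiment are not reported here.

```python
import math


def rbf_kernel(x, y, gamma=0.01):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and y)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Toy feature-count vectors (assumptions), e.g. counts of five EGY features.
egy_like = [2, 1, 1, 1, 1]
msa_like = [0, 0, 0, 0, 0]

same = rbf_kernel(egy_like, egy_like)       # identical vectors
different = rbf_kernel(egy_like, msa_like)  # vectors with no shared features
```

The kernel equals 1 for identical vectors and decays toward 0 as they move apart, so posts with similar feature counts are treated as similar by the classifier.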

 

To measure performance, we ran classification tests using 10-fold cross-validation. For each number of features, we repeated the test 10 times and averaged the results. The table below shows these results:

 


Table 10: The results of MSA/EGY posts classification

 

Number of features    Correct classification
500                   94.42%
450                   94.39%
400                   94.23%
350                   94.01%
300                   93.71%
250                   93.06%
200                   91.75%
150                   90.36%
100                   87.65%
50                    84.20%
40                    81.81%
30                    79.94%
20                    78.21%
10                    73.74%
5                     70.34%

 

 

The highest rate of correct classification is 94.42%, achieved with the maximum number of features (500). The rate drops to 70.34% when only five features are used.
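The evaluation procedure behind these figures (10-fold cross-validation repeated 10 times, then averaged) can be sketched as follows. The data and the threshold rule standing in for the SVM are toy assumptions, and the function names are hypothetical; only the fold-and-average structure mirrors the description above.

```python
import random


def k_fold_indices(n_items, k, rng):
    """Shuffle item indices and split them into k nearly equal folds."""
    indices = list(range(n_items))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


def repeated_cross_validation(items, labels, train_and_score, k=10, repeats=10, seed=0):
    """Average accuracy over `repeats` runs of k-fold cross-validation."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        for test_idx in k_fold_indices(len(items), k, rng):
            held_out = set(test_idx)
            train = [(items[i], labels[i]) for i in range(len(items)) if i not in held_out]
            test = [(items[i], labels[i]) for i in test_idx]
            scores.append(train_and_score(train, test))
    return sum(scores) / len(scores)


def threshold_classifier(train, test):
    """Stand-in for the SVM: ignores the training data and labels a post EGY
    if it contains at least one EGY feature."""
    correct = sum(("EGY" if hits > 0 else "MSA") == label for hits, label in test)
    return correct / len(test)


# Toy data (assumption): each item is the number of EGY feature hits in a post.
# The eighth post in the pattern is an MSA post containing one dialect word.
items = [3, 0, 2, 0, 0, 5, 0, 1, 0, 0] * 10
labels = ["EGY", "MSA", "EGY", "MSA", "MSA", "EGY", "MSA", "MSA", "MSA", "MSA"] * 10

accuracy = repeated_cross_validation(items, labels, threshold_classifier)
```

Reshuffling before each repetition changes which posts share a fold, so averaging over the repetitions reduces the effect of any single lucky or unlucky split.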

 

 

Figure 13: The correct classification of MSA/EGY posts (Graph)



The plot shows that the percentage of correctly classified EGY and MSA posts increases with the number of features (Figure 13). However, performance improves very little once the number of features exceeds 400. It is therefore unlikely that correct classification would increase much if more than 500 features were used.
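The flattening of the curve can be quantified directly from Table 10. The short sketch below computes the accuracy gain per added feature for each step between consecutive rows; the values are copied from the table, and the helper name is hypothetical.

```python
# Accuracy (%) by number of features, copied from Table 10.
RESULTS = [
    (5, 70.34), (10, 73.74), (20, 78.21), (30, 79.94), (40, 81.81),
    (50, 84.20), (100, 87.65), (150, 90.36), (200, 91.75), (250, 93.06),
    (300, 93.71), (350, 94.01), (400, 94.23), (450, 94.39), (500, 94.42),
]


def gain_per_feature(results):
    """Accuracy gain in percentage points per added feature, for each step."""
    gains = []
    for (n_prev, acc_prev), (n_next, acc_next) in zip(results, results[1:]):
        gains.append((n_prev, n_next, (acc_next - acc_prev) / (n_next - n_prev)))
    return gains


gains = gain_per_feature(RESULTS)
first_step = gains[0]   # 5 -> 10 features
last_step = gains[-1]   # 450 -> 500 features
```

Going from 5 to 10 features adds about 0.68 percentage points per feature, whereas the last 100 features (400 to 500) add only 0.19 points in total.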

 

To summarize, we applied machine learning methods to identify Arabic dialects in blog posts, focusing on experiments that distinguish posts in EGY from posts in MSA. The automatic identification of EGY in blog posts achieves almost 95% correct classification (94.42% with 500 features). The methods used in this experiment could also be adopted in future experiments to identify other dialects in blog posts.

 
