Arabic weblogs wiki

 

Experiments with Arabic weblogs

Last edited by Wael 2 yrs ago


Introduction

  • This page presents the experiments we carried out on the Arabic weblogs dataset. We used several methods to analyze our collection; the sections below describe each experiment in detail and show the results we obtained.

 

Experiments

 

Experiment 1:

  • This experiment is specific to weblogs from Maktoob (an Arabic blog service provider). The idea came from looking at the structure of the blog pages hosted by this provider. Maktoob lets new subscribers to its blog service choose from a set of predefined categories and associate them with the topics to be discussed in the newly created blog. Maktoob offers new users the following 17 categories:

 

1 Personal
2 Politics & news
3 Art & culture
4 Books & literature
5 Movies, entertainment & TV
6 Faith & religion
7 Family & Friends
8 Finance & Business
9 Internet & software
10 Life & style
11 Music
12 Photo & design
13 Science & technology
14 Sports
15 Travel
16 General
17 Women

 

 

Blog categories suggested in advance for new subscribers
  • With these categories defined, we want to know which topics Arab bloggers choose when they start their blogs, and whether this practice can help us with further experiments. We built a category vector (of dimension 17) for each weblog in our collection. Each component of a weblog vector is 1 if the weblog falls within the corresponding category and 0 otherwise. For example, if weblog w1 is assigned the categories {Personal, Politics & news, General}, its vector has the following form: w1 = {1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0}
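The construction of a weblog vector can be sketched as follows. This is only an illustration: the function name `category_vector` and the exact spelling of the category labels are our assumptions, and the order of the components follows the numbered list above.

```python
# Maktoob's 17 predefined categories, in the order of the numbered list above.
CATEGORIES = [
    "Personal", "Politics & news", "Art & culture", "Books & literature",
    "Movies, entertainment & TV", "Faith & religion", "Family & Friends",
    "Finance & Business", "Internet & software", "Life & style", "Music",
    "Photo & design", "Science & technology", "Sports", "Travel",
    "General", "Women",
]


def category_vector(assigned):
    """Return a 17-dimensional 0/1 vector for the categories assigned to a weblog.

    `category_vector` is a hypothetical helper, not part of the original tooling.
    """
    wanted = {name.lower() for name in assigned}
    unknown = wanted - {c.lower() for c in CATEGORIES}
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return [1 if c.lower() in wanted else 0 for c in CATEGORIES]


# The weblog w1 from the example: Personal, Politics & news, General.
w1 = category_vector(["Personal", "Politics & news", "General"])
print(w1)
```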

 

  • In this experiment we visualize our weblog vectors to see how they are distributed when projected in the Rad visualizer:

 

 

 

Fig. 1 shows the distribution of weblogs in the multi-category radial circle; red dots represent weblogs.

We can conclude from Fig. 1 that a high density of weblogs (red dots) is concentrated in the center of the category circle, which means that most bloggers assign all or most of the possible categories to their weblogs.

Another important observation is the large mass of weblogs lying between the radii CAT_GENERAL and CAT_BOOKS_LETTARATURE, which means that bloggers in the Maktoob domain are, in general, more interested in certain topics: personal, general, politics & news, art & culture, women and, finally, literature.
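To illustrate why a weblog with all categories assigned ends up in the center of the circle, here is a minimal sketch. It assumes that the visualizer uses the usual RadViz-style projection (each category is an anchor on the unit circle, and a weblog is drawn at the weighted average of the anchors of its categories); we have not verified that this is exactly what the tool does, and `radviz_point` is a hypothetical helper.

```python
import math


def radviz_point(vector):
    """Project a 0/1 category vector onto the plane, RadViz style (assumed)."""
    n = len(vector)
    total = sum(vector)
    if total == 0:
        # No category assigned: conventionally placed at the center.
        return (0.0, 0.0)
    # Anchor j sits on the unit circle at angle 2*pi*j/n.
    x = sum(v * math.cos(2 * math.pi * j / n) for j, v in enumerate(vector)) / total
    y = sum(v * math.sin(2 * math.pi * j / n) for j, v in enumerate(vector)) / total
    return (x, y)


all_categories = [1] * 17        # every category assigned
single_category = [1] + [0] * 16  # only the first category assigned

print(radviz_point(all_categories))   # numerically at the center
print(radviz_point(single_category))  # on the first anchor
```

Under this assumption, the more categories a weblog has spread evenly around the circle, the closer it is drawn to the center, which agrees with the reading of Fig. 1 given above.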

 

Experiment 2:

In the second experiment we analyse the importance of language in our weblog collection. We have shown earlier that the posts of our weblogs are written in different languages, so for each weblog we count the number of posts in each language. The languages we are concerned with are the ones that are significant for our collection: Arabic, English and French.

We label as 'other_posts' the posts whose text does not seem to belong to any of the three languages mentioned above, or posts that contain links, code, etc.
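The counting step can be sketched as below. The sketch assumes that each post has already been assigned a language label by some language identification step, which is not shown; the function name `posts_per_language` and the lowercase labels are illustrative assumptions.

```python
from collections import Counter, defaultdict

# The three languages that are significant for the collection.
LANGUAGES = ("arabic", "english", "french")


def posts_per_language(posts):
    """Count posts per language for each weblog.

    `posts` is an iterable of (weblog_id, detected_language) pairs.
    Any label outside LANGUAGES (including None) is counted as "other".
    """
    counts = defaultdict(Counter)
    for weblog_id, language in posts:
        label = language if language in LANGUAGES else "other"
        counts[weblog_id][label] += 1
    # Return a fixed set of four keys per weblog, with zeros where needed.
    return {
        weblog: {key: c[key] for key in (*LANGUAGES, "other")}
        for weblog, c in counts.items()
    }


# Made-up sample data, for illustration only.
sample = [
    ("w1", "arabic"), ("w1", "arabic"), ("w1", "english"),
    ("w2", "french"), ("w2", None), ("w2", "arabic"),
]
table = posts_per_language(sample)
print(table)
```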

 

 

 

Y: represents the number of posts

X: represents the number of weblogs in the collection

 

Fig. 2 shows the distribution of posts per weblog by the language they are written in

 

From Fig. 2 we see that Arabic posts are the most frequent across the whole distribution, followed by English and French, in that order. We also note that bloggers often use means other than text in their posts, for example images or links to other resources. These posts are shown in yellow along the distribution.

 

Experiment 3:

Taking the result of experiment 2 into consideration, we decided to apply unsupervised learning methods to our weblog collection. Specifically, we ran an experiment to cluster the weblogs by the languages their posts are written in. We defined a set of 4 features:

  • ARABIC_POST
  • ENGLISH_POST
  • FRENCH_POST
  • OTHER_POST

Next, we built a training set and ran the simple k-means algorithm to cluster our weblog collection with various settings. The results are shown in the figure below.

 

 

The experiment shows that the within-cluster sum of squared errors drops significantly, from 19 with one cluster to around 4 with ten clusters, and finally to 0.8 with 50 clusters.

We can clearly see that when we increase the number of clusters from 30 to 50, the sum of squared errors decreases by only about 0.2. We attribute this to a small number of weblogs that have a unique combination of posts in different languages, and to weblogs whose properties are far from the most common ones.
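The relation between the number of clusters and the within-cluster sum of squared errors can be reproduced with a small k-means sketch. This is not the tool we used for the experiment: the implementation below is a plain textbook k-means, the toy data is made up, and we assume here that each weblog is described by the share of its posts in each of the four features (ARABIC_POST, ENGLISH_POST, FRENCH_POST, OTHER_POST).

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=50, seed=0):
    """Plain k-means; returns (centroids, within-cluster sum of squared errors)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids: k distinct data points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster (keep it if empty).
        new_centroids = [
            [sum(col) / len(c) for col in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    sse = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, sse


# Toy weblogs: [ARABIC_POST, ENGLISH_POST, FRENCH_POST, OTHER_POST] shares.
weblogs = [
    [1.0, 0.0, 0.0, 0.0], [0.9, 0.1, 0.0, 0.0], [0.8, 0.0, 0.0, 0.2],
    [0.1, 0.9, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.1, 0.9, 0.0], [0.1, 0.0, 0.8, 0.1],
    [0.2, 0.0, 0.0, 0.8],
]

for k in (1, 2, 4, 8):
    _, sse = kmeans(weblogs, k)
    print(k, round(sse, 3))
```

With as many clusters as distinct weblogs the error reaches zero, which is why the last few clusters mostly isolate weblogs with unusual language mixes.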

 

Experiment 4:

Posting rates graph (per day): tracking the posting behavior during the war in Lebanon

Notice that the number of posts per day in our collection tripled when Israel launched a military operation in Lebanon. The operation started on July 12th. Immediately afterwards, from the 14th to the 20th of July, we clearly see (Fig. 3) that bloggers reacted strongly to the event, which made the number of posts per day markedly higher.

 

Fig. 3 shows the number of posts per day from July 1st to August 16th, 2006

 

In the fourth experiment, we want to go further with the results shown in Fig. 3.

We will show that this large increase in posts per day within the collection is closely related to posts that contain discussions and views concerning the war in Lebanon.

First, we collect the posts published on each day of this period. Second, we build a corpus of posts for each day. Third, we calculate the frequency of terms related to the event.

To build an appropriate set of terms for the war, we looked at the most frequent terms during the war period and formed a set from them. We chose a set of 6 terms:

 

  • لبنان (Lebanon)
  • إسرائيل (Israel)
  • حزب الله (Hezbollah)
  • الحرب (War)
  • المقاومة (Resistance)
  • الجنود (Soldiers)
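The three steps above (collect posts per day, build a per-day corpus, count the event terms) can be sketched as follows. The function name `daily_term_frequencies` and the sample posts are illustrative assumptions. The sketch counts plain substring occurrences, so it does no Arabic normalization or tokenization, which a real run would probably need.

```python
from collections import Counter, defaultdict
from datetime import date

# The 6 terms chosen for the war period (see the list above).
WAR_TERMS = ["لبنان", "إسرائيل", "حزب الله", "الحرب", "المقاومة", "الجنود"]


def daily_term_frequencies(posts, terms=WAR_TERMS):
    """Return {day: Counter(term -> number of occurrences)}.

    `posts` is an iterable of (day, text) pairs. Occurrences are counted as
    raw substrings, a simplification: inflected or differently spelled forms
    are not merged.
    """
    per_day = defaultdict(Counter)
    for day, text in posts:
        for term in terms:
            per_day[day][term] += text.count(term)
    return dict(per_day)


# Made-up posts, for illustration only.
posts = [
    (date(2006, 7, 1), "يوم عادي في المدونة"),
    (date(2006, 7, 14), "الحرب على لبنان مستمرة و لبنان يقاوم"),
    (date(2006, 7, 14), "حزب الله و إسرائيل"),
]

freq = daily_term_frequencies(posts)
for day in sorted(freq):
    print(day.isoformat(), dict(freq[day]))
```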

 

Now we look at how the frequency of these terms evolves during the war period:

 

 

Fig. 4 shows the term frequencies per day during the war period

Looking at Fig. 4, we observe a drastic increase in term frequencies per day during the war period. The frequencies jumped from zero at the beginning of June to 130 after the war started, reaching their maximum during the last 2 weeks of July.

Notice that the frequencies of the terms war, Hezbollah, Lebanon and Israel increase at the same time, which indicates that the posts discussing war relate it to Lebanon, Hezbollah and Israel.

 

Experiment 5:

  • In this experiment we try to find out whether the Arabic weblogs contain posts written in dialectal Arabic. During a random preview of the blog collection we noticed that some posts are written in Egyptian Arabic. This raised two questions: can we show that our blog posts contain Arabic dialects, and can we identify dialectal posts using machine learning methods? To answer these questions we conducted the following experiment. We collected 200 posts written in Egyptian Arabic, taken mainly from blogs listed in the Egyptian blog count. We selected these posts manually, relying on our ability to distinguish Egyptian Arabic (EGY) from Modern Standard Arabic (MSA). We then built unigram and bigram tables with normalized term frequencies for the EGY corpus. The result was a large list of unigrams and bigrams ordered by decreasing normalized TF.
  • Our aim was to build a feature set of the most important unigrams and bigrams for the EGY corpus and use it to detect Egyptian posts in our collection. We therefore filtered the n-gram table, keeping only the terms that are specific to Egyptian Arabic and are not used in standard Arabic. This gave a mixed table of terms containing the top 500 Egyptian Arabic unigrams and bigrams.
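The construction of the n-gram table can be sketched as follows. Two points are assumptions on our side: the normalization (here, the count of an n-gram divided by the total number of n-grams of the same order in the corpus) and the whitespace tokenization. The manual filtering of terms that are not specific to Egyptian Arabic is not reproduced, and the sample posts are made up.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as space-joined strings) of a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def normalized_tf(posts, n):
    """Normalized term frequency of every n-gram of order n in the corpus."""
    counts = Counter()
    for text in posts:
        counts.update(ngrams(text.split(), n))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def top_features(posts, size):
    """Mixed unigram/bigram list ordered by decreasing normalized TF."""
    table = {**normalized_tf(posts, 1), **normalized_tf(posts, 2)}
    return sorted(table, key=lambda g: (-table[g], g))[:size]


# Tiny made-up EGY corpus, for illustration only.
egy_posts = ["انا مش عارف", "مش عارف ليه", "بس مش كده"]

for gram in top_features(egy_posts, 5):
    print(gram)
```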

 

 

Features 1-100, Features 101-200, Features 201-300, Features 301-400 and Features 401-500 are listed below in five consecutive groups of 100 terms, one term per line.

Features 1-100
مش

بس

اللى

ده

اللي

دى

علشان

دا

كده

زى

إللي

شوية

برضه

دي

مش عارف

عشان

كدة

انا مش

عايز

زي

لسه

بتاع

محدش

شويه

كويس

النهاردة

دلوقتى

عاوز

ومش

ازاى

بس مش

بره

معاك

معاه

واحنا

معاهم

بتوع

دة

اشوف

مافيش

هوه

ليا

صاحى

ابويا

معلش

واللى

رايح

مفيش

زى كل

ليهم

بدأ

ورا

بيعمل

جت

جه

تشوف

بيتكلم

بيقول

إللى

دلوقتي

تيته

وجيت

وده

شايف

أنا مش

امبارح

بتاعت

عايشه

معرفش

بعدين

مره

اخويا

عايش

بص

عايزة

الكلام ده

مش بس

عاوزين

بيحب

اوى

إزاى

جايين

عايزين

طبعن

شوف

بلاش

زى ما

يعنى مش

معلش يا

مش قادر

انت مش

كده و

جنبنا

ولسه

لسة

أشوف

شوفت

لابس

وإللي

وبس

Features 101-200

تبص

بعد كدة

بس انت

بس انا

بعد كده

ده مش

مش ممكن

انا عاوز

بس علشان

برضه مش

مش عارفة

هيبقى

هيه

المية

ازيك

يشوف

جوه

برده

يابو

خليك

كورة

عايزه

بس كده

اللى اتقال

اللى مش

هو ده

مش كده

ده غير

مين اللى

علشان كده

اللى عايز

بس دى

كل ده

ليه مش

زي ما

بنتكلم

ماشي

نشوف

الحته

ماكنتش

يبص

برس

وقالى

بتاعته

وش

ينفعش

باين

الكبايه

العيال

بحاول

ابوك

أوى

بيعرف

ورايا

ياد

ازاي

وانته

تتقال

برة

انهاردة

دلوقت

مالوش

حيبقى

جاى

بد

بتاعتهم

وف

ميعرفوش

شايفنى

دول مش

دا مش

البتاع دا

يالا علشان

انا ايوا

بس بعد

بس دا

ما فيش

مش زي

طب ليه

انا كده

انا عايز

واللا

ملوش

باصص

بتاعه

بيفرش

مالهاش

وشه

اخدتش

بيحصل

هتقول

يادى

الزملا

إزاي

دايماً

اقدرتش

بحس

يادرش

اديك

بيتقال

Features 201-300

وبس كنتش

ميعرفش

هتلاقيه

ماشيين

كانش

الكيف

بصت

مشفتهاش

بيكون

ومحدش

هتلاقي

ماليش

هيطلع

ييجي

هفضل

مالقيتش

ديه

راح مزعق

جابتهمش ولادة

آيه دا

اوى لما

كمان في

عن الي

مش واخد

مش عايز

مش حاسين

زي محمود

شايف ان

دا انا

بيعمل ايه

إللى فى

الصورة دي

الله مش

ان كدة

مصرى بس

اللى هما

عملية فَش

رقمك بأى

واحد زي

محدش قادر

مش انا

لسه كتير

يعلم بيها

مش عندك

ما تاخدش

اه دا

واللى ليه

لأ انا

وبتلاقي اللي

اصبح مش

مش لازم

وهو بيتكلم

استنى بس

بيسدد في

بنعمل حساب

كان بيعمل

ايه رايكم

كبير بس

بس محدش

بس في

ايه ده

دى زى

الراجل دة

ما تشوف

عفش مستعمل

مش على

طب ده

إعتبار إن

إللي إحنا

عايز انام

بس برضة

بس كان

للحاله دى

عشان كدا

دلوقتى مش

تانية مش

التجربة اللى

كنت شايف

دى مش

احنا فى

حاجه هتعملها

دى من

ما نقدرش

اهو برضة

و العيال

خلاص بس

مش مصرى

البوست دا

جبته و

ازاى قولت

ده انتوا

بره انا

دا فى

وجيت جايب

اديك تقول

مش هامشي

بس أنا

مش مناسب

كلام برضه

بره مصر

Features 301-400

رمضان بس

نفسك مش

باللي فيه

دماغها راحت

أيامكم دي

ولا دى

البنت اللى

بكون فيها

طب فين

أنا بس

مش دلوقتى

بوجى و

ماشى فى

ايوا يا

ما تعرفش

ان الكلام

دى مصر

عايز من

وخلاص اتعجنا

بس وأنا

مالوش دعوة

معرفش ليه

الطابور ده

اللي جنبك

على ده

مكان محدش

ماشى عكس

ما عملتش

لية في

يبص على

اللى شافها

علشان انا

الست اللي

مش موضوعنا

بجد مش

اللي فات

على عفش

اوى انا

دى انا

ما اكلتش

دايما يقول

ده بقه

وما ينفعش

العيب مش

انا كمان

ما جابتهمش

دول لو

مش سامع

مش هاكل

بتاع الجرايد

من بره

وتلاقي اللي

مش فاهم

كده بس

ده و

دلوقتى انا

دي بتكتب

ما فيهاش

رايح الشغل

مش موجود

لكن ده

الى بيتصل

برضه لازم

دي صورتها

اللى راح

الحالة اللى

عارف ايه

بس الى

وصحبى نتناقش

بيتغير

بغيظ

هشرب

الجايه

بيقف

لوحدى

بيضربوا

بيلبسو

هتلاقوا

جزورى

لاقية

السكه

شايله

يطنش

فيهاش

بيجيبوا

وهوه

بتيجى

باكل

بتاعتة

بتاعتى

بيشوفه

لكدة

السودة

والعيال

هتكون

بتاعكم

زيها

فقرر

مية

عيال

Features 401-500

جى

ياجدع

اللحاق

بتقوله

حاطين

عملتش

هيعرف

زيهم

وكده

الميه

مرضتش

هتعملها

حيقدروا

لية

بتتحرك

ماية

دقايق

زهقانة

إمبارح

علاقاته

واخده

واهو

بنعمل

ماكانش

ماسمعتش

العماره

بوليصة

بقعد

بيسمعوا

لزمة

هايل

هانعمل

بقم

بقاش

ودست

بقك

ماعرفش

هاكل

يجى

يجي

بيسدد

وعلشان

بتشوف

عايزاه

البادي

وتشوف

وازاى

بتتبهدل

وجاى

ترابيزة

شوفناها

جبته

ماينفعش

ساب

وماليش

انهارده

بقا

هتعمل

استنوا

بتمشى

بتمشي

بتاعى

وإيه

شايفاه

ممعكش

كويسة

بكون

بتجرى

بيقوله

والريس

مافيهوش

ريقى

تعرفش

بنبتدي

مكنتش

ومافيش

هعرف

جايب

بيقولى

واخد

مينفعش

الفيش

جابتهمش

بنعرف

سايد

للي

باللي

بتاعتها

بتدور

منطقه

لغه

خش

هويدي

تسعيرة

وبتبقى

حدوته

شايل

مزنوق

بيحبوا

مابقاش

 

 

Table 1. The 500 EGY unigram/bigram features

 

  • For this experiment we used a data set of 1000 blog posts labeled as EGY or MSA. We trained Support Vector Machines on this data and then ran a series of 10 test experiments with various settings. In each setting we tested with a decreasing number of features, from 500 down to 5; the features were selected randomly for each test. The results of classifying posts as MSA or EGY are as follows:

 

Number of features    Correctly classified instances    Incorrectly classified instances
500                   94.42%                            5.58%
450                   94.39%                            5.61%
400                   94.23%                            5.77%
350                   94.01%                            5.99%
300                   93.71%                            6.23%
250                   93.06%                            6.94%
200                   91.75%                            8.25%
150                   90.36%                            9.64%
100                   87.65%                            12.35%
50                    84.20%                            15.80%
40                    81.81%                            18.19%
30                    79.94%                            20.06%
20                    78.21%                            21.79%
10                    73.74%                            26.26%
5                     70.34%                            29.66%

 

 

Table 2. Results of EGY/MSA blog post classification.
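The preparation of the data for each test (random selection of a feature subset, then one vector per post) can be sketched as below. The sketch assumes binary presence/absence features and whitespace tokenization; the helpers `select_features` and `to_vector` are hypothetical. The SVM training itself is not shown, since it depends on the toolkit used.

```python
import random


def select_features(features, size, seed=None):
    """Randomly pick `size` features from the full list (without replacement)."""
    return random.Random(seed).sample(features, size)


def to_vector(text, features):
    """Binary vector: 1 if the unigram or bigram feature occurs in the post."""
    tokens = text.split()
    unigrams = set(tokens)
    bigrams = {" ".join(pair) for pair in zip(tokens, tokens[1:])}
    present = unigrams | bigrams
    return [1 if feature in present else 0 for feature in features]


# A handful of features taken from Table 1, for illustration only.
features = ["مش", "بس", "عشان", "مش عارف", "كده"]

subset = select_features(features, 3, seed=1)
vec = to_vector("انا مش عارف", features)
print(subset)
print(vec)
```

Each test in the table would then repeat this with a different subset size and feed the resulting vectors, together with the EGY/MSA labels, to the classifier.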

 

  • Table 2 shows that the best classification result, 94.42% correctly classified, was achieved with 500 features. The worst result, 70.34%, was obtained with only 5 features. The percentage of correctly classified posts increases with the number of features, but the gain shrinks: from 3.4 percentage points between the tests with 5 and 10 features to 0.03 percentage points between the tests with 450 and 500 features.

We can deduce from these results that increasing the number of features beyond 500 would hardly improve the percentage of correctly classified posts. In summary, machine learning methods applied to the automatic identification of dialectal Arabic blog posts solve this task with very good accuracy: they are able to distinguish between blog posts in MSA and EGY.

 

 

 

 

Experiment 7: Categorising Arabic blogs Collection

  • In this experiment we would like to develop methods that allow us to categorise the Arabic weblog collection. We focus on categorizing Arabic blog posts using categories from Arabic news websites. For this purpose we have chosen two Arabic news portals, Aljazeera (http://www.aljazeera.net) and BBC Arabic (http://www.bbcarabic.com). Tables 5 and 6 list the article categories of BBC Arabic and Aljazeera, respectively.

 

 
1 World
2 Middle East
3 Sport
4 Business
5 Science & Technology

 

 

Table5. BBCArabic.com categories

 

 
1 Arabic
2 International
3 Economy
4 Sport
5 Arts & Culture
6 Health & Medicine
7 Variety

 

 

Table6. Aljazeera.net categories

 
