Experiments with Arabic weblogs
Introduction
- This page presents the experiments we carried out on our Arabic weblogs dataset. We used several methods to analyze the collection. The following sections describe each experiment in detail and show the results we obtained.
Experiments
Experiment 1:
- This experiment is specific to weblogs from Maktoob (an Arabic blog service provider). The idea came from looking at the structure of the blog pages hosted by this provider. Maktoob lets new subscribers to its blog service choose from a given set of categories and associate them with the topics to be discussed in the newly created blog. Maktoob offers new users the following 17 categories:
| 1 | Personal |
| 2 | Politics & news |
| 3 | Art & culture |
| 4 | Books & literature |
| 5 | Movies, entertainment & TV |
| 6 | Faith & religion |
| 7 | Family & Friends |
| 8 | Finance & Business |
| 9 | Internet & software |
| 10 | Life & style |
| 11 | Music |
| 12 | Photo & design |
| 13 | Science & technology |
| 14 | Sports |
| 15 | Travel |
| 16 | General |
| 17 | Women |
Blog categories suggested in advance for new subscribers
- Having these categories defined, we want to know which topics Arab bloggers choose when they start their blogs, and to see whether this practice can help us in further experiments. We built a category vector (of dimension 17) for each weblog in our collection. Each component of a weblog vector is 1 if the weblog falls within the corresponding category and 0 otherwise. For example, if weblog w1 is assigned the categories {Personal, Politics & news, General}, its vector has the following form: w1 = {1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0}
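The construction of a category vector can be illustrated with a short Python sketch. This is only an illustration under our own assumptions: the function name `category_vector` and the exact spelling of the category strings are hypothetical, and the order of the components simply follows the numbering of the category table.

```python
# Category order follows the numbering of the Maktoob category table (1..17).
CATEGORIES = [
    "Personal", "Politics & news", "Art & culture", "Books & literature",
    "Movies, entertainment & TV", "Faith & religion", "Family & Friends",
    "Finance & Business", "Internet & software", "Life & style", "Music",
    "Photo & design", "Science & technology", "Sports", "Travel",
    "General", "Women",
]


def category_vector(assigned):
    """Return a 17-dimensional 0/1 vector for a set of assigned category names."""
    unknown = set(assigned) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return [1 if name in assigned else 0 for name in CATEGORIES]


# The example weblog w1 from the text: Personal, Politics & news, General.
w1 = category_vector({"Personal", "Politics & news", "General"})
print(w1)
```

Components 1, 2 and 16 are set for w1, matching the vector given in the text.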
- In this experiment we visualize the weblog vectors to see how they are distributed when projected in the Rad visualizer:
Fig. 1 shows the distribution of weblogs in the multi-category radial circle; red dots represent weblogs.
We can conclude from Fig. 1 that a large share of the weblogs (red dots) is concentrated at the center of the category circle, which means that most bloggers assign all or most of the available categories to their weblogs.
Another important observation from the figure is the large mass of weblogs lying between the radii CAT_GENERAL and CAT_BOOKS_LETTARATURE. This means that bloggers on the Maktoob domain are generally more interested in certain topics: personal, general, politics and news, art and culture, women, and finally literature.
Experiment 2:
In the second experiment we analyse the importance of language in our weblog collection. We have shown earlier that the posts in our weblogs are written in different languages, so for each weblog we count the number of posts in each language. We consider only the languages that are significant for our collection: Arabic, English and French.
We label as 'other_posts' those posts whose text does not seem to belong to any of the three languages mentioned above, as well as posts that contain links, code and the like.
Y: represents the number of posts
X: represents the number of weblogs in the collection
Fig. 2 shows the distribution of posts per weblog by the languages they contain.
From Fig. 2 we see that Arabic posts are the most frequent across the whole distribution of posts, followed by English and then French. We also note that bloggers in general use means other than text in their posts, for example images or links to other resources. The figure shows these posts in yellow along the distribution.
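The per-weblog counting behind this experiment can be sketched as follows. The sketch assumes that each post has already been assigned a language label by some earlier step that is not shown here. The label strings, the function name `language_counts` and the sample data are hypothetical.

```python
from collections import Counter

# Labels for the three significant languages plus the catch-all label.
LABELS = ("arabic", "english", "french", "other_posts")


def language_counts(posts):
    """Count posts per language for each weblog.

    posts: iterable of (weblog_id, label) pairs.
    Returns a dict mapping weblog_id to a Counter over LABELS.
    """
    counts = {}
    for weblog_id, label in posts:
        if label not in LABELS:
            label = "other_posts"  # anything unrecognized falls into the catch-all
        counts.setdefault(weblog_id, Counter())[label] += 1
    return counts


# Hypothetical sample: two weblogs with a handful of labeled posts.
sample = [
    ("w1", "arabic"), ("w1", "arabic"), ("w1", "english"),
    ("w2", "french"), ("w2", "links_only"),
]
for weblog_id, counter in language_counts(sample).items():
    print(weblog_id, [counter[label] for label in LABELS])
```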
Experiment 3:
Taking the results of experiment 2 into consideration, we decided to apply unsupervised learning methods to our weblog collection. More precisely, we ran an experiment to cluster the weblogs according to the languages they contain. We therefore defined a set of 4 features:
- ARABIC_POST
- ENGLISH_POST
- FRENCH_POST
- OTHER_POST
Next, we built a training set. Then we ran the simple k-means algorithm with various settings to cluster our weblog collection. The results are shown in Fig. 3.
The experiment shows that the within-cluster sum of squared errors drops significantly, from 19 with one cluster to around 4 with ten clusters, and finally to 0.8 with 50 clusters.
We can clearly see that when the number of clusters increases from 30 to 50, the sum of squared errors decreases only slightly, by around 0.2. We explain this by the existence of a few weblogs that have a unique combination of posts in different languages, as well as weblogs whose properties are far from the most common ones.
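To make the measured quantity concrete, here is a minimal pure-Python sketch of k-means that reports the within-cluster sum of squared errors for several numbers of clusters. It is not the tool used for the experiment. The random initialisation, the fixed seed and the sample vectors (hypothetical shares of Arabic, English, French and other posts per weblog) are all assumptions made for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd's k-means. Returns (centroids, within-cluster SSE)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k initial centroids from the data
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    sse = sum(min(squared_distance(p, c) for c in centroids) for p in points)
    return centroids, sse


# Hypothetical weblogs: (ARABIC_POST, ENGLISH_POST, FRENCH_POST, OTHER_POST) shares.
weblogs = [
    (0.9, 0.1, 0.0, 0.0), (1.0, 0.0, 0.0, 0.0), (0.8, 0.0, 0.0, 0.2),
    (0.1, 0.9, 0.0, 0.0), (0.0, 0.8, 0.2, 0.0), (0.2, 0.0, 0.8, 0.0),
]
for k in (1, 2, 3, 6):
    _, sse = kmeans(weblogs, k)
    print(k, round(sse, 3))
```

With as many clusters as distinct weblogs the error reaches zero, which is why the last clusters added in such an experiment mostly isolate unusual weblogs.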
Experiment 4:
Posting rate graph (per day): tracking posting behavior during the war in Lebanon
Notice that the number of posts per day in our collection tripled when Israel launched a military operation in Lebanon. This operation started on July 12th. Immediately afterwards, from the 14th to the 20th of July, we clearly see (Fig. 3) that bloggers reacted strongly to the event, which made the number of posts per day rise sharply.
Fig. 3 shows the number of posts per day from July 1st to August 16th, 2006.
In the fourth experiment, we want to go further with the results shown in Fig. 3.
We will show that this large increase in posts per day within the collection is closely related to posts that contain discussions and views concerning the war in Lebanon.
First, we collect the posts day by day during this period. Second, we build a corpus of posts for each day of the period. Third, we calculate the frequency of terms related to the event.
To build an appropriate set of terms for the war, we looked at the top term frequencies during the war period and selected a set of 6 terms:
- لبنان (Lebanon)
- إسرائيل (Israel)
- حزب الله (Hezbollah)
- الحرب (War)
- المقاومة (Resistance)
- الجنود (Soldiers)
Now let us see how the frequency of these terms evolves during the war period:
Fig. 4 shows the term frequencies per day during the war period.
Looking at Fig. 4, we observe a drastic increase in term frequencies per day during the war period. These frequencies jumped from zero at the beginning of June to 130 after the war started, reaching their maximum during the last 2 weeks of July.
Notice that the increases in frequency of the terms war, Hezbollah, Lebanon and Israel occur at the same time, which indicates that posts discussing the war relate it to Lebanon, Hezbollah and Israel.
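The per-day counting procedure of this experiment (collect posts per day, build a daily corpus, count the selected terms) can be sketched as below. This is an illustration only: the function name, the date format and the sample posts are hypothetical, and the sketch counts plain substring occurrences, so prefixed word forms are matched as well. The matching rule used in the actual experiment is not stated here.

```python
from collections import Counter, defaultdict

# The six war-related terms listed above.
WAR_TERMS = ["لبنان", "إسرائيل", "حزب الله", "الحرب", "المقاومة", "الجنود"]


def daily_term_frequencies(posts, terms=WAR_TERMS):
    """Count term occurrences per day.

    posts: iterable of (day, text) pairs.
    Returns a dict mapping day to a Counter of term -> number of occurrences.
    """
    per_day = defaultdict(Counter)
    for day, text in posts:
        for term in terms:
            # str.count does substring matching (an assumption of this sketch).
            per_day[day][term] += text.count(term)
    return dict(per_day)


# Hypothetical sample posts.
sample_posts = [
    ("2006-07-01", "يوم عادي"),
    ("2006-07-14", "الحرب في لبنان"),
    ("2006-07-14", "لبنان و إسرائيل"),
]
for day, counter in sorted(daily_term_frequencies(sample_posts).items()):
    print(day, dict(counter))
```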
Experiment 5:
- In this experiment we try to find out whether the Arabic weblogs contain posts written in Arabic dialects. During a random preview of the blog collection we noticed that some posts are published in Egyptian Arabic. This raised the following questions: can we show that our blog posts contain Arabic dialects? Can we identify dialectal posts using machine learning methods? To answer these questions we conducted the following experiment. We collected 200 posts written in Egyptian Arabic, taken mainly from blogs listed in the Egyptian blog count. We selected these posts manually, relying on our ability to differentiate between Egyptian Arabic (EGY) and Modern Standard Arabic (MSA). After collecting them, we built unigram and bigram tables with normalized term frequencies for the EGY corpus. The result was a large list of unigrams and bigrams ordered by decreasing normalized TF.
- Our aim in this experiment was to build a feature set of the most important unigrams and bigrams for the EGY corpus and use it to detect Egyptian posts in our collection. We therefore filtered the n-gram tables and kept only the terms that are specifically Egyptian and are not used in standard Arabic. In this way we built a mixed table of terms (unigrams and bigrams) containing the top 500 Egyptian Arabic unigrams and bigrams.
| Features 1-100 | Features 101-200 | Features 201-300 | Features 301-400 | Features 401-500 |
| مش
بس
اللى
ده
اللي
دى
علشان
دا
كده
زى
إللي
شوية
برضه
دي
مش عارف
عشان
كدة
انا مش
عايز
زي
لسه
بتاع
محدش
شويه
كويس
النهاردة
دلوقتى
عاوز
ومش
ازاى
بس مش
بره
معاك
معاه
واحنا
معاهم
بتوع
دة
اشوف
مافيش
هوه
ليا
صاحى
ابويا
معلش
واللى
رايح
مفيش
زى كل
ليهم
بدأ
ورا
بيعمل
جت
جه
تشوف
بيتكلم
بيقول
إللى
دلوقتي
تيته
وجيت
وده
شايف
أنا مش
امبارح
بتاعت
عايشه
معرفش
بعدين
مره
اخويا
عايش
بص
عايزة
الكلام ده
مش بس
عاوزين
بيحب
اوى
إزاى
جايين
عايزين
طبعن
شوف
بلاش
زى ما
يعنى مش
معلش يا
مش قادر
انت مش
كده و
جنبنا
ولسه
لسة
أشوف
شوفت
لابس
وإللي
وبس
|
تبص
بعد كدة
بس انت
بس انا
بعد كده
ده مش
مش ممكن
انا عاوز
بس علشان
برضه مش
مش عارفة
هيبقى
هيه
المية
ازيك
يشوف
جوه
برده
يابو
خليك
كورة
عايزه
بس كده
اللى اتقال
اللى مش
هو ده
مش كده
ده غير
مين اللى
علشان كده
اللى عايز
بس دى
كل ده
ليه مش
زي ما
بنتكلم
ماشي
نشوف
الحته
ماكنتش
يبص
برس
وقالى
بتاعته
وش
ينفعش
باين
الكبايه
العيال
بحاول
ابوك
أوى
بيعرف
ورايا
ياد
ازاي
وانته
تتقال
برة
انهاردة
دلوقت
مالوش
حيبقى
جاى
بد
بتاعتهم
وف
ميعرفوش
شايفنى
دول مش
دا مش
البتاع دا
يالا علشان
انا ايوا
بس بعد
بس دا
ما فيش
مش زي
طب ليه
انا كده
انا عايز
واللا
ملوش
باصص
بتاعه
بيفرش
مالهاش
وشه
اخدتش
بيحصل
هتقول
يادى
الزملا
إزاي
دايماً
اقدرتش
بحس
يادرش
اديك
بيتقال
|
وبس كنتش
ميعرفش
هتلاقيه
ماشيين
كانش
الكيف
بصت
مشفتهاش
بيكون
ومحدش
هتلاقي
ماليش
هيطلع
ييجي
هفضل
مالقيتش
ديه
راح مزعق
جابتهمش ولادة
آيه دا
اوى لما
كمان في
عن الي
مش واخد
مش عايز
مش حاسين
زي محمود
شايف ان
دا انا
بيعمل ايه
إللى فى
الصورة دي
الله مش
ان كدة
مصرى بس
اللى هما
عملية فَش
رقمك بأى
واحد زي
محدش قادر
مش انا
لسه كتير
يعلم بيها
مش عندك
ما تاخدش
اه دا
واللى ليه
لأ انا
وبتلاقي اللي
اصبح مش
مش لازم
وهو بيتكلم
استنى بس
بيسدد في
بنعمل حساب
كان بيعمل
ايه رايكم
كبير بس
بس محدش
بس في
ايه ده
دى زى
الراجل دة
ما تشوف
عفش مستعمل
مش على
طب ده
إعتبار إن
إللي إحنا
عايز انام
بس برضة
بس كان
للحاله دى
عشان كدا
دلوقتى مش
تانية مش
التجربة اللى
كنت شايف
دى مش
احنا فى
حاجه هتعملها
دى من
ما نقدرش
اهو برضة
و العيال
خلاص بس
مش مصرى
البوست دا
جبته و
ازاى قولت
ده انتوا
بره انا
دا فى
وجيت جايب
اديك تقول
مش هامشي
بس أنا
مش مناسب
كلام برضه
بره مصر
|
رمضان بس
نفسك مش
باللي فيه
دماغها راحت
أيامكم دي
ولا دى
البنت اللى
بكون فيها
طب فين
أنا بس
مش دلوقتى
بوجى و
ماشى فى
ايوا يا
ما تعرفش
ان الكلام
دى مصر
عايز من
وخلاص اتعجنا
بس وأنا
مالوش دعوة
معرفش ليه
الطابور ده
اللي جنبك
على ده
مكان محدش
ماشى عكس
ما عملتش
لية في
يبص على
اللى شافها
علشان انا
الست اللي
مش موضوعنا
بجد مش
اللي فات
على عفش
اوى انا
دى انا
ما اكلتش
دايما يقول
ده بقه
وما ينفعش
العيب مش
انا كمان
ما جابتهمش
دول لو
مش سامع
مش هاكل
بتاع الجرايد
من بره
وتلاقي اللي
مش فاهم
كده بس
ده و
دلوقتى انا
دي بتكتب
ما فيهاش
رايح الشغل
مش موجود
لكن ده
الى بيتصل
برضه لازم
دي صورتها
اللى راح
الحالة اللى
عارف ايه
بس الى
وصحبى نتناقش
بيتغير
بغيظ
هشرب
الجايه
بيقف
لوحدى
بيضربوا
بيلبسو
هتلاقوا
جزورى
لاقية
السكه
شايله
يطنش
فيهاش
بيجيبوا
وهوه
بتيجى
باكل
بتاعتة
بتاعتى
بيشوفه
لكدة
السودة
والعيال
هتكون
بتاعكم
زيها
فقرر
مية
عيال
|
جى
ياجدع
اللحاق
بتقوله
حاطين
عملتش
هيعرف
زيهم
وكده
الميه
مرضتش
هتعملها
حيقدروا
لية
بتتحرك
ماية
دقايق
زهقانة
إمبارح
علاقاته
واخده
واهو
بنعمل
ماكانش
ماسمعتش
العماره
بوليصة
بقعد
بيسمعوا
لزمة
هايل
هانعمل
بقم
بقاش
ودست
بقك
ماعرفش
هاكل
يجى
يجي
بيسدد
وعلشان
بتشوف
عايزاه
البادي
وتشوف
وازاى
بتتبهدل
وجاى
ترابيزة
شوفناها
جبته
ماينفعش
ساب
وماليش
انهارده
بقا
هتعمل
استنوا
بتمشى
بتمشي
بتاعى
وإيه
شايفاه
ممعكش
كويسة
بكون
بتجرى
بيقوله
والريس
مافيهوش
ريقى
تعرفش
بنبتدي
مكنتش
ومافيش
هعرف
جايب
بيقولى
واخد
مينفعش
الفيش
جابتهمش
بنعرف
سايد
للي
باللي
بتاعتها
بتدور
منطقه
لغه
خش
هويدي
تسعيرة
وبتبقى
حدوته
شايل
مزنوق
بيحبوا
مابقاش
|
Table 1. The 500 EGY unigrams and bigrams
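The unigram and bigram tables from which these features were filtered can be approximated with the sketch below. It assumes whitespace tokenization and defines the normalized term frequency as the n-gram count divided by the total number of n-grams of that size in the corpus. Both choices are assumptions, since the exact normalization is not specified here. The function name and the sample posts are hypothetical.

```python
from collections import Counter


def ngram_normalized_tf(posts, n):
    """Return [(ngram, normalized_tf)] sorted by decreasing normalized TF.

    posts: iterable of post texts; n: 1 for unigrams, 2 for bigrams.
    """
    counts = Counter()
    for text in posts:
        tokens = text.split()  # whitespace tokenization (assumption)
        counts.update(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    total = sum(counts.values())
    if total == 0:
        return []
    # Sort by decreasing frequency, breaking ties alphabetically.
    return sorted(
        ((gram, count / total) for gram, count in counts.items()),
        key=lambda item: (-item[1], item[0]),
    )


# Hypothetical two-post EGY corpus.
egy_posts = ["انا مش عارف", "مش عارف ليه"]
unigrams = ngram_normalized_tf(egy_posts, 1)
bigrams = ngram_normalized_tf(egy_posts, 2)
print(unigrams[:3])
print(bigrams[:3])
```

The filtering step that keeps only specifically Egyptian terms was done by hand in the experiment and is not covered by this sketch.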
- For this experiment we used a data set of 1000 blog posts labeled as EGY or MSA. We trained Support Vector Machines on this data. We then completed a series of 10 test experiments with various settings. In each setting, we ran tests with a decreasing number of features, from 500 down to 5. The features were selected randomly for each test. These are the results we obtained for the classification of posts into MSA and EGY:
| Number of features | Correctly classified instances | Incorrectly classified instances |
| 500 | 94.42% | 5.58% |
| 450 | 94.39% | 5.61% |
| 400 | 94.23% | 5.77% |
| 350 | 94.01% | 5.99% |
| 300 | 93.71% | 6.23% |
| 250 | 93.06% | 6.94% |
| 200 | 91.75% | 8.25% |
| 150 | 90.36% | 9.64% |
| 100 | 87.65% | 12.35% |
| 50 | 84.20% | 15.80% |
| 40 | 81.81% | 18.19% |
| 30 | 79.94% | 20.06% |
| 20 | 78.21% | 21.79% |
| 10 | 73.74% | 26.26% |
| 5 | 70.34% | 29.66% |
Table 2. Results of the EGY/MSA blog post classification.
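The feature preparation for these tests can be sketched as follows: a random subset of the feature list is drawn, and each post is turned into a vector over that subset. This is a sketch under assumptions: binary presence/absence values and whitespace tokenization are our choice for illustration, the function names are hypothetical, and the training of the classifier itself is not shown.

```python
import random


def select_features(features, size, seed=None):
    """Draw a random subset of `size` features without replacement."""
    rng = random.Random(seed)
    return rng.sample(features, size)


def feature_vector(text, features):
    """Return 1 for each unigram/bigram feature present in the post, else 0."""
    tokens = text.split()  # whitespace tokenization (assumption)
    unigrams = set(tokens)
    bigrams = {" ".join(pair) for pair in zip(tokens, tokens[1:])}
    present = unigrams | bigrams
    return [1 if feature in present else 0 for feature in features]


# A few features taken from Table 1, and a hypothetical post.
features = ["مش", "بس", "مش عارف", "عشان", "كده"]
subset = select_features(features, 3, seed=1)
print(subset)
print(feature_vector("انا مش عارف", features))
```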
- Table 2 shows that the best classification result, 94.42% correct, was achieved with 500 features. The worst result, 70.34%, was obtained with only 5 features. The percentage of correctly classified posts increases with the number of features. However, the gain shrinks from 3.4 percentage points between the tests with 5 and 10 features to 0.03 percentage points between the tests with 450 and 500 features.
We can deduce from these results that increasing the number of features beyond 500 would hardly improve the percentage of correctly classified posts. In summary, we conclude that machine learning methods can identify the dialect of Arabic blog posts automatically with very good accuracy: they are able to distinguish between blog posts in MSA and EGY.
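The diminishing returns can be checked directly from Table 2 by computing the gain in accuracy between each pair of consecutive feature counts. The short sketch below uses only the values from the table; the variable names are ours.

```python
# Correctly classified instances (%) per number of features, from Table 2.
accuracy = {
    500: 94.42, 450: 94.39, 400: 94.23, 350: 94.01, 300: 93.71,
    250: 93.06, 200: 91.75, 150: 90.36, 100: 87.65, 50: 84.20,
    40: 81.81, 30: 79.94, 20: 78.21, 10: 73.74, 5: 70.34,
}

sizes = sorted(accuracy)
# Gain in percentage points when moving from one feature count to the next.
gains = {
    (a, b): round(accuracy[b] - accuracy[a], 2)
    for a, b in zip(sizes, sizes[1:])
}
for (a, b), gain in gains.items():
    print(f"{a:>3} -> {b:>3} features: +{gain:.2f} points")
```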
Experiment 7: Categorising the Arabic blogs collection
- In this experiment we want to develop methods that allow us to categorise the Arabic weblog collection. We focus on categorizing Arabic blog posts using the categories of Arabic news websites. For this purpose we chose two Arabic news portals, Aljazeera (http://www.aljazeera.net) and BBC Arabic (http://www.bbcarabic.com). Tables 5 and 6 list the article categories of BBC Arabic and Aljazeera respectively.
| 1 | World |
| 2 | Middle east |
| 3 | Sport |
| 4 | Business |
| 5 | Science & Technology |
Table 5. BBCArabic.com categories
| 1 | Arabic |
| 2 | International |
| 3 | Economy |
| 4 | Sport |
| 5 | Arts & Culture |
| 6 | Health & Medicine |
| 7 | Variety |
Table 6. Aljazeera.net categories