Arabic weblogs wiki

 

Discovering topics in Arabic blogs

Page history last edited by woiyl 2 yrs ago

 


 


 

 

Introduction

 

 

Arabic blogs contain posts with rich content on a variety of subjects. Experiments on this content can reveal the topics that Arab bloggers discuss. To better understand the nature of Arabic blogs, we formulate the following questions:

  • Can we separate Arabic blogs into personal and non-personal blogs?

  • What are the categories found in non-personal blogs?

In this section, we describe the measurements and experiments that we carried out on the content of Arabic blogs. The measurements highlight the predominant themes found in a collection of blogs. The experiments characterize blog posts as having personal or non-personal content and identify certain topics in blog posts. We report the results of these experiments and, at the end of the section, suggest ways to improve them.

 

 

Topics in Arabic blogs

To show that Arabic blogs discuss different topics, we measured the distribution of topics across Arabic blogs. We adopted the same approach that was used to survey active online Arabic forums (see section 3.2).

 

We examined 1480 Arabic blogs from the Arabic blogs dataset. The blogs were selected manually and categorized by their predominant theme and content. Each blog was assigned to one of 7 main categories.

Figure 14 describes the main topics that we could find in blogs and the results associated with each topic: 

 

 

Figure 14: Statistics about various topics found in Arabic blogs

 

 

The results of the survey show that Arabic blogs discuss various topics. Blogs dealing with personal issues are predominant: they represent 43.9% of the blogs assessed in the survey. Blogs on political subjects account for 21.6%, and blogs about arts and culture for 14%. Blogs related to science and to technology represent 7.4% and 5.2% respectively. Finally, photoblogs and entertainment blogs represent 2.4% and 1.6% respectively.

 

 

 

Personal/Non-personal posts classification

 

  

The measurements in the previous subsection showed that personal blogs are the most common blog type in the survey. We noticed that personal blogs differ from other (non-personal) blogs in many respects. This motivated us to examine the particularities of personal and non-personal blog posts and to investigate ways to distinguish between them.

 

To this end, we explored machine learning algorithms to identify personal and non-personal posts in the Arabic blogs dataset. First, we built two corpora of blog posts: one of personal posts and one of non-personal posts. We then used classification methods to distinguish between personal and non-personal Arabic blog posts.

 

Preparing the dataset

The experimental data was collected from the Arabic blogs dataset. We reused the solutions implemented in the previous experiments to extract posts. We followed these procedures:

  • Processing the blogs dataset

 

The experimental data was extracted from the blogs dataset. We used posts from the blogging platforms Maktoob and MSN Spaces, because we found that MSN Spaces hosts many personal posts whereas Maktoob hosts many non-personal posts.

 

  • Building personal and non-personal posts corpora

We selected posts manually in order to distinguish between personal and non-personal posts. This procedure was convenient for building the corpora.

Both procedures enabled us to prepare the data for the experiment. We built two corpora with an equal number of posts (Table 11):

Table 11: The amount of posts in personal and non-personal corpora

  

| Corpus | Amount of posts |
| --- | --- |
| Personal posts corpus | 500 |
| Non-personal posts corpus | 500 |

 

 

Preparing the feature set

The corpus of personal posts was used to construct the feature set. Personal posts have distinguishing properties when compared to non-personal posts. Among these properties, we noticed a significant use of Arabic dialects: almost 75% of the posts in the corpus of personal posts are written in Arabic dialects.

To prepare the feature set, we used the corpus of personal posts to build unigram and bigram models, and then combined both models into one model representing the corpus. This allowed us to choose the most frequent n-grams in the corpus as features. The table below shows a sample of the feature set:

 

 

 



Table 12: Some features representing the corpus of personal posts

 

من - في - على - ما - و - لا - بس - كل - اللي - انا - ولا - الله - أن - يا - شي - ان - عن - لي - مع - يوم - والله - هذا - او -كان - بعد - لو - يعني - وش - حتى - فيه - المهم - لك - لم - الي - كنت - اليوم - اني - مو - قلت - الا - أنا - له - وانا - هو - التي - قلبي - اذا - الذي - كيف - هي - الى علي - كانت - فيها -لكم - بين - الحب - عاد - غير - إن - إلى - عليكم - عليه - لها - هنا - لكن - أو - عندما - هع - واحد - قبل - إلا - انه - تلك - مرا - عشان - راح - الناس - احد - فلا - هذه - انت - كم - هل - قال - ترى - اول - وفي - قد - نفسي - هذي - تقول - الدنيا - وما - لازم - ومن - عليك - يكون - هناك - عليها

 

 

After preparing the feature set, we observed that almost 20% of it consists of words in Arabic dialects. In Table 12, the features identified as Arabic dialect words are set in bold. We also observed frequent use of singular personal pronouns and other singular pronouns (لك, لي, انا), singular nouns and adjectives (نفسي, قلبي), and singular verbs and adverbs (ترى, قلت, كنت).
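The feature-selection step described above (one combined unigram and bigram model, from which the most frequent n-grams are kept) can be sketched roughly as follows. This is only an illustration: the function names, the whitespace tokenizer and the sample posts are assumptions, since the text does not say how posts were tokenized or normalized.

```python
from collections import Counter


def ngram_counts(posts, tokenize=str.split):
    """Count unigrams and bigrams of all posts in one combined model.

    `tokenize` defaults to whitespace splitting, which is an assumption;
    the original tokenizer is not described.
    """
    counts = Counter()
    for post in posts:
        tokens = tokenize(post)
        counts.update(tokens)  # unigrams
        counts.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))  # bigrams
    return counts


def select_features(posts, k):
    """Return the k most frequent n-grams of the combined model."""
    return [ngram for ngram, _ in ngram_counts(posts).most_common(k)]


# Hypothetical miniature corpus built from words that appear in Table 12.
personal_posts = ["انا كنت هنا", "انا قلت لك", "انا كنت"]
features = select_features(personal_posts, k=3)
```

In the experiment the same idea would be applied to the 500 personal posts, with `k` set to the desired number of features (up to 500).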

 

 

Experimental settings and results

 

In the personal/non-personal posts classification experiment, we first prepared the dataset from the personal and non-personal posts corpora. We used 1000 blog posts in total, equally divided between the two classes. We then chose the classification tools. We kept the same settings as in the previous experiment (the Weka implementation of SMO for training an SVM, see section 4.1) because SVM showed the best performance in text classification tasks [64].

 

To measure the performance, we performed classification tests using 10-fold cross-validation. For each fixed number of features, we repeated the test 10 times and averaged the results. The table below shows these results:

Table 13: The results of Personal/Non-personal posts classification

 

| Number of features | Classification results |
| --- | --- |
| 500 | 90.3% |
| 400 | 90% |
| 300 | 89.5% |
| 200 | 88.8% |
| 100 | 86.6% |
| 50 | 81% |
| 40 | 79.3% |
| 30 | 74.7% |
| 20 | 71.4% |
| 10 | 65.6% |

 

The classification performance reached 90.3% with the complete feature set (500 features), but dropped below 70% (65.6%) when only 10 features were used. In general, the classification performance improved as we gradually increased the number of features. Nonetheless, there are several possibilities for improving the classification results further.
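The evaluation protocol used for Table 13 (10-fold cross-validation, repeated 10 times and averaged) can be sketched as below. This is not the Weka procedure itself: the helper names are hypothetical, the per-fold accuracies are simply averaged, and `train_majority` is a trivial stand-in for the SVM learner so that the sketch stays self-contained.

```python
import random
from statistics import mean


def k_fold_indices(n, k, rng):
    """Shuffle the indices 0..n-1 and split them into k nearly equal folds."""
    indices = list(range(n))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(samples, labels, train_fn, k=10, repeats=10, seed=0):
    """Mean accuracy over `repeats` runs of k-fold cross-validation.

    train_fn(train_samples, train_labels) must return a predict(sample)
    callable. Requires len(samples) >= k so that no fold is empty.
    """
    rng = random.Random(seed)
    accuracies = []
    for _ in range(repeats):
        for fold in k_fold_indices(len(samples), k, rng):
            held_out = set(fold)
            train_idx = [i for i in range(len(samples)) if i not in held_out]
            predict = train_fn([samples[i] for i in train_idx],
                               [labels[i] for i in train_idx])
            correct = sum(predict(samples[i]) == labels[i] for i in fold)
            accuracies.append(correct / len(fold))
    return mean(accuracies)


def train_majority(train_samples, train_labels):
    """Hypothetical stand-in learner: always predicts the majority label."""
    majority = max(set(train_labels), key=train_labels.count)
    return lambda sample: majority
```

Any real learner can be plugged in through `train_fn`; the balanced 500/500 corpora would make the majority baseline hover around chance, which is the reference point for the figures in Table 13.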

 

 

Topic identification in blog posts

 

 

The Arabic blogs dataset contains posts that cover various topics. After examining many samples from the blogs dataset, we noted that non-personal blogs discuss numerous issues. They primarily cover news on various subjects: politics, economy, sport, technology, health, culture and religion. The variety of topics in Arabic blogs motivated us to investigate ways to discover these topics in non-personal blog posts.

 

In the following paragraphs, we explore machine learning methods to identify certain topics in non-personal blogs. We start by describing the labeled data used to conduct the experiments. Then, we describe the system used to build this data. After that, we conduct experiments to see which categories can be identified in blog posts. Finally, we discuss the results and suggest new ideas to enhance topic identification in non-personal blog posts.

 

 

Preparing the dataset

 

We need labeled data to conduct the experiments. This data can be harvested from web portals that provide online news, such as BBC Arabic[1] and Aljazeera[2]. Their news is categorized as shown in the table below:

Table 14: The news categories of Aljazeera and BBC Arabic

 

| No. | Aljazeera news categories | BBC Arabic news categories |
| --- | --- | --- |
| 1 | Arabic (Arab world political news) | Economy and business |
| 2 | Arts and culture | Middle East (Middle East political news) |
| 3 | Economy | Science and technology |
| 4 | Health and medicine | Sport news |
| 5 | International (world political news) | World news (world political news) |
| 6 | Sport | |

 

 

 

Table 14 shows that both Aljazeera and BBC Arabic publish news under various labels. This labeled news can be used to conduct the experiments. News articles therefore need to be extracted from the web pages of Aljazeera and BBC Arabic and collected to build the experimental dataset. This task is handled by the system illustrated in the figure below:

 

Figure 15: The architecture of the system for building the news dataset

 

 

Figure 15 shows the system for building the news dataset. This system consists of two modules: news feeds and news content. The first module manages access to the feeds of the web portals Aljazeera and BBC Arabic. The second module extracts the content from these feeds and adds it to the news dataset.

Basically, the system works as follows:

 

  • The news feeds module connects to the web portals Aljazeera and BBC Arabic (1). After that, it retrieves the links to their feeds (2, 3). These feeds contain news that Aljazeera and BBC Arabic broadcast periodically. Further, this module saves these links and transfers them to the news content module (4).

 

  • The news content module uses the received information to update the local index (5). The local index holds all the links that the system has accessed before. While updating the local index, this module connects again to the web portals in order to collect the news content from the feeds (6). When the news content is available, it is first processed by the system (7) and then it is stored locally (8).

This system was used to prepare two collections of online news articles: the Aljazeera collection and the BBC Arabic collection.
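The interplay between the local index and the news content module (steps 5 to 8 above) can be sketched as follows. Several things here are assumptions made for illustration only: that the feeds are RSS 2.0 documents, the function names, the placeholder whitespace clean-up, and the injected `fetch` callable that stands in for the HTTP download.

```python
import xml.etree.ElementTree as ET


def extract_links(feed_xml):
    """Return the article links listed in an RSS 2.0 feed document (assumed format)."""
    root = ET.fromstring(feed_xml)
    return [el.text.strip() for el in root.findall("channel/item/link") if el.text]


def collect(feed_xml, local_index, fetch, store):
    """Fetch, process and store every article that is not yet in the local index.

    local_index: set of links already accessed (step 5)
    fetch:       callable link -> raw article text (step 6, injected)
    store:       dict link -> processed text, standing in for local storage (step 8)
    """
    for link in extract_links(feed_xml):
        if link in local_index:
            continue  # already collected in an earlier run
        local_index.add(link)
        raw = fetch(link)
        store[link] = " ".join(raw.split())  # step 7: placeholder processing
    return store
```

Keeping the index as a set of links makes repeated runs over a periodically updated feed idempotent: an article is downloaded only the first time its link is seen.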

 

 

Aljazeera collection

 

The collection contains more than 4680 Arabic articles. Each article belongs to one of the Aljazeera news categories. The largest categories are the two political ones, international and Arabic political news, which together represent 50% of the articles in the collection. Sport and economy articles make up 37%, and the remaining 13% is shared by health and medicine, arts and culture, and other news. The figure below describes the Aljazeera collection:

 

 

Figure 16: The amount of articles at each category of Aljazeera.net

 

 

BBC Arabic collection

This collection contains around 2000 Arabic articles. Each article belongs to one of the BBC Arabic news categories. Two categories, world and Middle East political news, represent 80% of the articles. The other categories account for the remaining 20%: economy and business with 8%, science and technology with 7%, and sport news with 5%. The figure below illustrates the structure of the collection:

 



Figure 17: The amount of articles at each category of BBCArabic.com

 

Both the blogs dataset and the news dataset were used in the experiment.

We built the training sets from the online news articles in the Aljazeera and BBC Arabic collections. In addition, we created two evaluation sets: the first consists of posts from the blogs dataset, and the second of articles from the news dataset. The following paragraphs describe the procedures used to prepare these sets:

 

 

Aljazeera training set

This set contains the training data for the categories defined by the news portal Aljazeera.net. There are six training sets in total, one for each topic: Arabic, Arts/Culture, Economy, Health/Medicine, International and Sport. The training set of each topic consists of 200 samples: 50% are labeled with that topic and the other 50% are drawn from the remaining five topics (10% each). The structure of the Aljazeera training sets is shown in the table below:

 

Table 15: The training sets of Aljazeera topics (six training sets)

 

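| Training set (articles with topics) | Arabic | Arts & Culture | Economy | Health & Medicine | International | Sport |
| --- | --- | --- | --- | --- | --- | --- |
| Arabic | 50% | 10% | 10% | 10% | 10% | 10% |
| Arts/Culture | 10% | 50% | 10% | 10% | 10% | 10% |
| Economy | 10% | 10% | 50% | 10% | 10% | 10% |
| Health/Medicine | 10% | 10% | 10% | 50% | 10% | 10% |
| International | 10% | 10% | 10% | 10% | 50% | 10% |
| Sport | 10% | 10% | 10% | 10% | 10% | 50% |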
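The proportions in the table above amount to a one-versus-rest sampling scheme: half of each training set comes from the target topic and the other half is split evenly over the remaining topics. A minimal sketch, assuming random sampling (the text does not say how the articles were drawn) and using a hypothetical `build_training_set` helper:

```python
import random


def build_training_set(articles_by_topic, target, size=200, seed=0):
    """Build a binary training set for one topic.

    Half of the samples are articles of `target` (label 1); the other half
    is split evenly over the remaining topics (label 0).
    """
    rng = random.Random(seed)
    others = [topic for topic in articles_by_topic if topic != target]
    n_positive = size // 2
    n_per_other = (size - n_positive) // len(others)

    samples = [(article, 1)
               for article in rng.sample(articles_by_topic[target], n_positive)]
    for topic in others:
        samples += [(article, 0)
                    for article in rng.sample(articles_by_topic[topic], n_per_other)]
    rng.shuffle(samples)
    return samples
```

With six topics and 200 samples this gives 100 target articles and 20 from each other topic (10% each); with five topics it gives 25 from each other topic (12.5% each).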

 

 

BBC Arabic training set

All categories of BBC Arabic are present in the set. We have five training sets, one for each topic: Middle East, economy/business, science/technology, sport and world. Each has 200 samples available for training: 50% are labeled with that topic and the other 50% are drawn from the four remaining topics (12.5% each). The table below shows this structure for all topics:

 

 

 

 

 

 

 

 

Table 16: The training sets of BBC Arabic topics (five training sets)

 

| Training set (articles with topics) | Middle East | Economy & Business | Science & Technology | Sport News | World News |
| --- | --- | --- | --- | --- | --- |
| Middle East | 50% | 12.5% | 12.5% | 12.5% | 12.5% |
| Economy/Business | 12.5% | 50% | 12.5% | 12.5% | 12.5% |
| Science/Technology | 12.5% | 12.5% | 50% | 12.5% | 12.5% |
| Sport | 12.5% | 12.5% | 12.5% | 50% | 12.5% |
| World | 12.5% | 12.5% | 12.5% | 12.5% | 50% |

 

 

Evaluation set

 

We used two evaluation sets to assess the performance of the classification. The first set contained 100 samples and consisted of blog posts manually selected for each topic. For the Aljazeera categories, we needed six test sets, each containing 50% of posts labeled with one topic and the remaining 50% labeled with the other five topics. An example evaluation set is shown below:

Table 17: The evaluation set of Aljazeera topic “Arabic”

 

| Evaluation set (posts with topics) | Arabic |
| --- | --- |
| Arabic | 50% |
| Arts / Culture | 10% |
| Economy | 10% |
| Health / Medicine | 10% |
| International | 10% |
| Sport | 10% |

 

 

To evaluate classification with the BBC Arabic categories, we made 5 data sets in total; each has 50% of posts labeled with one topic, and the four remaining topics make up the other 50% with 12.5% each. The example below shows the data proportions of the evaluation set for posts about Middle East news:

Table 18: The evaluation set of BBC Arabic topic “Middle East”

 

| Evaluation set (posts with topics) | Middle East |
| --- | --- |
| Middle East | 50% |
| Economy / Business | 12.5% |
| Science / Technology | 12.5% |
| Sport | 12.5% |
| World | 12.5% |

 

 

The second set consisted of 100 samples of news articles. As with the set of blog posts, we prepared six test sets with the same configuration (Table 17) for the Aljazeera categories and five test sets with the same configuration (Table 18) for the BBC Arabic categories.

   

 

Preparing the feature set

We used data from Aljazeera and BBC Arabic collections to prepare feature sets. First, we gathered news articles to build up a corpus for each topic. Second, we constructed unigram and bigram models from the corpus of each topic separately. Finally, we used a combination of unigram and bigram models in order to represent each corpus.

After performing this task, we obtained six feature sets for the six Aljazeera topics and five feature sets for the five BBC Arabic topics. We adopted a common approach to select features: for each topic, we simply chose the most frequent n-grams in its corpus as classification features. That is, we took the model of a given topic, ranked its n-grams by frequency, and set a threshold on the number of features to include in the feature set. We gave each topic a feature set of 250 features and proceeded with the experiment.
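Once a topic has its feature set, each article or post has to be turned into a vector over those n-grams before it can be given to the classifier. The text does not state how feature values were computed, so the sketch below assumes raw occurrence counts and whitespace tokenization; `to_vector` is a hypothetical helper name.

```python
from collections import Counter


def to_vector(text, features, tokenize=str.split):
    """Map a text to a list of counts, one per feature (unigram or bigram).

    Raw counts are an assumption; binary or weighted values would fit the
    same interface.
    """
    tokens = tokenize(text)
    counts = Counter(tokens)
    counts.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return [counts[feature] for feature in features]


# Hypothetical three-feature set; the experiment used 250 features per topic.
features = ["oil", "oil prices", "football"]
vector = to_vector("oil prices rise as oil demand grows", features)
# vector == [2, 1, 0]
```

Because the features are selected from news articles, a blog post that uses different vocabulary for the same topic yields a mostly zero vector, which is one way to read the gap between the post and article results.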

 

Experimental settings and results

 

To carry out the experiment, we first selected the appropriate training data for each topic of BBC Arabic and Aljazeera. Then, after specifying the topic to identify, we used the Weka implementation of SMO to train the SVM (see section 4.1). Finally, we assessed the learner of each topic with the corresponding test data using 10-fold cross-validation. The classification results of topic identification in blog posts and news articles using the Aljazeera dataset are shown in the table below:

 

Table 19: The classification results of Aljazeera topics

 

| Topic | Classification results (posts) | Classification results (articles) |
| --- | --- | --- |
| Arabic | 59% | 97.5% |
| Arts / Culture | 69% | 91% |
| Economy | 79% | 93.5% |
| Health / Medicine | 84% | 94.5% |
| International | 49% | 72% |
| Sport | 72% | 98% |

 

 

Looking at the results, we observe that automatic identification of topics in posts using our method is acceptable for subjects such as health/medicine (84%) and economy (79%). However, it is insufficient for themes like sport (72%) and arts/culture (69%), and very poor for Arabic and international topics (59% and 49% respectively). In contrast, the classification results for news articles are acceptable for most subjects, with the exception of international topics (72%).

  

Next, we provide the classification results of topic identification in blog posts and news articles using the BBC Arabic dataset:

 

Table 20: The classification results of BBC Arabic topics

| Topic | Classification results (posts) | Classification results (articles) |
| --- | --- | --- |
| Economy / Business | 77% | 80.5% |
| Middle East | 66% | 88.5% |
| Science / Technology | 64% | 80.5% |
| Sport | 51% | 92.5% |
| World | 49% | 61% |

 

 

For posts, the table above shows that the topic economy/business reaches 77% correct classification, the best result among the BBC Arabic topics. The other topics have insufficient classification results, ranging from 49% for world news to 66% for Middle East news. For news articles, the topic sport has the best classification result (92.5%), followed by Middle East news (88.5%), and economy/business and science/technology with the same result (80.5%). The worst classification result was for the topic world news (61%).

To understand the results of topic identification in blog posts, we compare the topics that are common to Aljazeera and BBC Arabic, as shown below:

Table 21: The classification results of blog posts and news articles 

| Topic | Aljazeera (posts) | Aljazeera (articles) | BBC Arabic (posts) | BBC Arabic (articles) |
| --- | --- | --- | --- | --- |
| Economy / Business | 79% | 93.5% | 77% | 80.5% |
| Sport | 72% | 98% | 51% | 92.5% |
| International / World | 49% | 72% | 49% | 61% |
| Arabic / Middle East | 59% | 97% | 66% | 88.5% |

 

The classification results of the Aljazeera dataset are generally better than those of the BBC Arabic dataset (Table 21), for both blog posts and news articles. The topic international/world has the lowest classification results, and the topic Arabic/Middle East does not exceed 66% correct classification among posts. In contrast, the topic health/medicine of Aljazeera has the best classification result for posts, 84% (Table 19), and the topic economy/business reaches at least 77% correct classification. We therefore consider posts on economy, business, medicine and health easier to identify than posts on Arabic and international issues.

 

To explain the poor classification performance for certain topics, we examine the features used for the topics that gave bad results. For BBC Arabic, these are the topics Middle East and world. The tables below list 13 relevant features that were used for the classification of these topics:

Table 22: Selected features of the topic “Middle East” (BBC Arabic)

Translated feature

Feature (Frequency)

 

 

Table 23: Selected features of the topic “World” (BBC Arabic)

 

Translated feature

Feature (Frequency)

 

 

 

We notice from the tables above that the topic Middle East contains features that are only loosely related to the Middle East region itself, such as United States, Somali, Sudan, and Darfur. There are also features that correspond to person names, for example Bush[3] (George W. Bush), that do not seem to have a direct relation to the Middle East. Moreover, a feature like BC (BBC) is more characteristic of BBC articles than of Middle East issues. Conversely, the topic World has features that concern the Middle East region, primarily Iraq, Iran and Middle East. In addition, the person name Olmert[4] (Ehud Olmert) is more strongly associated with Middle East issues (Israel) than with world issues.

 

Next, we study samples from the feature sets of two Aljazeera topics, Arabic and International. The following tables list a few features used for the classification:

 

Table 24: Selected features of the topic “Arabic” (Aljazeera)

 

Translated feature

Feature (Frequency)

 

 

Table 25: Selected features of the topic “International” (Aljazeera)

 

Translated feature

Feature (Frequency)

Some features that we used to identify posts with the topic Arabic do not concern only Arab regional issues. They also relate to international involvement in the Middle East. Some are related to Iraq, and more specifically to the Iraq war[5], for example Bush and U.S. troops. Others are connected to the Arab-Israeli conflict[6], for instance Israel. Conversely, features used to identify posts on international issues are biased toward regional Arab affairs. Take for instance the feature Iraq: it has the highest frequency (weight), yet it is less relevant to international affairs than the feature United Nations.

 

We attribute the poor results for the topics International/World and Arabic/Middle East to their shared vocabulary, especially named entities. Many locations, organizations and person names are present in the feature sets and training corpora of both topics. In addition, although international/world posts discuss a wider assortment of topics than Arabic/Middle East posts, the features extracted from the two topics differ little from each other. They address mainly the same domain: political news.
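One simple way to put a number on this shared vocabulary would be to measure how much two feature sets overlap. The sketch below uses the Jaccard index for that purpose; it is an illustration only, was not part of the experiments reported here, and the two short feature lists are invented placeholders rather than the actual feature sets.

```python
def feature_overlap(features_a, features_b):
    """Jaccard index of two feature sets: |A ∩ B| / |A ∪ B| (0.0 if both are empty)."""
    a, b = set(features_a), set(features_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Invented placeholder lists, not the real 250-feature sets.
arabic_features = ["iraq", "bush", "israel", "baghdad"]
international_features = ["iraq", "bush", "united nations", "iran"]
overlap = feature_overlap(arabic_features, international_features)
```

A value close to 1 would indicate that two topics are described by nearly the same n-grams, which is the situation the paragraph above describes for the political news topics.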
