Arabic weblogs wiki

 

Topic tracking in Arabic blogs

Page history last edited by woiyl 2 yrs ago


 


 

Introduction

 

 

Blogs are strongly tied to time: they present posts in reverse chronological order, they are updated frequently, and they keep the history of their entries. These properties are very useful for discovering how discussions in blogs develop over a period of time, and they can be crucial for identifying trends in certain topics. For these reasons, topic tracking in blogs has been actively examined in previous studies [5, 6, 67, 68, 69]. We also want to address topic tracking for Arabic blogs using blog posts.

At the beginning of this study, we were not planning to investigate topic tracking methods in Arabic blogs. The need for this kind of experiment emerged after we started actively using the system for building the blogs dataset. We noticed that the number of posts published daily began to rise suddenly and steeply during July 2006. This unexpected increase happened in the same period in which the 2006 Lebanon War[1] (the July war) started. We first decided to look at the posts of that period to see whether they were dominated by discussions about the Lebanon war. Indeed, this was the case: the majority of the posts that we assessed were related to Lebanon war issues. These observations motivated us to carry out a case study on the Lebanon war topic in Arabic blogs.

 

The military action of Israeli troops in south Lebanon started on July 12th and lasted until the Lebanon war cease-fire on August 14th, 2006. During this period, we noticed that posting in the blogs dataset increased drastically (Figure 18). There were two peaks during the war period. At these peaks, the number of posts tripled compared to the values recorded before the war (in early June). These peaks also marked the highest posting values of summer 2006. The figure below illustrates the number of posts recorded from early June to late August 2006.

 

 

 

Figure 18: The tendency of posting per day during the summer 2006 

 

The graph above shows a large increase in posting between mid-July and mid-August 2006. The periods from July 12th to July 20th and from August 3rd to August 7th recorded the highest numbers of posts per day (1350 posts per day).

 

In this experiment, we want to find out whether the increase in posting coincided with an increased usage of Lebanon war related vocabulary in blog posts. We also want to demonstrate that this increased usage was one of the reasons for the rise in posting during summer 2006. To that end, we build the experimental data by preparing two corpora, and then apply methods to identify and track language usage during the war period. Finally, we present our results and discuss them at the end of the section.

 

Experiment

 

 

Preparing the dataset

The experimental data was taken from the Arabic blogs dataset. We extracted blog posts from this dataset and used them to build two corpora:

 

  • Corpus one consists of a set of blog posts that cover the whole period of the 2006 Lebanon war (July 12th to August 14th, 2006). It is referred to as the sample corpus.

  • Corpus two comprises a set of blog posts that fall outside the war period, either before or after it. Half of its posts were extracted from before July 1st and the other half from after August 29th. It is referred to as the standard corpus.

 

To prepare these corpora, we reused the solutions implemented in our earlier experiments to extract posts from blogs. We then verified two essential things for each post. First, we checked its publication date against the time interval that defines the sample corpus and the time intervals that define the standard corpus. Second, before assigning a post to one of the corpora, we used the language categorizer [48] to confirm that the post was written in Arabic. As a result, we built a sample corpus of 3000 blog posts and a standard corpus of 5000 blog posts (Table 26).

 

Table 26: The amount of posts in the sample and the standard corpora

Corpora            Sample corpus    Standard corpus
Amount of posts    3000             5000
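
The two checks described above (publication date and language) can be sketched as follows. This is a minimal illustration and not the original implementation: the function names are hypothetical, and `looks_arabic` is only a crude character-based stand-in for the language categorizer [48]. The 50/50 balancing of the standard corpus is not shown.

```python
from datetime import date

# Date boundaries taken from the corpus definitions above.
WAR_START = date(2006, 7, 12)
WAR_END = date(2006, 8, 14)
STANDARD_BEFORE = date(2006, 7, 1)
STANDARD_AFTER = date(2006, 8, 29)


def looks_arabic(text, threshold=0.5):
    """Hypothetical stand-in for the language categorizer [48].

    Returns True if at least `threshold` of the alphabetic characters
    fall in the basic Arabic Unicode block (U+0600 to U+06FF).
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    arabic = sum(1 for ch in letters if "\u0600" <= ch <= "\u06ff")
    return arabic / len(letters) >= threshold


def assign_corpus(published, text):
    """Return 'sample', 'standard', or None for a single post."""
    if not looks_arabic(text):
        return None  # non-Arabic posts are discarded
    if WAR_START <= published <= WAR_END:
        return "sample"
    if published < STANDARD_BEFORE or published > STANDARD_AFTER:
        return "standard"
    return None  # post falls in the gap between the two corpora
```

Posts dated between July 1st and July 11th, or between August 15th and August 29th, belong to neither corpus under these boundaries, which keeps a buffer between war-time and non-war-time language.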

 

 

Experimental settings and results

 

To perform the experiment, we adopted the approach published in [31, 34]. The first step was to demonstrate that the increase in posting was related to the Lebanon war. To this end, we looked for features associated with the war-time period, that is, features whose usage differs significantly from that found in general language. Specifically, we compared word frequencies across the sample corpus and the standard corpus, and then applied the log likelihood statistical test [35]. The log likelihood is calculated according to the following formula:

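$$LL = 2 \sum_{i} O_i \ln\left(\frac{O_i}{E_i}\right)$$

Where:

$O_i$: The observed frequency of a term in corpus $i$

$N_i$: The total frequency of all terms in corpus $i$ (the corpus size in words)

$E_i$: The expected frequency of a term in corpus $i$, determined by the following formula:

$$E_i = \frac{N_i \sum_{i} O_i}{\sum_{i} N_i}$$

The index $i$ takes the values 1 and 2 for the standard and sample corpus respectively.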
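
To make the test concrete, the following sketch computes the log likelihood of a term from its observed frequencies and the two corpus sizes, and then ranks the terms that are relatively more frequent in the sample corpus. It assumes the two-corpus form of the test (index 1 for the standard corpus, index 2 for the sample corpus) and already tokenized, non-empty corpora. The function names are illustrative and do not come from the original system.

```python
import math
from collections import Counter


def log_likelihood(o1, n1, o2, n2):
    """Log likelihood of one term across two corpora.

    o1, o2: observed frequencies of the term in corpus 1 and corpus 2.
    n1, n2: total number of word occurrences in corpus 1 and corpus 2.
    """
    total = o1 + o2
    e1 = n1 * total / (n1 + n2)  # expected frequency in corpus 1
    e2 = n2 * total / (n1 + n2)  # expected frequency in corpus 2
    ll = 0.0
    for observed, expected in ((o1, e1), (o2, e2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll


def overused_terms(standard_tokens, sample_tokens):
    """Rank terms overused in the sample corpus by log likelihood."""
    std, smp = Counter(standard_tokens), Counter(sample_tokens)
    n1, n2 = sum(std.values()), sum(smp.values())
    scored = []
    for term, o2 in smp.items():
        o1 = std[term]
        # Keep only terms relatively more frequent in the sample corpus.
        if o2 / n2 > o1 / n1:
            scored.append((term, log_likelihood(o1, n1, o2, n2)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

The log likelihood value alone does not say in which corpus a term is more frequent, so the sketch compares relative frequencies before scoring to keep only overuse in the sample corpus.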

 

After performing this test, we obtained a list of words that were overused in posts during the war-time period. The table below presents the words of this list:

 

Table 27: The list of overused words in the blog posts

 

Overused words translated into English    Overused words

 

 

The list shows that the overused words are related to Lebanon war issues. The overused words are the words with the highest log likelihood values in this test, and the most overused words appear at the top of the list (Table 27). Among the top five words we find Lebanon, Hezbollah and Israel, the principal parties in the Lebanon war conflict. Some of the words in the list are named entities that correspond to locations, for instance country names like Lebanon and Israel or city names such as Beirut and Qana. Both Israel and Lebanon were involved in the conflict, and the cities of Beirut and Qana were bombed during the war. Other words in the list are organization names such as Hezbollah (a political and paramilitary organization based in Lebanon) or person names like Nasrallah (the Secretary General of the Lebanese party Hezbollah). Both Hezbollah and Nasrallah are associated with the Lebanon war.

 

The log likelihood statistical test demonstrated that the language of the blog posts collected between July 12th and August 14th, 2006 is strongly associated with the 2006 Lebanon war: it is dominated by war-related vocabulary. This implies that the increase in posting happened because the majority of bloggers were actively covering war issues during the war-time period.

 

The second step of the experiment was to show that the overused words from the list (Table 28) were actively used in blogs during the war-time period. For this purpose, we investigated ways to track these words in the blog posts corpus. We also examined whether the usage of these words increased during the peak periods of posting (Figure 19). First, we selected from the blogs dataset all the blogs that were updated between June and August 2006. Then, we processed these blogs and extracted the posts that belong to the same period. After that, we grouped these posts by publication date, following a simple approach: posts published on the same day are assigned to the same category. As a result, we obtained daily sets of blog posts covering the whole period of the Lebanon war. Alongside this data, we chose some overused words from the list. These words were chosen manually according to their position in the list and their connection with Lebanon war related terminology.

 

Table 28: Words selected for tracking in the blog posts

 

Tracked words translated into English    Tracked words

 

 

Next, we tracked these words by counting their occurrences in each day's set of blog posts. The resulting frequencies are illustrated in the following graph:

 

 

Figure 19: The frequency of word occurrences in posts measured per day 
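
The per-day tracking described above can be sketched as follows. This is an illustrative outline under simplifying assumptions: posts are given as (publication date, text) pairs, tokens are obtained by whitespace splitting, and only exact matches are counted. Arabic words often carry attached prefixes and suffixes, so a real system would likely need normalization that is not shown here. The function name is hypothetical.

```python
from collections import Counter, defaultdict
from datetime import date


def daily_frequencies(posts, tracked_words):
    """Count occurrences of tracked words per publication day.

    posts: iterable of (publication_date, text) pairs.
    Returns a dict mapping each day to a Counter of tracked-word counts.
    """
    tracked = set(tracked_words)
    per_day = defaultdict(Counter)
    for published, text in posts:
        counts = per_day[published]  # posts of the same day share a category
        # Whitespace tokenization is a simplification (an assumption).
        for token in text.split():
            if token in tracked:
                counts[token] += 1
    return dict(per_day)
```

Days that contain posts but no tracked words still appear in the result with empty counts, which makes near-zero periods such as June visible when the frequencies are plotted.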

 

 

According to the plot, the frequencies of occurrence increased significantly between mid-July and late August, whereas these words were nearly absent in June. We notice two peak periods. In the first, between July 13th and July 18th, words like “Hezbollah”, “Lebanon”, “Israel” and “War” reached their highest frequency values (more than 90 word occurrences per day). The second, between July 26th and August 7th, is also marked by high frequency values for the words “Lebanon”, “Hezbollah”, “resistance”, “soldiers” and “Israel” (more than 70 word occurrences per day).

These findings confirm that the overused words from the list were actively used in blogs during the war-time period and were almost unused before it. Moreover, these words were used more frequently during the peak periods of posting.

 

In this section, we performed experiments to see whether the Lebanon war theme had an impact on the amount of posting in the blogs dataset. To that end, we investigated the amount of posting before and during the war, and we examined whether the language usage in blog posts during that period was affected by the Lebanon war topic. The results demonstrated that the increase in posting from July 12th to August 14th happened as a consequence of the increased interest in blogging about the Lebanon war.

 

 

 
