The Arabic blogs dataset
The dataset consists of approximately 12,060 blogs containing more than 120,300 posts. The oldest blog post in the dataset dates back to 2002. The dataset also contains related blog data: information about each blog, such as its title, description, and URL, and data about each blog post, such as the author's name, the date of publication, and the content of the post.
Each item in the blogs dataset stores the data of one weblog. In our case, each item has properties related to the weblog itself and to its posts. Fig. 1 shows a sample Arabic weblog from Blogger.com, and Fig. 2 illustrates its processed form: a structured model that encapsulates all the weblog's properties.
Fig.1 Arabic weblog sample
Fig.2 the blog in the previous figure converted to a dataset item
Blog properties (in the dataset):
Blog properties provide information about the blog. They fall into two parts: the first is associated with the blog itself (URL, title, description), and the second is related to the blog's posts.
Fig.3 The blog properties
Blog post properties
Blog post properties describe a blog entry. These are the post content, the publication date, and the author's name (usually the blogger's name). An item of this dataset doesn't contain any information about a post's comments or their properties.
Fig.4 The blog post properties
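The item structure described above can be sketched as two small record types. This is only an illustration under assumptions: the class and field names (`BlogItem`, `BlogPost`, `published`, and so on) are hypothetical and are not taken from the dataset's actual schema, and the example values are made up.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BlogPost:
    """One blog entry; comments are not part of a dataset item."""
    author: str      # usually the blogger's name
    published: str   # publication date, kept here as a plain string
    content: str     # text of the post


@dataclass
class BlogItem:
    """One dataset item: blog-level properties plus the blog's posts."""
    url: str
    title: str
    description: str
    posts: List[BlogPost] = field(default_factory=list)


# Hypothetical example values, for illustration only.
item = BlogItem(
    url="http://example.blogspot.com/",
    title="Example blog",
    description="A sample Arabic weblog",
)
item.posts.append(
    BlogPost(author="blogger", published="2006-06-15", content="...")
)
print(len(item.posts))
```

Keeping the posts as a list inside the blog record mirrors the two-part layout of the properties: blog information first, post information second.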
The blogging platforms for Arabic blogs
The blogs dataset contains data from various sources. The blogs in the dataset come from four blogging platforms. Two of them, Maktoob and Jeeran, are popular Arabic blogging platforms. The other two, Blogger and Microsoft Network (MSN) Spaces, are non-Arabic platforms that nevertheless make it possible to create and manage blogs in Arabic.
Fig.5 The amount of blog data provided by each blogging platform
The figure above shows that 65% of the blogs in the dataset belong to the blogging platform Maktoob, and 25% belong to the worldwide blogging platform Blogger. Finally, Jeeran and Arabic MSN Spaces contribute 2% and 8% of the dataset, respectively.
The Language usage in Arabic blogs
Arabic blogs contain posts written in Arabic and in non-Arabic languages. After looking at several blog posts in the dataset, we found that Arabic blogs may contain posts written not only in Arabic but also in other languages, for instance English or French. To highlight this feature of the dataset, we measured the percentage of blog posts written in Arabic, English, and French. For this purpose, we applied a method to detect the language of blog posts. Namely, we used an implementation of the text categorization algorithm presented by Cavnar et al., known as the Text Language Categorizer (TextCat). We obtained the results shown in Table 1 and Figure 5:
| Posts (in languages) | Arabic | English | French | Other |
| Number of posts in the dataset | 36251 | 13954 | 2272 | 4442 |
Table 1 and Fig. 5 show the results obtained by applying the language categorizer TextCat to our dataset
- TextCat is an implementation of the text categorization algorithm presented in Cavnar, W. B. and J. M. Trenkle, "N-Gram-Based Text Categorization", in Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, NV, UNLV Publications/Reprographics, pp. 161-175, 11-13 April 1994. For more details, see: http://www.let.rug.nl/~vannoord/TextCat/
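To give an idea of how N-gram-based text categorization works, here is a much simplified sketch in the spirit of Cavnar and Trenkle's method. Each text is reduced to a ranked list of its most frequent character n-grams, and a document is assigned to the language whose profile has the smallest sum of rank differences. This is not TextCat's code: the function names, the profile length of 300, and the tiny training samples are assumptions made for illustration.

```python
from collections import Counter

TOP_K = 300  # assumed profile length; also the penalty for unseen n-grams


def ngram_profile(text, max_n=5, top_k=TOP_K):
    """Rank the most frequent character n-grams (n = 1..max_n) of a text."""
    counts = Counter()
    for token in text.lower().split():
        padded = f"_{token}_"  # underscores mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Most frequent first; ties are broken alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile, max_penalty=TOP_K):
    """Sum of rank differences between two profiles (smaller is closer)."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )


def detect_language(text, profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


# Tiny made-up training samples; a real categorizer needs far larger corpora.
training = {
    "arabic": "هذه مدونة عربية عن الاخبار والحياة اليومية في العالم العربي",
    "english": "this is a blog post written in the english language "
               "about the news and daily life",
}
profiles = {lang: ngram_profile(sample) for lang, sample in training.items()}

print(detect_language("مرحبا بكم في المدونة", profiles))
print(detect_language("welcome to the blog", profiles))
```

With samples this small, the sketch can only separate languages that differ strongly, such as Arabic and English. Posts made up mostly of hyperlinks and symbols match no profile well, which is consistent with such posts ending up in the "Other" category.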
The results show that about 63% of the posts in the dataset are written in Arabic, followed by 25% written in English and 4% written in French. The category "Other" represents 8% of the posts. Posts labeled "Other" include posts in other languages, such as Spanish, as well as posts whose text consists mainly of hyperlinks and symbols.
We have seen that Arabic bloggers do not publish in Arabic only. Bloggers from Middle Eastern countries sometimes publish posts in English, while bloggers from North Africa tend to write in French as an alternative to Arabic. There are several reasons for this. First, Arab bloggers may prefer to write in English in order to reach non-Arabic-speaking readers with their opinions on many issues, especially cultural and political ones. Second, blog service providers that support creating blogs in Arabic were not available earlier; these services emerged only a couple of years ago. Finally, some Arabic bloggers believe that it is trendy to publish in a foreign language (especially English) instead of Arabic.
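The percentages quoted above can be recomputed from the counts in Table 1. The short sketch below does only that arithmetic; the variable names are illustrative.

```python
# Post counts per language, copied from Table 1.
counts = {"Arabic": 36251, "English": 13954, "French": 2272, "Other": 4442}

total = sum(counts.values())
shares = {lang: 100 * n / total for lang, n in counts.items()}

for lang, pct in shares.items():
    print(f"{lang}: {pct:.1f}%")
```

Note that the shares are relative to the posts covered by Table 1, not to the full number of posts in the dataset.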
Statistics over the Arabic blogs dataset
Most blogs in the dataset were added in 2006. The figure below shows the number of new blogs added to the dataset each year from 2002 to 2006.
Fig 6. The amount of newly added blogs in the dataset (per year)
Fig. 6 shows that 80% of the new blogs were added in 2006, the highest number recorded over the five years. New blogs added in 2005 represent a further 16.5% of all blogs. Thus, the majority of blogs in the dataset emerged recently. The figure below shows the number of new blogs added to the dataset per month during 2006:
Fig.7 The amount of new blogs added to the dataset (per month)
Most of the new blogs appeared in May, June, and July 2006. The highest number was recorded in June, with about 1533 new blogs. This is because the system for building the dataset (section 3.5) relies on blog syndication feeds. The system began operating successfully at the end of June 2006 and completed its support for all blog service providers in July 2006, so the monitoring of blogs ran properly and regularly during that period.
Blogs in the dataset are also frequently updated, as Fig. 8 shows. The results in this figure indicate that the composition rate of posts increased considerably in 2006 compared with 2005.
Fig.6 The amount of newly added blogs in the dataset (per year)
Further, the following figure shows the monthly composition rate in the dataset from January 2007 until July 2007:
Fig.7 The composition rate of blog posts in the dataset (per month)
We see that the composition rate changed during 2006 and the beginning of 2007. The rate was very high during July and August. There are two reasons for this. First, most of the posts that we collected from syndication feeds are dated 2006. Second, the major events that took place in the Arab world contributed to the spread of blogging during 2006. For instance, the period of the Lebanon War (July-August 2006) boosted posting to its highest values.
The link structure of blogs in the Arabic blogs dataset
Basic measurements of the link structure of the Arabic weblogs collection are significant because they allow us to understand various issues concerning Arabic blogger communities and their distribution across the different Arabic weblog service providers.
We aim to give simple statistics about the link structure of the Arabic blogs dataset and to find out which blogging platforms host the blogs with the most out-links. We started by collecting all the out-links of each blog and structuring them as shown in Fig. 8:
Fig.8 XML showing the structure of blog with out-links
Afterwards, we randomly selected 100 blogs from our data, manually counted the number of out-links of each one, and calculated the average number of out-links for the selected blogs. The result was:
Avg. out-links for 100 weblogs from the Arabic dataset: 8.92 links per blog
Further, we repeated the measurement on 100 weblogs from each of the following blog service providers: Maktoob, Jeeran, Blogspot, and Msn spaces. The table below shows the results:
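The counting in this experiment was done manually, but the same average could be computed automatically once the out-links of each blog are available. The sketch below assumes a hypothetical mapping from blog URL to its list of out-links; the function name, the fixed random seed, and the toy data are illustrative and not part of the original experiment.

```python
import random
from typing import Dict, List


def average_outlinks(blogs: Dict[str, List[str]],
                     sample_size: int = 100,
                     seed: int = 0) -> float:
    """Average number of out-links over a random sample of blogs."""
    if not blogs:
        raise ValueError("no blogs to sample from")
    rng = random.Random(seed)   # fixed seed keeps the sample reproducible
    urls = sorted(blogs)        # stable order before sampling
    sample = rng.sample(urls, min(sample_size, len(urls)))
    return sum(len(blogs[url]) for url in sample) / len(sample)


# Toy data: three hypothetical blogs with 2, 0, and 4 out-links.
toy = {
    "http://a.example/": ["http://x.example/", "http://y.example/"],
    "http://b.example/": [],
    "http://c.example/": ["http://x.example/"] * 4,
}
print(average_outlinks(toy))
```

When fewer blogs are available than the requested sample size, the sketch simply averages over all of them.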
| Blog service provider | Avg. number of out-links | Min. number of links | Max. number of links |
| Maktoob | 0.43 | 0 | 4 |
| Jeeran | 3.58 | 0 | 26 |
| Msn spaces | 9.91 | 0 | 358 |
| Blogspot | 12.07 | 0 | 125 |
Table 2. Average number of out-links for blog service providers
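The three columns of Table 2 are simple summary statistics over the out-link counts of the sampled blogs. As a minimal sketch, assuming the counts are available as one list of integers per provider, they could be computed as follows. The provider names and numbers in the example are placeholders, not the measured data.

```python
from statistics import mean
from typing import Dict, List


def outlink_summary(counts_by_provider: Dict[str, List[int]]) -> Dict[str, dict]:
    """Average, minimum, and maximum number of out-links per provider."""
    return {
        provider: {
            "avg": mean(counts),
            "min": min(counts),
            "max": max(counts),
        }
        for provider, counts in counts_by_provider.items()
    }


# Placeholder counts for two hypothetical providers.
example = {
    "provider_a": [0, 0, 1, 3],
    "provider_b": [0, 5, 10, 25],
}
for provider, stats in outlink_summary(example).items():
    print(provider, stats)
```

Reporting the minimum and maximum next to the average is useful because a few heavily linked blogs can dominate the mean. The gap between the Msn spaces average and its maximum of 358 in Table 2 is an example.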
The results show that both Blogspot and Msn spaces have blogs that are rich in out-links, with averages of about 12 and 9.9 links per blog, respectively. Jeeran shows an average of 3.58 links per blog, and Maktoob has a poor out-link average of 0.43. To explain the poor results for Maktoob, we provide another table giving the percentage of single blogs for each blog service provider under the same experimental settings.
| Blog service provider | Percent of single blogs | Percent of non-single blogs |
| Maktoob | 72% | 28% |
| Jeeran | 46% | 54% |
| Msn spaces | 32% | 68% |
| Blogspot | 24% | 76% |
Table 3. The percent of single blogs and non-single blogs for blog service providers
Table 3 shows that 72% of the 100 blogs chosen randomly from the Maktoob blogs in the dataset are single blogs. This result explains the poor out-link average for that blog service provider. For the other providers (Jeeran, Msn spaces, and Blogspot), more than half of the blogs are not single, which explains the considerable out-link values that we obtained earlier. In conclusion, more advanced studies of the link structure of the Arabic weblogs dataset are recommended for data collected from weblog service providers such as Blogspot, Msn spaces, and Jeeran.
Summary
We provided various measurements on the Arabic blogs dataset with the aim of characterizing Arabic blogs. We believe that our measurements help to highlight certain characteristics of Arabic blogs, for the following reasons:
- The size of the Arabic blogs dataset is considerable, since it represents 30% of the Arabic blogspace (40,000 blogs according to HRInfo).
- The growth of the Arabic blogs dataset is continuous. The average growth in 2006 was approximately 750 new blogs per month, with an average of 400 new blog posts per month in the whole collection.
- The Arabic blogs dataset is associated with popular blogging platforms, and some of these platforms are the leaders in providing blogging services in the Arab world, for instance: Maktoob and Jeeran.
- The language usage in Arabic blogs varies. The Arabic blogs dataset captures blogs of Arabic bloggers writing in Arabic and in non-Arabic languages.
- The Arabic blogs dataset contains posts on a variety of topics, covering Arabic and non-Arabic affairs as well as the personal and general interests of individuals.
- The Arabic blogs in the dataset represent one or more blog communities in the Arabic blogspace.
Download the Arabic blogs dataset
The Arabic blogs dataset can be downloaded from the Arabic blogs page on the ILPS group web page, University of Amsterdam. Click here
You can also download the dataset from my home page by clicking on the following link