Arabic weblogs wiki

 

System for collecting Arabic news content



 


 

Background information

  • The System for collecting Arabic news content accesses Arabic news portals and collects their news content using RSS feeds. News portals such as Aljazeera.net and BBCArabic.com organize their content under news categories. Our system therefore builds a local repository of news categories and stores each piece of news content under its category. The system runs on a daily basis. On each run it updates the local feeds index, processes the news feeds, and extracts the links to news content. Afterward, it processes the news content and stores it locally.
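
As a rough illustration of the feed-processing step described above, the following sketch parses an RSS 2.0 document with Python's standard library and pulls out the link of each news item. The sample feed, the example.com URLs, and the function name `extract_item_links` are hypothetical; the page does not say how the actual system is implemented.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample of a category feed (RSS 2.0). In a real run the feed
# would be downloaded from the news portal instead of being hard-coded.
SAMPLE_FEED = """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Sample category</title>
    <item>
      <title>First article</title>
      <link>http://example.com/news/1</link>
    </item>
    <item>
      <title>Second article</title>
      <link>http://example.com/news/2</link>
    </item>
  </channel>
</rss>"""


def extract_item_links(feed_xml):
    """Return the <link> text of every <item> in an RSS 2.0 document."""
    # Parse from bytes so the XML encoding declaration is honoured.
    root = ET.fromstring(feed_xml.encode("utf-8"))
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:
            links.append(link.strip())
    return links


if __name__ == "__main__":
    for url in extract_item_links(SAMPLE_FEED):
        print(url)
```

The same function could be applied to each category feed in turn, with the resulting links handed to the content-collection stage.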

 

System architecture

 

  • Two modules realize the main requirements of our system: the news feeds module and the news content module. The news feeds module is responsible for reading news feeds and processing them in order to obtain information about the content. The news content module takes the information about recently retrieved content and checks whether it is already present in the local feeds index. If this information is missing from the local index, the system requests the content from the news portals. This content is then processed and stored locally in the collection of news articles.

 

  • Following the system architecture, we explain the steps that the system goes through in order to perform the collecting task:

 

  1. The first steps are handled within the news feeds module. This module has two sub-modules: the feed reader and the feed processing sub-module. Initially, the feed reader uses a URL to connect to the main news feed. It then retrieves the URLs of all category feeds and passes them to the processing sub-module.
  2. When the processing sub-module receives the URLs of all category feeds, it gathers them for further use.
  3. Once the URLs are grouped, the sub-module queries the news sites and retrieves the news feed for each category. It then processes these feeds and extracts the URLs of the news content.
  4. The URLs that point to content are handled within the news content module. This module has two sub-modules that are responsible for collecting, processing, and locally storing the news content.
  5. When the collecting sub-module receives the URLs, it checks whether they exist in the local index. A URL that appears in the index indicates that its news content is already saved locally in the news repository, so no further processing is needed. When a URL is not listed in the index, the system requests the new content.
  6. The system requests the content for the missing URLs. First, it connects to the news sites and requests the web pages of the missing content. Then, it stores this content and updates the local index accordingly.
  7. The processing sub-module processes the HTML data that was stored earlier. Essentially, it extracts the news content from the raw web pages.
  8. In the last step of the collection, the news content module delivers the clean content to be stored in the news articles collection.
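
The work of the news content module (checking the local index, requesting missing pages, and extracting clean text) can be sketched roughly as follows. This is only an illustration under stated assumptions: the local index is modelled as an in-memory set, the news articles collection as a dictionary, the download step as an injected `fetch` callable, and the article body is assumed to sit inside `<p>` elements. The names `ParagraphExtractor`, `extract_text`, and `collect` are hypothetical and do not come from the system itself.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text inside <p> elements (assumed to hold the article body)."""

    def __init__(self):
        super().__init__()
        self._in_paragraph = False
        self._buffer = []
        self.paragraphs = []

    def _flush(self):
        # Store the buffered paragraph text, skipping empty paragraphs.
        text = "".join(self._buffer).strip()
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            if self._in_paragraph:
                self._flush()  # previous <p> was never closed
            self._in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            self._flush()
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph:
            self._buffer.append(data)


def extract_text(raw_html):
    """Return the paragraph text of a raw web page, one paragraph per line."""
    parser = ParagraphExtractor()
    parser.feed(raw_html)
    parser.close()
    return "\n".join(parser.paragraphs)


def collect(content_urls, index, fetch, articles):
    """Fetch, clean, and store content for URLs missing from the local index.

    index    -- set of URLs already collected (stand-in for the local index)
    fetch    -- callable mapping a URL to raw HTML (stand-in for the download)
    articles -- dict mapping URL to clean text (stand-in for the collection)
    Returns the list of URLs that were newly collected.
    """
    new_urls = []
    for url in content_urls:
        if url in index:
            continue  # already stored locally, nothing to do
        raw_html = fetch(url)
        index.add(url)
        articles[url] = extract_text(raw_html)
        new_urls.append(url)
    return new_urls


if __name__ == "__main__":
    # Hypothetical page standing in for a news portal response.
    pages = {"http://example.com/news/1": "<html><body><h1>Title</h1><p>Body text.</p></body></html>"}
    index, articles = set(), {}
    print(collect(list(pages), index, pages.__getitem__, articles))
    print(articles)
```

Passing `fetch` as a parameter keeps the sketch free of network access; a real collector would replace it with an HTTP request and persist the index and articles on disk.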
