Distributed Analytics System for Arabic Search Engine Users' Data Using Associative Classification Mining

Ramzi  Alqrainy


·  Motivation

·  Contributions

·  DAS Architecture (Components, Algorithms, and Analysis)

·  Experimentation Setup and Results 

·  The Performance of DAS Under Failure Scenarios

·  Conclusion and Future Work

·  References 

Data is the new oil

  • Today, the information stored in digital data archives is enormous and its size is still growing very rapidly.


  • The main problem is not storing data but analyzing, mining, and processing it [Talia, 2012].

  • Bigger and more complex problems must be solved by distributed computing.

DAS for Arabic Search Engines

  • DAS studies the behavior of searchers.

  • Main features:

    • Measurement of Arabic Search Engines (Precision, Recall, and F-measure)

    • Personalized Arabic Search

    • Statistical Report Analysis based on Associative Classification Mining (ACM)
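The three search-quality measures named above can be computed per query from the set of returned results and the set of results judged relevant. The following is a minimal sketch, not the DAS implementation; the function name and the document identifiers are illustrative assumptions.

```python
def precision_recall_f1(retrieved, relevant):
    """Return (precision, recall, F-measure) for one query.

    retrieved: ids returned by the search engine (hypothetical ids)
    relevant:  ids judged relevant for the query (hypothetical ids)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Guard against empty sets so the ratios are always defined.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    denom = precision + recall
    # F-measure is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up example: 4 results returned, 3 documents relevant, 2 in common.
p, r, f = precision_recall_f1(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
```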


Contributions

    • Designing and implementing a distributed system for analyzing big Arabic data.

    • Implementing Arabic Statistical Report Analysis based on Associative Classification Mining (ACM).

    • Implementing algorithms for preprocessing Arabic events:
      • Normalization
      • Light stemming
      • Stopwords
      • Synonyms

    • Handling some server failure scenarios.

    Contributions (cont.)

    • Using Apache Lucene and Apache ZooKeeper.

    • Evaluating the speed of DAS analytically.

    • Comparing DAS with Elasticsearch in terms of index size, speed, throughput, CPU utilization, RAM usage, transfer rate, and failed requests.

    • Studying the DAS speedup.


    DAS Architecture

    DAS consists of three subsystems:

    1. Logging and Archiving Subsystem (LAS)

    2. Analytics Subsystem (AS)

    3. User Interface (UI)



    Logging and Archiving Subsystem (LAS)

    • Written in Python and SQLite.

    • Events are stored in separate tables, one table per day.
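The one-table-per-day layout can be illustrated with Python's standard `sqlite3` module. This is a sketch under stated assumptions: the table naming scheme (`events_YYYY_MM_DD`), the columns, and the helper names are hypothetical and are not taken from the DAS code.

```python
import sqlite3
from datetime import date


def table_for(day):
    # Hypothetical naming scheme: one table per calendar day.
    return "events_" + day.strftime("%Y_%m_%d")


def log_event(conn, day, user_id, query):
    """Append one search event to the table of the given day."""
    table = table_for(day)
    # The table name is derived from a date object, never from user input,
    # so building it into the SQL text is safe here.
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} "
        "(id INTEGER PRIMARY KEY, user_id TEXT, query TEXT)"
    )
    conn.execute(
        f"INSERT INTO {table} (user_id, query) VALUES (?, ?)",
        (user_id, query),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
log_event(conn, date(2015, 1, 31), "u1", "سيارات")
log_event(conn, date(2015, 2, 1), "u2", "شقق")
```

Keeping each day in its own table means a day's events can be archived or dropped without touching the rest of the log.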


    Analytics Subsystem (AS)

    • Distributed system written in Java 8.

    • Based on Apache Lucene.

    • Uses Apache ZooKeeper.

    • Preprocesses Arabic events.

    • Has two main processes: distributed indexing and distributed requesting.

    Apache Lucene

    Apache Lucene is a free/open source information retrieval software library, originally written in Java by Doug Cutting.

    • Lucene Index Overview [McCandless, 2010]
      • A Lucene index covers a set of documents
      • A document is a sequence of fields
      • A field is a sequence of terms
      • A term is a text string
      • A Lucene index consists of one or more segments
      • Each segment covers a set of documents
      • Each segment is a fully independent index
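The containment hierarchy above (index, segments, documents, fields, terms) can be mirrored in a toy data model. The classes below are purely illustrative and are not Lucene's API; whitespace tokenization and the example field names are assumptions.

```python
from collections import defaultdict


class Segment:
    """A fully independent mini index over its own set of documents."""

    def __init__(self):
        self.docs = []                    # each document: {field name: text}
        self.postings = defaultdict(set)  # (field, term) -> local document ids

    def add(self, doc):
        doc_id = len(self.docs)
        self.docs.append(doc)
        for field, text in doc.items():
            # A field is treated as a sequence of whitespace-separated terms.
            for term in text.split():
                self.postings[(field, term)].add(doc_id)

    def search(self, field, term):
        ids = self.postings.get((field, term), ())
        return [self.docs[i] for i in sorted(ids)]


class Index:
    """An index is one or more segments; a search consults every segment."""

    def __init__(self):
        self.segments = []

    def add_segment(self, docs):
        segment = Segment()
        for doc in docs:
            segment.add(doc)
        self.segments.append(segment)

    def search(self, field, term):
        return [d for seg in self.segments for d in seg.search(field, term)]


index = Index()
index.add_segment([{"query": "used cars", "city": "amman"}])
index.add_segment([{"query": "new cars", "city": "irbid"}])
matches = index.search("query", "cars")  # one hit from each segment
```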

    Classification by Association Rule Analysis on Lucene

    • Two main methods are implemented on top of Lucene:

    1. Classification-Based Association (CBA)
    2. Classification based on Multiple Association Rules (CMAR)
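To make the first method concrete, here is a heavily simplified sketch of the idea behind CBA: mine rules of the form "itemset → class" that pass minimum support and confidence thresholds, rank them, and classify a new record with the first rule that matches. It omits CBA's database-coverage pruning and does not use Lucene. The thresholds, attribute names, and class labels are made-up assumptions.

```python
from collections import Counter
from itertools import combinations


def mine_rules(rows, min_support=0.2, min_confidence=0.6, max_len=2):
    """rows: list of (set of items, class label). Returns ranked rules."""
    n = len(rows)
    itemset_count = Counter()  # how often each itemset occurs
    rule_count = Counter()     # how often each itemset occurs with a class
    for items, label in rows:
        for k in range(1, max_len + 1):
            for combo in combinations(sorted(items), k):
                itemset_count[combo] += 1
                rule_count[(combo, label)] += 1
    rules = []
    for (combo, label), count in rule_count.items():
        support = count / n
        confidence = count / itemset_count[combo]
        if support >= min_support and confidence >= min_confidence:
            rules.append((combo, label, support, confidence))
    # Rank by confidence, then support, then shorter antecedent first.
    rules.sort(key=lambda r: (-r[3], -r[2], len(r[0])))
    return rules


def classify(rules, items, default):
    """Return the class of the highest-ranked rule whose itemset matches."""
    for combo, label, _, _ in rules:
        if set(combo) <= items:
            return label
    return default


# Hypothetical search events: (attributes of the event, searched category).
events = [
    ({"mobile", "evening"}, "cars"),
    ({"mobile", "morning"}, "cars"),
    ({"desktop", "evening"}, "jobs"),
    ({"desktop", "morning"}, "jobs"),
    ({"mobile", "evening"}, "cars"),
]
rules = mine_rules(events)
prediction = classify(rules, {"mobile", "night"}, default="other")
```

CMAR differs mainly in that it combines several matching rules per record instead of relying on the single best one.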

    Preprocessing Arabic events

    Arabic normalization and light stemming algorithm

    1. Remove punctuation marks.
    2. Remove diacritics (such as شده (shadda), ضمه (damma), and فتحه (fatha)).
    3. Remove non-letters.
    4. Replace أ, إ, and آ with ا.
    5. Replace final ى with ي.
    6. Replace final ة with ه.
    7. Remove the prefixes فال, و, ال, وال, and بال.
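The seven steps above can be sketched for a single token as follows. This is an illustration, not the DAS code. It makes several assumptions: step 4 is read as mapping the hamza and madda forms of alef to bare alef, prefixes are tried longest first, and a prefix is stripped only if at least `min_stem_len` letters remain (a guard that is not in the slides).

```python
import re

# Tanwin, short vowels, shadda, and sukun (U+064B to U+0652).
DIACRITICS = re.compile("[\u064B-\u0652]")
# Anything that is not a basic Arabic letter (also drops tatweel, U+0640).
NON_LETTERS = re.compile("[^\u0621-\u063A\u0641-\u064A]")
# Prefixes from step 7, ordered longest first.
PREFIXES = ("وال", "بال", "فال", "ال", "و")


def light_stem(token, min_stem_len=2):
    """Normalize and lightly stem one Arabic token."""
    # Step 2: remove diacritics.
    token = DIACRITICS.sub("", token)
    # Steps 1 and 3: remove punctuation and every other non-letter.
    token = NON_LETTERS.sub("", token)
    # Step 4: unify alef variants (assumed reading of the rule).
    token = re.sub("[أإآ]", "ا", token)
    # Step 5: final alef maqsura becomes ya.
    token = re.sub("ى$", "ي", token)
    # Step 6: final ta marbuta becomes ha.
    token = re.sub("ة$", "ه", token)
    # Step 7: strip the first matching prefix, keeping a minimal stem.
    for prefix in PREFIXES:
        if token.startswith(prefix) and len(token) - len(prefix) >= min_stem_len:
            return token[len(prefix):]
    return token


def preprocess(text):
    """Apply light_stem to every whitespace-separated token of a query."""
    stems = (light_stem(t) for t in text.split())
    return [s for s in stems if s]
```

Trying the longer prefixes first keeps a token that begins with وال from losing only its leading و.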

    Apache ZooKeeper

    ZooKeeper is a service for maintaining configuration information, naming, and providing distributed synchronization.

    DAS uses ZooKeeper as a system of record for the cluster state, for central config, and for leader election. [Junqueira, 2013]
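The slides do not say how DAS performs leader election. A common ZooKeeper recipe has each candidate create an ephemeral sequential node, and the candidate holding the lowest sequence number acts as leader. The sketch below simulates that idea in memory only; `ElectionSim` is a hypothetical class and does not call the ZooKeeper API.

```python
import itertools


class ElectionSim:
    """In-memory stand-in for ephemeral sequential nodes (not ZooKeeper)."""

    def __init__(self):
        self._seq = itertools.count()
        self._nodes = {}  # sequence number -> server name

    def join(self, server):
        # Each candidate receives the next, strictly increasing number.
        seq = next(self._seq)
        self._nodes[seq] = server
        return seq

    def fail(self, server):
        # An ephemeral node disappears when its owner's session ends.
        self._nodes = {s: n for s, n in self._nodes.items() if n != server}

    def leader(self):
        # The candidate with the lowest sequence number leads.
        return self._nodes[min(self._nodes)] if self._nodes else None


election = ElectionSim()
election.join("server-a")
election.join("server-b")
first = election.leader()   # server-a joined first, so it leads
election.fail("server-a")
second = election.leader()  # server-b takes over
```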

    AS with ZooKeeper

    AS Core

    Distributed Indexing and Requesting

    Distributed Indexing

    Distributed Requesting

    Analysis of Distributed Indexing and Requesting

    • Distributed Indexing

    • Distributed Requesting


    User Interface (UI)

    • The user interface subsystem lets the business owner work with the statistical data.
    • Written in PHP with the Yii framework.

    • Supports Responsive Web Design (RWD) using Twitter Bootstrap.

    Experimentation Setup and Results

    Setup Environment

    • Data from Opensooq.com.

    • The Analytics Subsystem is deployed in a distributed fashion on four identical servers: two leaders and two replicas.

    • The servers were provisioned on Amazon Elastic Compute Cloud (Amazon EC2), which provides resizable compute capacity in the cloud, in the Oregon region.

    Elasticsearch (ES)

    Elasticsearch is a flexible and powerful open source, distributed, real-time search and analytics engine. 

    Elasticsearch uses Lucene under the covers to provide the most powerful full text search capabilities available in any open source product. [Kuc, 2013]

    ES Case Studies

    Metrics Used to Compare DAS with ES

    1. Index Size
    2. Speed
    3. Throughput
    4. CPU Utilization
    5. RAM Usage
    6. Transfer Rate
    7. Failed Requests
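Several of these metrics can be derived from the raw counters of a load-test run. The slides do not give their exact definitions, so the formulas below are common conventions assumed for illustration, and the numbers in the example are made up rather than measured.

```python
def summarize_run(total_requests, failed_requests, bytes_transferred, elapsed_seconds):
    """Derive per-run metrics from raw load-test counters (assumed definitions)."""
    completed = total_requests - failed_requests
    return {
        # Successfully served requests per second.
        "throughput_rps": completed / elapsed_seconds,
        # Mean wall-clock time spent per issued request, in milliseconds.
        "time_per_request_ms": 1000 * elapsed_seconds / total_requests,
        # Payload volume per second, in kilobytes.
        "transfer_rate_kb_s": bytes_transferred / 1024 / elapsed_seconds,
        # Share of requests that failed, as a percentage.
        "failed_pct": 100 * failed_requests / total_requests,
    }


# Made-up counters, for illustration only.
summary = summarize_run(
    total_requests=1000,
    failed_requests=50,
    bytes_transferred=2_048_000,
    elapsed_seconds=10,
)
```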

    Index Size

    CPU and RAM Utilization

    Transfer Rate

    Failed Requests

    The performance of DAS under Failure scenarios

    Failure Scenarios

    • We tested four scenarios to evaluate availability and fault tolerance:
    1. Leader 1 fails.
    2. Leader 1 and Leader 2 fail (i.e., both leaders fail, and only the leaders).
    3. Leader 1 and Replica 2 fail (i.e., a mismatched leader and replica fail).
    4. Replica 1 and Replica 2 fail (i.e., both replicas fail).
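One way to reason about these scenarios is to check which nodes remain for each part of the data. The sketch below assumes that Leader 1 and Replica 1 hold the same data, and likewise Leader 2 and Replica 2. That pairing is inferred from the node names and is not stated in the slides; the shard labels are hypothetical.

```python
# Assumed layout: each shard is held by one leader and its replica.
SHARDS = {
    "shard1": {"leader1", "replica1"},
    "shard2": {"leader2", "replica2"},
}

# The four tested scenarios, expressed as sets of failed nodes.
SCENARIOS = {
    1: {"leader1"},
    2: {"leader1", "leader2"},
    3: {"leader1", "replica2"},
    4: {"replica1", "replica2"},
}


def surviving(failed):
    """Nodes still able to serve each shard after the given failures."""
    return {shard: sorted(nodes - failed) for shard, nodes in SHARDS.items()}


def fully_available(failed):
    """True if every shard still has at least one live node."""
    return all(surviving(failed).values())


report = {n: fully_available(failed) for n, failed in SCENARIOS.items()}
```

Under this assumption every tested scenario leaves one live copy of each shard, whereas losing a leader together with its own replica would not.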

    Speed in the Presence of Failures

    Percentage of Failed requests in the Presence of Failures


    Why go with a distributed solution?




    Conclusion

    • DAS’s index was 24% smaller than ES’s index.

    • DAS served requests 21% faster than ES (time per request).

    • ES was more memory efficient, using on average 67% of the memory used by DAS.

    • ES’s CPU consumption was 2.4 times that of DAS.

    Future Work

    • Distributed index with Apache Hadoop.

    • DAS with Redis.


    References

    • Junqueira, Flavio and Reed, Benjamin. (2013). ZooKeeper: Distributed Process Coordination (1st ed.). USA: O'Reilly Media, pp. 20-134.
    • Kuc, Rafal and Rogozinski, Marek. (2013). Mastering ElasticSearch (1st ed.). UK: Packt Publishing Ltd, pp. 45-156.
    • McCandless, Michael, Hatcher, Erik, and Gospodnetic, Otis. (2010). Lucene in Action (2nd ed.). UK: Manning Publications.
    • Talia, Domenico. (2012). Distributed Data Mining Tasks and Patterns as Services. DPA workshop.

    Thank you for your attention