DISTRIBUTED Analytics system  for arabic search engines users' data using ASSOCIATIVE Classification mining 



Ramzi  Alqrainy
Opensooq.com


agenda


·  Motivation

·  Contributions

·  DAS Architecture 

 (Components, Algorithms,  and Analysis)

·  Experimentation Setup and Results 

·  The Performance of DAS Under Failure Scenarios

·  Conclusion and Future Work

·  References 


Data is the new oil

  • Today, the information stored in digital data archives is enormous and its size is still growing very rapidly.


DATA IS THE NEW OIL


  • The main problem is not storing DATA, it is analyzing, mining,    and processing DATA . [Talia, 2012]
 

  • Bigger and more complex problems must be solved by distributed computing.

DAS for arabic search engine


  • is the study of the behavior of searcher. 

  • Main Features :

    • Measurement of Arabic Search Engines (Precision, Recall, and F-measure)


    • Personalized Arabic Search

    • Statistical Report Analysis  based on Associative Classification Mining (ACM)

    Contributions

    • Designing and implementing a distributed system for analyzing big Arabic data.

    • Implementing  Arabic Statistical Report Analysis  based on Associative Classification Mining (ACM)


    • Implementing algorithms for pre-processing Arabic events
      • Normalization
      • Light Stemming
      • Stopwords
      • Synonyms

    • Handling some server failure scenarios .

    Contributions (cont.)


    • Using Apache Lucene and Apache Zookeeper.

    • Evaluating The speed of DAS analytically .

    • Comparing  with Elasticsearch in terms of Index Size, Speed, Throughput, CPU Utilization, RAM Usage, Transfer Rate, and Failed Requests .

    • Studying the DAS speedup .




    DAS ARCHITECTURE 

    DAS Architecture


    DAS consists of 3 subsystems : 

    1.  Logging and Archiving Sub system (LAS)


    2.  Analytics Sub system (AS)


    3. User Interface (UI)





     

    LAS





    • Written in Python and                                                                                       SQLite.


    • The events are stored in                                                                  separate table, one table per day.

    AS


    • Distributed system written in JAVA  8

    •  Based on Apache Lucene.

    • Uses Apache Zookeeper

    • Preprocessing Arabic Events

    • There are 2 main processes : Distributed Indexing and Distributed Requesting.

    Apache lucene


    Apache Lucene is a free/open source information retrieval software library, originally written in Java by Doug Cutting.

    • Lucene Index Overview [McCandless, 2010]
      • A Lucene index covers a set of documents
      • A document is a sequence of fields
      • A field is a sequence of terms
      • A term is a text string
      • A Lucene index consists of one or more segments
      • Each segment covers a set of documents
      • Each segment is a fully independent index

    Classification by Association Rule Analysis on lucene


    • There are 2 main methods are implemented in Lucene

    1. Classification-Based Association (CBA)
    2. Classification based on Multiple Association Rules (CMAR)

    Preprocessing Arabic events


    Arabic normalization and light stemming alg.


    1. Remove punctuation marks.
    2. Remove diacritics (such as شده ، ضمه ، فتحه). 
    3. Remove non letters .
    4. Replace أ،إ،ا with ا.
    5. Replace final ى with ي.
    6. Replace final ةwith ه.
    7. Remove prefix  فال ، و، ال ، وال ، بال .

    Apache zookeeper


    ZooKeeper is a service for maintaining configuration information, naming, and providing distributed synchronization.


    DAS uses ZooKeeper as a system of record for the cluster state, for central config, and for leader election. [Junqueira, 2013]

    AS with zookeeper

    AS core





    Distributed indexing and requesting

    Distributed Indexing




    Distributed Requesting





    Analysis of distributed index and requesting

    ANALYSIS OF DISTRIBUTED INDEX AND REQUESTING

    • Distributed Indexing



    • Distributed Requesting


    UI


    • The user interface subsystem is an interface that allows the business owner to deal with statistical data.
     
    • Written in PHP and Yii framework.
     

    • Support Responsive Web Design (RWD) using Twitter Bootstrap







    Experimentation setup and results 

    setup environment

    • Data from Opensooq.com

    • The Analytics Subsystem is implemented in a distributed fashion on 4 identical servers, 2 leaders and 2 replicas. 

    • These servers were provided from Amazon Elastic Compute Cloud (Amazon EC2) which provides a resizable compute capacity in the cloud locates in Oregon with further information.


    ElasticSearch (ES)


    Elasticsearch is a flexible and powerful open source, distributed, real-time search and analytics engine. 


    Elasticsearch uses Lucene under the covers to provide the most powerful full text search capabilities available in any open source product. [Kuc, 2013]


    ES Case Studies


    The metrics we compared das with es


    1. Index Size
    2. Speed
    3. Throughput
    4. CPU Utilization
    5. RAM Usage
    6. Transfer Rate
    7. Failed Requests

    index size


    speed




    throughput




    CPU and Ram utilization



    Transfer rate 




    Failed Requests






    The performance of DAS under Failure scenarios

    FAILURE Scenarios


    • We have taken 4 scenarios to test the availability and fault tolerance :
    1. Leader 1 Failure
    2. Leader 1 and Leader 2 Failure (i.e.  both leaders failures only)
    3. Leader 1 and Replica 2 Failure (i.e. mismatching leader and replica failure).
    4. Replica 1 and Replica 2 Failure (i.e. the failure of both replicas).

    Speed in the Presence of Failures



    Percentage of Failed requests in the Presence of Failures


    throughput






    Why go with a distributed solution ?

    speedup






    CONCLUSION AND FUTURE WORK

    CONCLUSION 



    • DAS’s index size was 24% smaller than ES’s index.
     

    • The time per request achieved by DAS was 21% faster than ES’s time.

    • ES turned out to be more memory efficient and used 67% of the memory used by DAS on average
     
    • ES’s CPU consumption was 2.4 times  of the CPU consumed by DAS

    Future work



    • Distributed index with Apache Hadoop . 

    • DAS with Redis .


    REFERENCES 


    • Junqueira Flavio and Reed Benjamin . (2013) ,ZooKeeper: Distributed process coordination. (1st ed). USA. O'Reilly Media. 20-134.
    • Kuc Rafal and Rogozinski Marek. (2013), Mastering ElasticSearch. (1st ed).UK. Packt Publishing Ltd, (ppt. 45-156).
    • McCandless Michael, Hatcher Erik, Gospodnetic Otis. (2010) Lucene in Action. (2nd  ed), U.K. Manning Publications.
    • Talia Domenico. (2012) Distributed Data Mining Tasks and Patterns as Services. DPA workshop.




    Thank you for your attention