Distributed Analytics System for Arabic Search Engine Users' Data Using Associative Classification Mining

Ramzi Alqrainy

Opensooq.com

Agenda

· Motivation

· Contributions

· DAS Architecture

(Components, Algorithms, and Analysis)

· Experimentation Setup and Results

· The Performance of DAS Under Failure Scenarios

· Conclusion and Future Work

· References

Data is the new oil

Today, the information stored in digital data archives is enormous and its size is still growing very rapidly.

DATA IS THE NEW OIL

The main problem is not storing data; it is analyzing, mining, and processing it. [Talia, 2012]

Bigger and more complex problems must be solved by distributed computing.

DAS for Arabic Search Engines

DAS studies the behavior of searchers.

Main Features:

Measurement of Arabic Search Engines (Precision, Recall, and F-measure)

Personalized Arabic Search

Statistical Report Analysis based on Associative Classification Mining (ACM)
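
The three measurement metrics named above can be sketched for a single query as follows. This is a minimal illustration, not the DAS implementation; the function name and the document IDs are hypothetical.

```python
def precision_recall_f1(retrieved, relevant):
    """Return (precision, recall, F-measure) for one query.

    retrieved: IDs returned by the search engine (hypothetical input).
    relevant:  IDs judged relevant for the query (hypothetical input).
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    # F-measure is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


precision, recall, f1 = precision_recall_f1(
    ["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"]
)
```

Here two of the four retrieved documents are relevant (precision 0.5) and two of the three relevant documents were found (recall 2/3).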

Contributions

Designing and implementing a distributed system for analyzing big Arabic data.

Implementing Arabic Statistical Report Analysis based on Associative Classification Mining (ACM)

Implementing algorithms for pre-processing Arabic events

Normalization
Light Stemming
Stopwords
Synonyms

Handling some server failure scenarios.

Contributions (cont.)

Using Apache Lucene and Apache ZooKeeper.

Evaluating the speed of DAS analytically.

Comparing DAS with Elasticsearch in terms of index size, speed, throughput, CPU utilization, RAM usage, transfer rate, and failed requests.

Studying the DAS speedup.

DAS ARCHITECTURE

DAS Architecture

DAS consists of three subsystems:

1. Logging and Archiving Subsystem (LAS)

2. Analytics Subsystem (AS)

3. User Interface (UI)

LAS

Written in Python and SQLite.

Events are stored in separate tables, one table per day.
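
A one-table-per-day layout could look roughly like the sketch below, using Python's built-in sqlite3 module. The table naming scheme, the column set, and the function names are assumptions made for illustration; the slides do not describe the actual LAS schema.

```python
import sqlite3
from datetime import date


def table_for(day):
    # Hypothetical naming scheme: events_YYYYMMDD.
    # Table names cannot be bound as SQL parameters, so the name is built
    # from a date object only, never from user input.
    return f"events_{day:%Y%m%d}"


def log_event(conn, day, user_id, query):
    """Create the day's table on first use, then append one search event."""
    table = table_for(day)
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} ("
        "id INTEGER PRIMARY KEY, user_id TEXT, query TEXT)"
    )
    conn.execute(
        f"INSERT INTO {table} (user_id, query) VALUES (?, ?)",
        (user_id, query),
    )
    conn.commit()
    return table


conn = sqlite3.connect(":memory:")  # in-memory database for the example
log_event(conn, date(2015, 1, 1), "u1", "cars")
log_event(conn, date(2015, 1, 2), "u2", "flats")
```

Keeping each day in its own table makes it cheap to archive or drop a whole day of events at once.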

AS

Distributed system written in Java 8.

Based on Apache Lucene.

Uses Apache ZooKeeper.

Preprocesses Arabic events.

There are two main processes: distributed indexing and distributed requesting.
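
The two processes can be pictured with a toy model: indexing routes each event to one shard, and requesting fans out to every shard and merges the partial results. The hash-based routing, the Shard class, and the term-count request below are all assumptions chosen for illustration; the slides do not specify how DAS routes or merges.

```python
import zlib
from collections import Counter


class Shard:
    """Toy stand-in for one index node (hypothetical)."""

    def __init__(self):
        self.docs = {}

    def index(self, doc_id, terms):
        self.docs[doc_id] = list(terms)

    def term_counts(self):
        counts = Counter()
        for terms in self.docs.values():
            counts.update(terms)
        return counts


def shard_for(doc_id, num_shards):
    # crc32 gives a hash that is stable across runs, unlike built-in hash().
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def distributed_index(shards, doc_id, terms):
    """Distributed indexing: send the document to exactly one shard."""
    shards[shard_for(doc_id, len(shards))].index(doc_id, terms)


def distributed_request(shards, top_n):
    """Distributed requesting: query every shard and merge the partial counts."""
    total = Counter()
    for shard in shards:
        total.update(shard.term_counts())
    return total.most_common(top_n)


shards = [Shard(), Shard()]
distributed_index(shards, "d1", ["car", "toyota"])
distributed_index(shards, "d2", ["car", "flat"])
distributed_index(shards, "d3", ["car"])
top_terms = distributed_request(shards, 1)
```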

Apache Lucene

Apache Lucene is a free/open source information retrieval software library, originally written in Java by Doug Cutting.

Lucene Index Overview [McCandless, 2010]

A Lucene index covers a set of documents
A document is a sequence of fields
A field is a sequence of terms
A term is a text string
A Lucene index consists of one or more segments
Each segment covers a set of documents
Each segment is a fully independent index

Classification by Association Rule Analysis on Lucene

Two main methods are implemented on top of Lucene:

Classification-Based Association (CBA)

Classification based on Multiple Association Rules (CMAR)
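
The core idea behind CBA can be sketched in a few lines: mine rules of the form "itemset implies class" that meet minimum support and confidence, rank them, and classify a new record with the first rule that matches. This is a simplified illustration under stated assumptions, not the DAS code: it enumerates small itemsets by brute force instead of Apriori-style pruning, and it omits CBA's database-coverage pruning. CMAR, which combines several matching rules, is not shown. All names and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def mine_rules(rows, min_support, min_confidence, max_len=2):
    """rows: list of (set_of_items, label). Returns rules in CBA-style order."""
    n = len(rows)
    itemset_counts = Counter()
    rule_counts = Counter()
    for items, label in rows:
        for k in range(1, max_len + 1):
            for combo in combinations(sorted(items), k):
                itemset_counts[combo] += 1
                rule_counts[(combo, label)] += 1
    rules = []
    for (combo, label), count in rule_counts.items():
        support = count / n                         # share of rows matching the rule
        confidence = count / itemset_counts[combo]  # share of matching rows with label
        if support >= min_support and confidence >= min_confidence:
            rules.append((combo, label, support, confidence))
    # Higher confidence first, then higher support, then shorter antecedent.
    rules.sort(key=lambda r: (-r[3], -r[2], len(r[0]), r[0], r[1]))
    return rules


def classify(rules, items, default):
    """Return the label of the first rule whose itemset is contained in items."""
    for combo, label, _, _ in rules:
        if set(combo) <= items:
            return label
    return default


rows = [
    ({"car", "toyota"}, "cars"),
    ({"car", "for_sale"}, "cars"),
    ({"flat", "for_sale"}, "real_estate"),
    ({"flat", "for_rent"}, "real_estate"),
]
rules = mine_rules(rows, min_support=0.5, min_confidence=0.8)
label = classify(rules, {"car", "for_rent"}, default="other")
```

In this sample, "for_sale" appears with both classes, so its confidence is only 0.5 and it yields no rule; "car" and "flat" each predict their class with confidence 1.0.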

Preprocessing Arabic events

Arabic Normalization and Light Stemming Algorithm

Remove punctuation marks.
Remove diacritics (such as shadda, damma, and fatha: شده ، ضمه ، فتحه).
Remove non-letters.
Replace أ، إ، آ with ا.
Replace final ى with ي.
Replace final ة with ه.
Remove the prefixes و، ال، وال، بال، فال.
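
The steps above can be sketched with Python's re module. This is an illustrative sketch, not the DAS code. Several details are assumptions: the alef variants normalized are taken to be alef with hamza above, alef with hamza below, and alef with madda; prefixes are tried longest first; and a prefix is stripped only if at least MIN_STEM_LENGTH letters remain, a threshold the slides do not state. Letters are written as Unicode escapes to avoid right-to-left rendering problems in source code.

```python
import re

ALEF, ALEF_MADDA = "\u0627", "\u0622"
ALEF_HAMZA_ABOVE, ALEF_HAMZA_BELOW = "\u0623", "\u0625"
ALEF_MAKSURA, YEH = "\u0649", "\u064A"
TEH_MARBUTA, HEH = "\u0629", "\u0647"
WAW, LAM, BEH, FEH = "\u0648", "\u0644", "\u0628", "\u0641"

# Fathatan through sukun; this range includes shadda, damma, and fatha.
DIACRITICS = re.compile("[\u064B-\u0652]")
# Anything that is not an Arabic letter or whitespace (punctuation, digits,
# tatweel, Latin characters) is treated as a non-letter.
NON_LETTERS = re.compile("[^\u0621-\u063A\u0641-\u064A\\s]")
ALEF_VARIANTS = re.compile("[" + ALEF_MADDA + ALEF_HAMZA_ABOVE + ALEF_HAMZA_BELOW + "]")
# wal-, bal-, fal-, al-, wa- : longest prefixes first (assumed order).
PREFIXES = (WAW + ALEF + LAM, BEH + ALEF + LAM, FEH + ALEF + LAM, ALEF + LAM, WAW)
MIN_STEM_LENGTH = 3  # assumed threshold, not taken from the slides


def light_stem(token):
    """Normalize one token and strip at most one prefix."""
    token = ALEF_VARIANTS.sub(ALEF, token)
    if token.endswith(ALEF_MAKSURA):
        token = token[:-1] + YEH
    if token.endswith(TEH_MARBUTA):
        token = token[:-1] + HEH
    for prefix in PREFIXES:
        if token.startswith(prefix) and len(token) - len(prefix) >= MIN_STEM_LENGTH:
            return token[len(prefix):]
    return token


def preprocess(text):
    """Remove diacritics and non-letters, then stem each remaining token."""
    text = DIACRITICS.sub("", text)
    text = NON_LETTERS.sub(" ", text)
    return [light_stem(token) for token in text.split()]
```

Replacing non-letters with a space rather than deleting them keeps two words apart when they are separated only by punctuation.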

Apache ZooKeeper

ZooKeeper is a service for maintaining configuration information, naming, and providing distributed synchronization.

DAS uses ZooKeeper as a system of record for the cluster state, for central config, and for leader election. [Junqueira, 2013]

AS with ZooKeeper

AS core

Distributed indexing and requesting

Distributed Indexing

Distributed Requesting

Analysis of Distributed Indexing and Requesting

ANALYSIS OF DISTRIBUTED INDEXING AND REQUESTING

Distributed Indexing

Distributed Requesting

UI

The user interface subsystem is an interface that allows the business owner to deal with statistical data.

Written in PHP with the Yii framework.

Supports Responsive Web Design (RWD) using Twitter Bootstrap.

Experimentation Setup and Results

Setup Environment

Data from Opensooq.com

The Analytics Subsystem is implemented in a distributed fashion on 4 identical servers, 2 leaders and 2 replicas.

The servers were provisioned on Amazon Elastic Compute Cloud (Amazon EC2), which provides resizable compute capacity in the cloud, and were located in the Oregon region.

Elasticsearch (ES)

Elasticsearch is a flexible and powerful open source, distributed, real-time search and analytics engine.

Elasticsearch uses Lucene under the covers to provide the most powerful full text search capabilities available in any open source product. [Kuc, 2013]

ES Case Studies

Metrics Used to Compare DAS with ES

Index Size
Speed
Throughput
CPU Utilization
RAM Usage
Transfer Rate
Failed Requests

Index Size

Speed

Throughput

CPU and RAM Utilization

Transfer Rate

Failed Requests

The Performance of DAS Under Failure Scenarios

Failure Scenarios

We tested four scenarios to evaluate availability and fault tolerance:

Leader 1 failure.
Leader 1 and Leader 2 failure (i.e., both leaders fail, but no replica does).
Leader 1 and Replica 2 failure (i.e., a mismatched leader and replica fail).
Replica 1 and Replica 2 failure (i.e., both replicas fail).
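
One way to reason about these four scenarios is a small availability model. The sketch below assumes that Leader i and Replica i hold copies of the same shard, and that a shard stays reachable while at least one of its copies is alive. That pairing is an assumption inferred from the scenario names, not something the slides state, and the model says nothing about measured speed or failed requests.

```python
# Assumed topology: each shard has one leader and one matching replica.
SHARDS = {
    "shard1": {"leader1", "replica1"},
    "shard2": {"leader2", "replica2"},
}

SCENARIOS = {
    "Leader 1": {"leader1"},
    "Leader 1 and Leader 2": {"leader1", "leader2"},
    "Leader 1 and Replica 2": {"leader1", "replica2"},
    "Replica 1 and Replica 2": {"replica1", "replica2"},
}


def reachable_shards(failed):
    """Return the shards that still have at least one live copy."""
    return sorted(name for name, nodes in SHARDS.items() if nodes - failed)


def surviving_nodes(failed):
    """Return the nodes left to serve requests."""
    all_nodes = set().union(*SHARDS.values())
    return sorted(all_nodes - failed)


report = {
    name: (reachable_shards(failed), surviving_nodes(failed))
    for name, failed in SCENARIOS.items()
}
```

Under this model, none of the four scenarios removes both copies of the same shard, so every shard stays reachable; what changes is how many nodes are left to share the load. Losing a leader together with its own replica would be the case that makes a shard unreachable.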

Speed in the Presence of Failures

Percentage of Failed Requests in the Presence of Failures

Throughput

Why Go with a Distributed Solution?

Speedup

CONCLUSION AND FUTURE WORK

CONCLUSION

DAS’s index was 24% smaller than ES’s index.

DAS answered requests 21% faster than ES, measured as time per request.

ES was more memory efficient, using on average 67% of the memory used by DAS.

ES’s CPU consumption was 2.4 times that of DAS.

Future Work

Distributed index with Apache Hadoop.

DAS with Redis.

REFERENCES

Junqueira, Flavio and Reed, Benjamin. (2013). ZooKeeper: Distributed Process Coordination (1st ed.). USA: O'Reilly Media, pp. 20-134.
Kuc, Rafal and Rogozinski, Marek. (2013). Mastering ElasticSearch (1st ed.). UK: Packt Publishing Ltd, pp. 45-156.
McCandless, Michael, Hatcher, Erik, and Gospodnetic, Otis. (2010). Lucene in Action (2nd ed.). U.K.: Manning Publications.
Talia, Domenico. (2012). Distributed Data Mining Tasks and Patterns as Services. DPA workshop.

Thank you for your attention
