Data-Driven Solutions for Trend Mining in Open Big Data: A Multi-Domain
Approach

Dima Kagan

32

2008-?

This presentation was made by GPT-Chat.

This presentation was not made by GPT-Chat.

Me and MY Research

  • Make a positive impact
  • Can make someone's life better
  • Has the potential to change policies
  • Can inspire more research
  • Can change regulations
  • Practical!

If what you are working on is not important, and it is not likely to lead to important things, why are you working on it?

Richard Hamming

 

The data revolution

DATA DRIVEN SOLUTIONS

Zooming Into Video Conferencing Privacy

Dima Kagan, Galit Fuhrmann Alpert, Michael Fire

Video conferencing Privacy & Security

Information Leakage

 

 

Malware Attacks

 

 

Data Breach

 

Fake Avatars

 

 

Phishing Attacks

 

 

Zoombombing

Multiparty Privacy

Credit: Such, Jose M., and Natalia Criado. "Multiparty privacy in social media." Communications of the ACM 61.8 (2018): 74-81.

89,305 Zoom related tweets

90,395 Zoom related Instagram posts

16,133 video conferencing images

26,408 images from Twitter

78,435 images from Instagram

A video conferencing detection model was trained based on ResNet-50.

Accuracy = 0.969, TPR = 0.935, FPR = 0.016
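The three reported metrics follow the standard confusion-matrix definitions. A minimal sketch of those definitions is below; the counts in the usage note are made-up toy values, not the study's data, and the function name is hypothetical.

```python
def detection_metrics(tp, fp, tn, fn):
    """Return (accuracy, TPR, FPR) from confusion-matrix counts.

    tp/fn: video conferencing images classified correctly/incorrectly.
    tn/fp: other images classified correctly/incorrectly.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    tpr = tp / (tp + fn)  # true positive rate (recall)
    fpr = fp / (fp + tn)  # false positive rate
    return accuracy, tpr, fpr
```

For example, with toy counts `detection_metrics(tp=90, fp=5, tn=95, fn=10)` the three values are 0.925, 0.9, and 0.05.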

Duplicates removed using dhash

Duplicates removed using the classifier's last layer

15,709 unique images
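The dhash step can be sketched as follows. This is an illustrative pure-Python version, not the code used in the study: it assumes each image has already been converted to grayscale and resized to a small grid (commonly 8 rows of 9 pixels), and the function names and the `max_distance` parameter are hypothetical.

```python
def dhash(pixels):
    """Difference hash of a grayscale image given as rows of intensities.

    Each bit records whether a pixel is brighter than its right neighbour,
    so the hash is stable under uniform brightness changes.
    """
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def deduplicate(images, max_distance=0):
    """Keep the first image of every group whose hashes are within max_distance.

    images: list of (name, pixels) pairs. Returns the kept names in order.
    """
    kept = []  # (name, hash) of images kept so far
    for name, pixels in images:
        h = dhash(pixels)
        if all(hamming(h, kept_hash) > max_distance for _, kept_hash in kept):
            kept.append((name, h))
    return [name for name, _ in kept]
```

A brightened copy of an image produces the same hash and is dropped, while an image with a different left-to-right gradient pattern is kept.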

  1. Face detection - MTCNN + Azure Face API
  2. Face embedding - dlib 128-dimensional vector
  3. Age detection - average of the two estimates: (ResNet-50 based model + Azure Face API) / 2
  4. Gender detection - Azure Face API
  5. Username recognition - EAST for detecting the text location and MORAN for subsequently recognizing the characters.
    Lemmatization was used to filter out dictionary words that are not names.
    Separate words were merged into names based on their distance.
  6. Background detection - a Faster R-CNN was trained to detect each participant's screen
  • A social graph was created using the face embeddings.
  • Each user was connected to all other participants in the same meeting and to all of their other meetings.
  • Users can also be linked by name and by similar background.
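The social graph construction can be sketched roughly as below. This is a simplified illustration under stated assumptions, not the study's implementation: faces are treated as the same person when the Euclidean distance between their embeddings is below 0.6 (the cutoff commonly quoted for dlib's 128-dimensional embeddings), matching is greedy against the first embedding seen for each person, and `build_social_graph` is a hypothetical name.

```python
import math
from itertools import combinations


def build_social_graph(meetings, threshold=0.6):
    """Link meeting participants through matching face embeddings.

    meetings: list of meetings, each a list of face embeddings (number sequences).
    Returns (number_of_people, edges), where edges is a set of (id_a, id_b)
    pairs for people who appeared together in at least one meeting.
    """
    people = []  # one representative embedding per identified person
    edges = set()
    for faces in meetings:
        ids = []
        for embedding in faces:
            for person_id, representative in enumerate(people):
                if math.dist(embedding, representative) < threshold:
                    break  # same person seen in an earlier image
            else:
                person_id = len(people)  # first time this face is seen
                people.append(embedding)
            ids.append(person_id)
        # Connect every pair of participants in the same meeting.
        for a, b in combinations(sorted(set(ids)), 2):
            edges.add((a, b))
    return len(people), edges
```

Because a person keeps the same id across images, someone who appears in two meetings becomes a bridge between the participants of both.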

Automatic Large Scale Detection of Red Palm Weevil Infestation using Street View Images

Dima Kagan, Galit Fuhrmann Alpert, Michael Fire

ISPRS Journal of Photogrammetry and Remote Sensing

Red Palm Weevil

South American Palm Weevil

Palm Weevils

The palm weevil is a type of beetle that infests palm trees. It lays eggs and feeds on the palm's tissue, creating tunnels inside the trunk that weaken its structure and cause extensive damage, resulting in tree decline and eventual breakage. The beetles can travel up to 50 km a day, spreading rapidly and putting geographically widespread locations at risk.

 

The Problem

  • Today, the Red Palm Weevil has spread to 85 different countries and regions worldwide.
  • The Food and Agriculture Organization of the United Nations estimates that in 2023 the combined cost of pest management and replacement of damaged palm trees, in Spain and Italy alone, will reach 200 million euros.

Photo taken by Omer Tsur - AirWorks aerial photography

Palm Weevils Life Cycle

How Does It Look

Existing Solutions

Publicly Available Data + Computer Vision

  • Applicable at large scale.
  • Relatively affordable.
  • Can be used at any point in the world.
  • Can be used anywhere, especially in urban areas.

Street View

  • Provides 360-degree images.
  • Contains historical data.
  • Worldwide availability.
  • Many vendors:
    • Google
    • Microsoft
    • Yandex
    • Apple
    • KartaView
    • etc.

Google Street View

  • 0.007 USD per image (7.00 USD per 1,000).
  • Each image covers up to 120 degrees of the panorama.
  • The maximum image size is 640x640.
  • We estimate that fully mapping the San Diego area using only street view images would cost around $51,000.
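The pricing arithmetic can be made explicit with a short sketch. It uses only the numbers on this slide (7.00 USD per 1,000 images, up to 120 degrees per image). The function names are hypothetical, and how the $51,000 estimate was actually derived is not stated here, so the back-calculation is only indicative.

```python
import math

PRICE_PER_1000_IMAGES_USD = 7.00
MAX_FIELD_OF_VIEW_DEG = 120


def images_per_panorama(field_of_view_deg=MAX_FIELD_OF_VIEW_DEG):
    """Images needed to cover a full 360-degree panorama at one location."""
    return math.ceil(360 / field_of_view_deg)


def street_view_cost(n_locations):
    """Cost in USD of fetching full panoramas at n_locations."""
    n_images = n_locations * images_per_panorama()
    return n_images * PRICE_PER_1000_IMAGES_USD / 1000


def images_for_budget(budget_usd):
    """Number of images that a given budget pays for."""
    return budget_usd / PRICE_PER_1000_IMAGES_USD * 1000
```

At these prices a full panorama takes three images, and a budget of $51,000 corresponds to roughly 7.3 million images, which is the motivation for using cheaper aerial imagery to decide where to request street view images.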

Aerial Imagery

  • Cheaper, sometimes even free.
  • Show larger areas than street view.

Aerial Palm Detection

\theta = \operatorname{atan2}(y_{s} - y_{a},\ x_{s} - x_{a})

[Figure labels: (x_{s}, y_{s}), (x_{a}, y_{a}), \theta]

A Faster R-CNN model was trained to detect palm trees in aerial imagery, with an mAP of 0.5.
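The angle formula on this slide can be evaluated directly with `math.atan2`. The sketch below assumes that the subscripts s and a denote the street view and aerial points, and that x points east and y points north in a local planar coordinate system; neither is spelled out on the slide. The conversion to a compass heading is an added convenience, not something the slide defines.

```python
import math


def angle_between(x_s, y_s, x_a, y_a):
    """theta = atan2(y_s - y_a, x_s - x_a), in radians.

    Measured counter-clockwise from the positive x axis.
    """
    return math.atan2(y_s - y_a, x_s - x_a)


def to_compass_heading(theta):
    """Convert a math-convention angle to a compass heading in degrees.

    Assumes x is east and y is north, so 0 is north and 90 is east.
    """
    return (90.0 - math.degrees(theta)) % 360.0
```

Street view APIs typically take a compass heading for the camera direction, which is why the conversion is useful when pointing a street view request at a tree detected from the air.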

Palm Tree Detection & Classification

[Figure: points (x_{s}, y_{s}) and (x_{a}, y_{a}), with the direction marker N]

Faster R-CNN (mAP 0.9)

Palm Tree Detection & Classification

  • An XResNet model was trained to detect infested palm trees.
  • Initially, Binary Cross Entropy loss was used to filter unknown labels. However, it did not perform as well as expected.
  • A new "Unknown" class was added to the training data.
  • As for out-of-domain data:
    • We manually added street view data.
    • We uniformly under-sampled 8 images from each class of Caltech 101.
  • The infested palm tree data was oversampled from 70 to 892 images.
  • The palm tree health classifier achieved an F1 score of 0.84, precision of 0.83, recall of 0.85, and AUC of 0.948.
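As a quick consistency check, the F1 score is the harmonic mean of precision and recall, so the three reported numbers can be verified against each other. The function name below is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Reported precision and recall of the palm tree health classifier.
reported_f1 = round(f1_score(0.83, 0.85), 2)
```

With precision 0.83 and recall 0.85 the result rounds to 0.84, which agrees with the reported F1 score.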

Street vs. Aerial Mapping

  • 546 aerial images.
  • 756 palm trees detected in aerial images.
  • 4,544 street level images.
  • 72% match between palm trees detected in aerial images and at street level.

Normal Heights Village, San Diego

Case Study  

  • Has a Palm Weevil problem.

  • Has ground truth data of infested palm trees (reports).

  • Has historical data in Google Street View.

Analysis

  • Scanning the area covered by the ground truth data, we detected 5,008 palm trees.
  • Our classifier classified five of the six ground truth trees as infested.
  • Additionally, out of the 5,008 detected palms, the classifier detected 13 more infested palm trees, of which we identified eight at advanced infestation stages.

Searching on a Large Scale

22,438 aerial images

54,781 street view images

 36,001 palm trees

109 were classified as infested

24 were infested at an advanced stage

[Figures: four image series, each with panels dated Jun 2014, Nov 2017, April 2018, and April 2019]

Limitations & Challenges

  • Different species of palm trees have different symptoms.
  • Fertilizer may delay the visual appearance of the symptoms.
  • Aerial images are not taken at the same time as street view images.
  • Street view images are not taken at constant intervals, only at sporadic time points.
  • When classifying vegetation, external or seasonal variables can affect how the plant looks and, as a result, influence the classifier.
  • The current method and data do not support the detection of borderline cases or detecting the severity of the infestation.

Future Directions


Dima Kagan, Thomas Chesney, Michael Fire

Humanities and Social Sciences Communications

 

USING DATA SCIENCE TO UNDERSTAND GENDER GAP IN THE FILM INDUSTRY

Data Science

 

 

Network Science

 

Constructing Movie Social Networks

Matching entities in the movie subtitles with the characters

 

 

Data: PersonName, Roles, Threshold
Result: Matched character
Names ← PersonName.split();
foreach Ni ∈ Names do
    if Roles[Ni].length = 1 then
        return Roles[Ni];
    end
    return Max(WeightedRatio(PersonName, Roles[Ni], Threshold))
end
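One possible reading of the matching pseudocode, as a runnable sketch. Several details are assumptions rather than the authors' implementation: `Roles` is taken to be an index from a name token to the characters containing it, the standard library's `difflib.SequenceMatcher` stands in for `WeightedRatio` (so the threshold is on a 0 to 1 scale), and the loop moves on to the next token when no candidate passes the threshold. All function names are hypothetical.

```python
from difflib import SequenceMatcher


def build_role_index(character_names):
    """Map each lower-cased name token to the characters that contain it."""
    index = {}
    for character in character_names:
        for token in character.split():
            index.setdefault(token.lower(), []).append(character)
    return index


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1] (stand-in for WeightedRatio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_character(person_name, roles, threshold=0.6):
    """Return the character matching person_name, or None if nothing matches."""
    for token in person_name.split():
        candidates = roles.get(token.lower(), [])
        if len(candidates) == 1:
            return candidates[0]  # the token identifies a single character
        # Ambiguous token: keep the most similar candidate above the threshold.
        scored = [(similarity(person_name, c), c) for c in candidates]
        scored = [pair for pair in scored if pair[0] >= threshold]
        if scored:
            return max(scored)[1]
    return None
```

With the cast "Marty McFly", "Lorraine McFly", and "George McFly", the entity "Marty" resolves immediately because only one character contains that token, whereas "McFly" alone is ambiguous and has to go through the similarity step.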

Constructing Movie Social Networks

W=W+1
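The weight update W = W + 1 can be sketched as a counter over character pairs. How interactions are delimited is not stated on this slide, so the sketch assumes the subtitles have already been grouped into interactions, each a list of matched character names. The function name is hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_weighted_network(interactions):
    """Count how often each pair of characters appears in the same interaction.

    interactions: iterable of lists of character names.
    Returns a Counter mapping (name_a, name_b), sorted alphabetically, to weight.
    """
    weights = Counter()
    for characters in interactions:
        for a, b in combinations(sorted(set(characters)), 2):
            weights[(a, b)] += 1  # W = W + 1 for every shared interaction
    return weights
```

Pairs that interact repeatedly accumulate a higher weight, so the strongest edges point to the relationships that dominate the film.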

Challenges

 

Am I to understand you're still hanging
around with Dr. Emmett Brown, McFly?

Marty McFly

Lorraine McFly

George McFly

15,540

Subtitles

Cast Data

Evaluation

  • On average, Subs2Network contained more of the top-10 most central characters than ScriptNetwork.
  • In terms of edge coverage, Subs2Network covered 65.4% of the edges in the ScriptNetwork networks, and ScriptNetwork covered 65.1% of the edges in the Subs2Network networks.

 

          ScriptNetwork   Subs2Network
Top-5     2.70            2.80
Top-10    5.35            6.06
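Edge coverage between two networks of the same film can be computed as the share of one network's edges that also appear in the other. The sketch below assumes undirected edges and that character names have already been aligned between the two networks; the function name is hypothetical.

```python
def edge_coverage(covering_edges, covered_edges):
    """Fraction of covered_edges that also appear in covering_edges.

    Edges are undirected, so (a, b) and (b, a) count as the same edge.
    """
    covering = {frozenset(edge) for edge in covering_edges}
    covered = {frozenset(edge) for edge in covered_edges}
    if not covered:
        return 0.0
    return len(covering & covered) / len(covered)
```

The measure is asymmetric because the denominators differ, which is why the two directions are reported as separate percentages.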

Evaluation - Amazon X-Ray

 
  • We observed that Subs2Network matched X-Ray nodes and edges at 79.6% and 54.5%, respectively.
  • Analyzing character matching by screen time, we found that we could detect main characters with a high accuracy of up to 96.4%.

Who heard about the coronavirus before 2019?

Scientometric Trends for Coronaviruses and Other Emerging Viral Infections

Dima Kagan, Jacob Moran-Gilad, Michael Fire

GigaScience

Motivation

  • In the past hundred years, the human population quadrupled and the world became more connected than ever before.

 

 

  • 9,728                         1,270,406

     
  • The potential for disease spread is immense.

Brockmann, D., & Helbing, D. (2013). The hidden geometry of complex, network-driven contagion phenomena. Science, 342(6164), 1337-1342.

Datasets

  • Microsoft Academic Graph - 210 million papers
  • PubMed - 29 million medical papers
  • SJR - 34,000 journals
  • Wikidata - 78 million items

Research Questions

  • To what extent were the previous human coronavirus outbreaks (SARS and MERS) studied?
  • Is research on emerging viruses being sustained, aiming to understand and prevent future epidemics?
  • Are there lessons from academic publications on previous emerging viruses that could be applied to the current COVID-19 epidemic?

Normalized Paper Rate

The ratio between the number of papers published on a specific infectious disease d in year y and the total number of papers in the fields of medicine or biology in year y

NPR_{y}(d) = \frac{|\{ i \in P \mid i_{Year}=y \land i_{Disease}=d\}|}{|\{i \in P \mid i_{Year}=y\}|}
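A minimal sketch of the Normalized Paper Rate computation. It assumes P is given as a list of medicine or biology paper records with hypothetical `year` and `disease` fields; real bibliographic datasets would need their own mapping from papers to diseases.

```python
def normalized_paper_rate(papers, year, disease):
    """Share of the papers published in `year` that are about `disease`."""
    in_year = [p for p in papers if p["year"] == year]
    if not in_year:
        return 0.0  # no papers that year, so the rate is defined as zero here
    about_disease = [p for p in in_year if p["disease"] == disease]
    return len(about_disease) / len(in_year)
```

Normalizing by the yearly total matters because overall publication volume grows over time, so raw paper counts would overstate the growth of interest in any single disease.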

Normalized Citation Rate

The ratio between the number of citations of papers on a specific infectious disease d and the total number of citations of papers in medicine or biology in year y

NCR_{y}(d) = \frac{\sum_{\{i \in P \mid i_{Year}=y \land i_{Disease}=d\}} |i_{citations}|}{\sum_{\{j \in P \mid j_{Year}=y\}} |j_{citations}|}
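A minimal sketch of the Normalized Citation Rate, mirroring the paper-rate definition. It assumes each paper record carries hypothetical `year`, `disease`, and `citations` fields, with `citations` being the number of citations the paper received. The slide does not say whether citations are attributed to the publication year or the citing year, so this sketch simply groups by the paper's year.

```python
def normalized_citation_rate(papers, year, disease):
    """Share of the citations to papers from `year` that go to papers on `disease`."""
    in_year = [p for p in papers if p["year"] == year]
    total = sum(p["citations"] for p in in_year)
    if total == 0:
        return 0.0  # no citations that year, so the rate is defined as zero here
    on_disease = sum(p["citations"] for p in in_year if p["disease"] == disease)
    return on_disease / total
```

Comparing this rate with the paper rate for the same disease and year indicates whether work on that disease attracts more or less attention per paper than the field as a whole.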

Author trends

Disease             Median Experience in Years
SARS                4
Avian Influenza     5
Swine Flu           5
Hepatitis B         5
Ebola               5
Influenza           6
HIV/AIDS            7
MERS Coronavirus    7
Hepatitis C         8

Disease             Papers
Swine Flu           3.45
SARS                3.84
MERS Coronavirus    3.86
Ebola               4.07
Hepatitis B         4.42
Avian Influenza     4.47
Influenza           5.04
Hepatitis C         5.24
HIV/AIDS            6.31

Publications

  • Dima Kagan, Galit Fuhrmann Alpert, and Michael Fire. “Zooming into video conferencing privacy and security threats”. In: IEEE Transactions on Computational Social Systems (2022)
  • Dima Kagan, Galit Fuhrmann Alpert, and Michael Fire. “Automatic large scale detection of red palm weevil infestation using street view images”. In: ISPRS Journal of Photogrammetry and Remote Sensing 182 (2021), pp. 122–133
  • Dima Kagan, Jacob Moran-Gilad, and Michael Fire. “Scientometric trends for coronaviruses and other emerging viral infections”. In: GigaScience 9.8 (Aug. 2020).
  • Dima Kagan, Thomas Chesney, and Michael Fire. “Using data science to understand the film industry's gender gap”. In: Palgrave Communications 6.1 (2020), pp. 1–16.

Publications

  • Dima Kagan, Michael Fire, and Galit Fuhrmann Alpert. “Trends in Computer Science Research within European Countries: Independent and Collaborative Contributions”. In: Communications of the ACM (2022).
  • Shay Lapid, Dima Kagan, and Michael Fire. “Co-Membership-based Generic Anomalous Communities Detection”. In: Neural Processing Letters (2022).
  • Michael Fire, Rami Puzis, Dima Kagan, and Yuval Elovici. “Large-Scale Shill Bidder Detection in E-commerce”. The 27th International Database Engineered Applications Symposium, 2023.

UNDER REVIEW

  • Dima Kagan, Mor Levy, Michael Fire, and Galit Fuhrmann Alpert. “Ethnic Representation Analysis of Commercial Movie Posters”. 2022.
  • Shmuel Horowitz, Dima Kagan, Galit Fuhrmann Alpert, and Michael Fire. “Interruptions detection in video conferences”. 2023.

Thank you! Questions?

My Research Detailed

By Dima Kagan
