Malicious Profile Identification in Online Social Networks

Dima Kagan

Supervisors: MICHAEL FIRE, YUVAL ELOVICI

Complex Networks

Related Work

Reputation-based filtering [Golbeck and Hendler].
Topology-based identification [Fire et al.].
Graph centrality measure based spammer identification [DeBarr and Wechsler].
Spammer detection in social networks using “honey-profiles” [Stringhini et al.].
Clustering groups of accounts that act similarly at around the same time for a sustained period [Cao et al.].

Link Prediction

Crowd Wisdom

Supervised Fake Profile Identification in Online Social Networks

Facebook App

\begin{aligned}
CS(u,v) = {} & \text{Common-Friends}(u,v) \\
& + \text{Common-Chat-Messages}(u,v) \\
& + 2\cdot \text{Common-Groups-Number}(u,v) \\
& + 2\cdot \text{Common-Posts-Number}(u,v) \\
& + 2\cdot \text{Tagged-Photos-Number}(u,v) \\
& + 2\cdot \text{Tagged-Videos-Number}(u,v) \\
& + 1000\cdot \text{Are-Family}(u,v)
\end{aligned}

connection strength heuristic
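A minimal Python sketch of the connection strength heuristic, using the weights from the formula above. The `LinkStats` container and its field names are illustrative assumptions and are not taken from the Facebook app's code.

```python
from dataclasses import dataclass


@dataclass
class LinkStats:
    """Interaction counts between user u and friend v (hypothetical field names)."""
    common_friends: int = 0
    common_chat_messages: int = 0
    common_groups: int = 0
    common_posts: int = 0
    tagged_photos: int = 0
    tagged_videos: int = 0
    are_family: bool = False


def connection_strength(s: LinkStats) -> int:
    """CS(u,v): weighted sum of shared activity, with a large bonus for family ties."""
    return (
        s.common_friends
        + s.common_chat_messages
        + 2 * s.common_groups
        + 2 * s.common_posts
        + 2 * s.tagged_photos
        + 2 * s.tagged_videos
        + 1000 * int(s.are_family)
    )


def weakest_links_first(friends: dict) -> list:
    """Sort friend ids by ascending CS, so the weakest connections come first."""
    return sorted(friends, key=lambda f: connection_strength(friends[f]))
```

Sorting in ascending order surfaces the friends with the least shared activity. How the app turns this ranking into restriction recommendations is not specified on the slide.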

Browser addon

Architecture

Collected Data - Facebook App

Are-Family(u,v)
Common-Friends(u,v)
Common-Groups-Number(u,v)
Common-Posts-Number(u,v)
Common-Chat-Messages(u,v)
Tagged-Photos-Number(u,v)
Tagged-Videos-Number(u,v)
Friends-Number(u)
Friends-Number(v)

Collected Data - Addon

Installed-Application-Number
Default-Privacy-Settings
Lookup
Share-Address
Send-Messages
Receive-Friend-Requests
Tag-Suggestions
View-Birthday

ML Datasets

Fake profiles dataset: Recommended restricted links set + All unrestricted links set.
Friends restriction dataset: Alphabetically restricted links set + All unrestricted links set.
All links dataset: Contains all the links.

Dataset	Users	Restricted Links	Unrestricted Links
Fake-Profiles	434	2,860	138,286
Friends Restrictions	355	6,145	138,286
All Links	527	9,005	138,286

COLLECTED DATA

Additional Features

Common-Groups-Ratio(u,v)
Common-Posts-Ratio(u,v)
Common-Chat-Messages-Ratio(u,v)
Common-Photos-Ratio(u,v)
Common-Videos-Ratio(u,v)
Is-Friend-Profile-Private(v)
Jaccard's-Coefficient(u,v)

Classifier	Measure	Fake Profiles	Friends Restriction	All Links
OneR	AUC	0.861	0.511	0.608
OneR	F-Measure	0.867	0.531	0.616
OneR	False-Positive	0.179	0.532	0.414
OneR	True-Positive	0.902	0.554	0.623
J48	AUC	0.925	0.684	0.72
J48	F-Measure	0.885	0.668	0.659
J48	False-Positive	0.179	0.498	0.321
J48	True-Positive	0.937	0.754	0.654
IBK (K=10)	AUC	0.833	0.587	0.545
IBK (K=10)	F-Measure	0.744	0.49	0.637
IBK (K=10)	False-Positive	0.174	0.289	0.749
IBK (K=10)	True-Positive	0.696	0.419	0.817
Naive-Bayes	AUC	0.902	0.73	0.75
Naive-Bayes	F-Measure	0.833	0.677	0.675
Naive-Bayes	False-Positive	0.373	0.403	0.3
Naive-Bayes	True-Positive	0.979	0.717	0.662
Bagging	AUC	0.946	0.698	0.728
Bagging	F-Measure	0.89	0.645	0.657
Bagging	False-Positive	0.171	0.403	0.312
Bagging	True-Positive	0.941	0.671	0.643
AdaBoostM1	AUC	0.937	0.698	0.728
AdaBoostM1	F-Measure	0.882	0.645	0.657
AdaBoostM1	False-Positive	0.163	0.403	0.312
AdaBoostM1	True-Positive	0.941	0.671	0.643
Rotation-Forest	AUC	0.948	0.79	0.778
Rotation-Forest	F-Measure	0.897	0.719	0.696
Rotation-Forest	False-Positive	0.158	0.336	0.275
Rotation-Forest	True-Positive	0.941	0.75	0.681
Random-Forest	AUC	0.933	0.706	0.716
Random-Forest	F-Measure	0.858	0.613	0.663
Random-Forest	False-Positive	0.14	0.278	0.369
Random-Forest	True-Positive	0.857	0.565	0.679

P@K

average users’ precision@k
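A short sketch of how average users' precision@k could be computed. It assumes each user's links are already sorted by classifier score, highest first, and that a label of 1 marks a link the user actually restricted. Both the input layout and the function names are assumptions for illustration.

```python
def precision_at_k(ranked_labels, k):
    """Fraction of the top-k ranked links whose true label is 1."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(ranked_labels[:k]) / k


def average_users_precision_at_k(per_user_ranked_labels, k):
    """Mean of precision@k over all users (each user weighted equally)."""
    if not per_user_ranked_labels:
        raise ValueError("at least one user is required")
    scores = [precision_at_k(labels, k) for labels in per_user_ranked_labels]
    return sum(scores) / len(scores)
```

A user with fewer than k links is still divided by k here, which penalizes short lists. Skipping such users is an equally plausible convention.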

Information gain

Applications Installation and Removal Analysis

Application Data

Hashed User Id
Installed Application Number - the number of Facebook applications installed on the user's Facebook account.
Date - the date when the information was collected.

T-Test

Null hypothesis: \overline{\text{Regular Users}} = \overline{\text{Addon Users}}

Alternative hypothesis: \overline{\text{Regular Users}} \neq \overline{\text{Addon Users}}

Two-sample t-test:

Add-on Users (µ = 0.236; stdev = 0.12)
Regular Users (µ = -0.19; stdev = 0.05)

T-test results:

(t = 25.936; p-value < 2.2e-16)

AppChangeRatio(u,d) := \frac{AppNum(u,0) - AppNum(u,d)}{AppNum(u,0)}
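The ratio and the two-sample comparison can be sketched in plain Python. The slides do not state whether the test pooled variances, so the statistic below uses Welch's unequal-variance form as an assumption. The function names are illustrative.

```python
import math
from statistics import mean, variance


def app_change_ratio(app_num_day0, app_num_day_d):
    """AppChangeRatio(u,d): relative drop in installed apps since day 0.

    Positive values mean applications were removed, negative values mean
    applications were added.
    """
    if app_num_day0 == 0:
        raise ValueError("AppNum(u,0) must be non-zero")
    return (app_num_day0 - app_num_day_d) / app_num_day0


def welch_t_statistic(sample_a, sample_b):
    """Two-sample t statistic without assuming equal variances (Welch)."""
    se = math.sqrt(variance(sample_a) / len(sample_a)
                   + variance(sample_b) / len(sample_b))
    return (mean(sample_a) - mean(sample_b)) / se
```

Under this definition a user who went from 20 to 15 applications has a ratio of 0.25, which matches the sign of the add-on users' mean reported above.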

Regular Users

Addon Users

ApplicationChangePercent = 0.006Days + 0.05

R^2 = 0.736; p-value = 2.2e-16

ApplicationChangePercent = -0.002Days-0.125

R^2 = 0.57; p-value = 1.351e-12

Labeling Data is Hard

Unsupervised Anomaly Detection in Graphs Utilizing a Link Prediction Algorithm

Malicious Users Tend to Connect to Other Profiles Randomly

Topology Based

Feature Extraction

16 features for directed graphs

8 features for undirected graphs

◦ For undirected graphs:

Common Friends: |\Gamma(v) \cap \Gamma(u)|
Total Friends: |\Gamma(v) \cup \Gamma(u)|
Jaccard’s Coefficient: \frac{|\Gamma(v) \cap \Gamma(u)|}{|\Gamma(v) \cup \Gamma(u)|}

◦ For directed graphs:

Transitive Friends: |\Gamma_{in}(v) \cap \Gamma_{out}(u)|
Opposite Direction Friends: \begin{cases} 1, & \text{if}\ (u,v)\in E \\ 0, & \text{otherwise} \end{cases}
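The topology features listed above can be sketched with plain Python sets. This is an illustration under assumptions and not the code from the linked repository: the undirected graph is a dict mapping each vertex to its neighbor set, the directed graph is a set of (source, target) pairs, and the opposite-direction indicator follows the slide's condition on (u,v) literally.

```python
def undirected_link_features(adj, u, v):
    """Common Friends, Total Friends and Jaccard's Coefficient for the pair (u, v)."""
    gamma_u = adj.get(u, set())
    gamma_v = adj.get(v, set())
    common = len(gamma_u & gamma_v)
    total = len(gamma_u | gamma_v)
    return {
        "common_friends": common,
        "total_friends": total,
        # Defined as 0 when both neighborhoods are empty, to avoid dividing by zero.
        "jaccard": common / total if total else 0.0,
    }


def directed_link_features(edges, u, v):
    """Transitive Friends and Opposite Direction Friends for the pair (u, v)."""
    out_u = {b for (a, b) in edges if a == u}  # Gamma_out(u)
    in_v = {a for (a, b) in edges if b == v}   # Gamma_in(v)
    return {
        "transitive_friends": len(in_v & out_u),
        "opposite_direction_friends": 1 if (u, v) in edges else 0,
    }
```

Scanning the edge set on every call keeps the sketch short. A real implementation would precompute the in- and out-neighbor maps once.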

Link Classification

Aggregation of The Results

Meta Feature Extraction

We extracted 7 features. Here p(v,u) is the confidence that an edge is fake, and EP(v) denotes the set of probabilities p(v,u) over u \in \Gamma(v).

AbnormalityVertexProbability(v) := \frac{1}{|\Gamma(v)|}\sum\nolimits_{u \in \Gamma(v)}p(v,u)

EdgesProbabilitySTDV(v) := \sigma(EP(v))

EdgesProbabilityMedian(v) := median(EP(v))

SumEdgeLabel(v) := \sum\nolimits_{u \in \Gamma(v)} EdgeLabel(v,u)

MeanPredictedLinkLabel(v) := \frac{1}{|\Gamma(v)|}\sum\nolimits_{u \in \Gamma(v)} EdgeLabel(v,u)

PredictedLabelSTDV(v) := \sigma(\lbrace EdgeLabel(v,u) | u \in \Gamma(v), u,v \in V \rbrace)

EdgeCount(v) := |\Gamma(v)|
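A minimal sketch of aggregating per-edge predictions into the seven vertex-level meta features. It makes three assumptions that the slides do not confirm: EdgeLabel(v,u) is obtained by thresholding p(v,u) at 0.5, the edge probabilities are the values p(v,u) over the neighbors of v, and σ is the population standard deviation.

```python
from statistics import mean, median, pstdev


def vertex_meta_features(edge_probs, threshold=0.5):
    """Aggregate the edge probabilities of one vertex v into 7 meta features.

    edge_probs maps each neighbor u of v to p(v,u), the classifier's
    confidence that the edge (v,u) is fake.
    """
    if not edge_probs:
        raise ValueError("vertex has no edges")
    probs = list(edge_probs.values())
    # Assumed labeling rule: an edge is predicted fake when p(v,u) >= threshold.
    labels = [1 if p >= threshold else 0 for p in probs]
    return {
        "AbnormalityVertexProbability": mean(probs),
        "EdgesProbabilitySTDV": pstdev(probs),
        "EdgesProbabilityMedian": median(probs),
        "SumEdgeLabel": sum(labels),
        "MeanPredictedLinkLabel": mean(labels),
        "PredictedLabelSTDV": pstdev(labels),
        "EdgeCount": len(probs),
    }
```

The resulting dictionary is one row of the vertex-level dataset, so a second classifier or a simple ranking by AbnormalityVertexProbability can then flag anomalous vertices.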

outline

Datasets

Network	Is Directed	Vertices Number	Links Number	Date	Labeled
Academia	Yes	200,169	1,389,063	2011	No
Anybeat	Yes	12,645	67,053	2011	No
ArXiv HEP-PH	No	34,546	421,578	2003	No
CLASS OF 1880/81	Yes	53	179	1881	Yes
DBLP	No	1,665,850	13,504,952	2016	No
Google+	Yes	107,614	13,673,453	2012	No
Orkut	No	3,072,441	117,185,083	2012	No
Twitter	Yes	5,384,160	16,011,443	2012	Yes
Xing	No	1,053,754	2,161,968	2012	No
Yelp	No	249,443	3,563,818	2016	No

Fully Simulated Networks

Network	AUC	TPR	FPR	Precision
Simulation 1 (Arxiv HEP-PH)	0.991	0.889	0.011	0.904
Simulation 2 (DBLP)	0.997	0.994	0.064	0.993
Simulation 3 (Yelp)	0.993	0.917	0.007	0.937

Semi Simulated Networks

Network	AUC	TPR	FPR	Precision
Academia	0.999	0.998	0.000	0.997
Anybeat	1.000	0.996	0.001	0.996
Arxiv HEP-PH	0.997	0.953	0.004	0.965
DBLP	0.997	0.940	0.005	0.995
Flixster	0.992	0.990	0.092	0.990
Google+	1.000	0.999	0.000	0.999
Xing	0.999	0.955	0.005	0.951
Yelp	0.996	0.941	0.005	0.958

Real World Networks

Kids Friendship Network

AUC - 0.93

TPR - 0.91

FPR - 0.15

Twitter

Information gain

https://github.com/Kagandi/anomalous-vertices-detection

Publications

Michael Fire, Dima Kagan, Aviad Elishar, and Yuval Elovici, “Social Privacy Protector - Protecting Users’ Privacy in Social Networks”, The Second International Conference on Social Eco-Informatics (SOTICS), Venice, Italy, October, 2012 (Acceptance Rate: 28%).
Dima Kagan, Michael Fire, Aviad Elishar, and Yuval Elovici, “Facebook Applications’ Installation and Removal: A Temporal Analysis”, The Third International Conference on Social Eco-Informatics (SOTICS), Lisbon, Portugal, October, 2013 (Acceptance Rate: 29%).
Michael Fire, Dima Kagan, Aviad Elishar, and Yuval Elovici, “Friends or Foe? Fake Profile Identification in Online Social Networks”, Journal of Social Network Analysis and Mining (SNAM), Volume 4, 2014.

Michael Fire, Dima Kagan, Aviad Elishar, and Yuval Elovici, “Fake profile identification: Making social networks safer (Poster)”, WRF Perfect Pitch Session, 2016 (winner of the 2016 Best Commercialization/Translation Potential prize).
Dima Kagan, Michael Fire, and Yuval Elovici, "Finding a needle in a haystack: detecting outliers in complex networks", NetSci-X, January 2017.
Based on the presented study, we submitted the following patent application: Michael Fire, Dima Kagan, Aviad Elishar, and Yuval Elovici, “Method for Protecting User Privacy in Social Networks” (pending patent registration no. 13/688,276).


Questions?
