Network anomaly detection
Workflow
- Gathering data
- Cleaning data
- Detection
  - Unsupervised
  - Supervised
- Evaluation
Data
- SNMP
- Fortinet API
- Packet captures
- Synthetic data
Example 1
Anomaly Detection in Network Traffic Using Unsupervised Machine Learning Approach. Vikram A, Mohana. Published online June 1, 2020. doi:https://doi.org/10.1109/icces48766.2020.9137987
Example 1
Data
- NSL-KDD
- 41 features
- https://www.kaggle.com/datasets/hassan06/nslkdd
Anomalies
- DoS - denial of service
- R2L - remote to local
- Probe - scanning for open ports
- U2R - user to root
Classifiers
- Isolation Forest
- One-Class SVM
Example 1
80/20 train/test split
PCA used for dimensionality reduction
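A minimal scikit-learn sketch of this preprocessing; the component count, random data, and labels are placeholders, not values from the paper:

```python
# Sketch of the preprocessing: 80/20 train/test split followed by PCA.
# X and y stand in for the 41 NSL-KDD features and labels.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 41))        # stand-in for NSL-KDD features
y = rng.integers(0, 2, size=1000)      # stand-in labels (0 normal, 1 attack)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Scale first; PCA is sensitive to feature scale.
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=10).fit(scaler.transform(X_train))  # 10 is an assumed choice
X_train_red = pca.transform(scaler.transform(X_train))
X_test_red = pca.transform(scaler.transform(X_test))
```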

Isolation forest
- Unsupervised, decision-tree based
- Consists of multiple isolation trees
- Data is split recursively until each partition contains one data point
- Each split uses a feature chosen at random
- Split value drawn uniformly between that feature's min and max
- Idea: outliers require fewer splits to separate, so their path from the root is short
Anomaly score
- s(x, n) = 2^(−E(h(x)) / c(n))
- E(h(x)) - average path length to isolate x across trees
- c(n) - normalization factor: the average path length of an unsuccessful search in a binary search tree with n points
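A minimal Isolation Forest sketch with scikit-learn; the data and parameter values are illustrative, not the paper's setup:

```python
# decision_function is derived from the anomaly score s(x, n);
# predict returns -1 for points the forest considers outliers.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(500, 2))
outliers = rng.uniform(-6, 6, size=(10, 2))
X = np.vstack([normal, outliers])

forest = IsolationForest(n_estimators=100, contamination=0.02, random_state=0)
labels = forest.fit_predict(X)         # -1 = anomaly, 1 = normal
scores = forest.decision_function(X)   # lower = more anomalous
print("flagged:", int((labels == -1).sum()))
```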
One-class SVM
SVMs are in general supervised: they learn a maximum-margin boundary between two labeled classes.

One-class SVM
In a one-class SVM there is only one class, so the origin acts as a "fake" second class: the model separates the training data from the origin in feature space.
Parameters: kernel, ν (upper bound on the fraction of training points treated as outliers)
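A one-class SVM sketch with scikit-learn; the nu and gamma values are illustrative, not taken from the paper:

```python
# Train on (mostly) normal traffic only. nu upper-bounds the fraction
# of training points treated as outliers; the RBF kernel's gamma
# controls boundary smoothness.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, size=(500, 2))        # assumed "normal" traffic
X_new = np.vstack([rng.normal(0, 1, size=(5, 2)),
                   rng.uniform(5, 8, size=(5, 2))])

ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)
print(ocsvm.predict(X_new))   # -1 = anomaly, 1 = normal
```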

Example 2
Combining Unsupervised Approaches for Near Real-Time Network Traffic Anomaly Detection. Carrera F, Dentamaro V, Galantucci S, Iannacone A, Impedovo D, Pirlo G. Applied Sciences. 2022;12(3):1759. doi:https://doi.org/10.3390/app12031759
Example 2
Data
- KDD99, NSL-KDD, CIC-IDS2017
- https://www.kaggle.com/datasets/hassan06/nslkdd
Anomalies
- Zero-day attacks
Classifiers
- Deep Autoencoding Gaussian Mixture Model with Extended Isolation Forest (DAGMM-EIF)
- Deep Autoencoder with Extended Isolation Forest (DA-EIF)
- Memory-augmented Deep Autoencoder with Extended Isolation Forest (MemAE-EIF)
- .....
- Common idea: an autoencoder captures the important features in a lower-dimensional representation

Autoencoder
- Encoder compresses input x into a low-dimensional representation; decoder reconstructs it as x'
- Decoding yields some reconstruction error E(x, x')
- If the reconstruction error exceeds a threshold, the point is flagged as an anomaly
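A sketch of this idea using an MLP trained to reproduce its input; the layer sizes and the 95th-percentile threshold are assumptions for illustration, not the paper's architecture:

```python
# Autoencoder-style anomaly detection: fit X -> X, then flag points
# whose reconstruction error exceeds a threshold.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, size=(1000, 8))   # normal data only
X_test = np.vstack([rng.normal(0, 1, size=(20, 8)),
                    rng.normal(5, 1, size=(5, 8))])  # last 5 are anomalies

ae = MLPRegressor(hidden_layer_sizes=(4, 2, 4), max_iter=2000,
                  random_state=0).fit(X_train, X_train)

def recon_error(model, X):
    return np.mean((X - model.predict(X)) ** 2, axis=1)  # E(x, x')

threshold = np.percentile(recon_error(ae, X_train), 95)  # assumed cutoff
print(recon_error(ae, X_test) > threshold)   # True = flagged as anomaly
```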
Gaussian Mixture Model
- Clustering ML method
- Models the data as a combination of multiple Gaussians
- In this approach the GMM input is the learned features plus the reconstruction error
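A simplified DAGMM-style sketch: fit a Gaussian mixture on the low-dimensional features concatenated with the reconstruction error and treat low log-likelihood as anomalous. The 2-component mixture, the 5% threshold, and the stand-in inputs are all assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
latent = rng.normal(0, 1, size=(1000, 2))         # stand-in encoder output
recon_err = rng.gamma(2.0, 0.1, size=(1000, 1))   # stand-in E(x, x')
Z = np.hstack([latent, recon_err])

gmm = GaussianMixture(n_components=2, random_state=0).fit(Z)
log_lik = gmm.score_samples(Z)
threshold = np.percentile(log_lik, 5)
anomalies = log_lik < threshold       # lowest-likelihood 5% flagged
print(int(anomalies.sum()))
```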

Memory-Augmented Deep Autoencoder
- Issue: an autoencoder can generalize and reconstruct anomalies well
- A memory module stores prototypical normal patterns during training
- At reconstruction time inputs are rebuilt from these stored normal patterns, so anomalies reconstruct poorly

Extended Isolation Forest

Isolation Forest, but split hyperplanes are given a random slope instead of being axis-parallel
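A tiny numpy sketch of the difference between the two split types (illustrative only, not the paper's implementation):

```python
# A standard isolation-tree split thresholds one feature (axis-parallel
# cut); an extended split uses a random hyperplane (random slope).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))

# Standard Isolation Forest split: random feature, random value.
f = rng.integers(X.shape[1])
v = rng.uniform(X[:, f].min(), X[:, f].max())
left_standard = X[:, f] < v

# Extended Isolation Forest split: random normal vector n and random
# intercept point p; split by which side of the hyperplane x lies on.
n = rng.normal(size=X.shape[1])
p = rng.uniform(X.min(axis=0), X.max(axis=0))
left_extended = (X - p) @ n < 0
print(left_standard.sum(), left_extended.sum())
```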


Algorithms tested
- Extended Isolation Forest (EIF)
- Ensemble Consensus (EC)
- Deep Autoencoding Gaussian Mixture Model (DAGMM)
- Deep Autoencoding Gaussian Mixture Model with Extended Isolation Forest (DAGMM-EIF)
- Deep Autoencoder (DA)
- Deep Autoencoder - Extended Isolation Forest (DA-EIF)
- Deep Structured Energy Based Models (DSEBM)
- Memory-Augmented Deep Autoencoder (MemAE)
- Memory-Augmented Deep Autoencoder with Extended Isolation Forest (MemAE-EIF)
Evaluation metrics
- Precision
- Recall
- F1-Score
- Accuracy
- AUC for ROC
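All of these can be computed with scikit-learn; the labels, predictions, and scores below are toy placeholders:

```python
# y_pred holds hard class predictions; y_score holds the model's
# anomaly score/probability (needed for ROC AUC).
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             accuracy_score, roc_auc_score)

y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 0, 0]
y_score = [0.1, 0.2, 0.7, 0.9, 0.8, 0.3, 0.4, 0.1]

print("precision", precision_score(y_true, y_pred))
print("recall   ", recall_score(y_true, y_pred))
print("f1       ", f1_score(y_true, y_pred))
print("accuracy ", accuracy_score(y_true, y_pred))
print("roc auc  ", roc_auc_score(y_true, y_score))
```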


Receiver Operating Characteristic (ROC)
- Plots true-positive rate against false-positive rate at every classification threshold
- Visual representation of model performance across all thresholds
- Higher area under the curve (AUC) means a better model
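A short sketch of computing the curve and its AUC, again on toy labels and scores:

```python
# roc_curve sweeps every threshold over y_score; auc integrates it.
from sklearn.metrics import roc_curve, auc

y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_score = [0.1, 0.2, 0.7, 0.9, 0.8, 0.3, 0.4, 0.1]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUC:", auc(fpr, tpr))   # 1.0 = perfect, 0.5 = random guessing
```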

Results
The results show that the MemAE-EIF algorithm achieves the best accuracy and F1-score on all the datasets examined. A high precision rate means a low number of false positives, i.e., fewer false alarms for experts in the field to handle.
Results
SHAP values (per-feature contributions to the model's predictions) for the EIF model of the MemAE-EIF algorithm (KDDCUP99 dataset)

Example 3
Exploiting SNMP-MIB Data to Detect Network Anomalies using Machine Learning Techniques. Al-Naymat G, Al-Kasassbeh M, Al-Hawari E. arXiv.org. Published 2018. Accessed October 21, 2024. https://arxiv.org/abs/1809.02611
Example 3
Data
- SNMP-MIB dataset
Al-Kasassbeh M, Al-Naymat G, Al-Hawari E. Towards Generating Realistic SNMP-MIB Dataset for Network Anomaly Detection. International Journal of Computer Science and Information Security. 2016;14(9):1162.
Anomalies
- DoS (TCP-SYN flooding, UDP flooding, ICMP-ECHO flooding, HTTP flood, Slowloris, Slowpost)
- Brute force attack
Classifiers
- AdaboostM1 with J48
- Random Forest
- MLP
AdaboostM1 with J48
J48 (Weka's implementation of the C4.5 algorithm)
- Decision tree algorithm
- Chooses the split that maximizes normalized information gain (gain ratio)
AdaBoost
- combine multiple weak learners (decision trees)
- Each subsequent learner focuses on the samples the previous ones misclassified
- prediction = weighted combination of all learners
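A boosted-trees sketch with scikit-learn (requires sklearn ≥ 1.2 for the `estimator` keyword). scikit-learn has no J48/C4.5, so the depth-limited CART tree below is a stand-in weak learner, not the paper's exact AdaboostM1+J48 setup:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)

# Shallow trees as weak learners; AdaBoost reweights misclassified
# samples and combines the trees with learner-specific weights.
clf = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=2),
    n_estimators=50, random_state=0).fit(X, y)
print(clf.score(X, y))
```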
Random Forest™
- Ensemble learning method built from decision trees
- Bootstrap aggregation (bagging): each tree is trained on a random sample drawn with replacement
- Random subset of features considered at each split
- Prediction by majority vote
MLP - Multilayer Perceptron
- feedforward neural network
- fully connected
- nonlinear activation function

Data
- 4998 records
- 34 MIB variables
- Only the 8 variables from the interface group are used (table below)
- 70/30 train/test split
| Var | Name |
|---|---|
| 1 | ifInOctets |
| 2 | ifOutOctets |
| 3 | ifOutDiscards |
| 4 | ifInUcastPkts |
| 5 | ifInNUcastPkts |
| 6 | ifInDiscards |
| 7 | ifOutUcastPkts |
| 8 | ifOutNUcastPkts |
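A sketch of this setup, comparing all three classifiers on the 8 interface-group variables with a 70/30 split. It assumes a hypothetical CSV export of the dataset; the file name `snmp_mib.csv` and the `label` column are assumptions, not the dataset's actual layout:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier

iface_vars = ["ifInOctets", "ifOutOctets", "ifOutDiscards", "ifInUcastPkts",
              "ifInNUcastPkts", "ifInDiscards", "ifOutUcastPkts",
              "ifOutNUcastPkts"]

df = pd.read_csv("snmp_mib.csv")        # hypothetical file name
X, y = df[iface_vars], df["label"]      # "label" column is assumed
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# AdaBoost here uses scikit-learn defaults, a stand-in for AdaboostM1+J48.
for name, clf in [("Random Forest", RandomForestClassifier(random_state=0)),
                  ("AdaBoost", AdaBoostClassifier(random_state=0)),
                  ("MLP", MLPClassifier(max_iter=1000, random_state=0))]:
    clf.fit(X_tr, y_tr)
    print(name, clf.score(X_te, y_te))
```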
Results
(results figure omitted)
Example 4
Detecting Network Anomalies Using Machine Learning and SNMP-MIB Dataset with IP Group. Manna A, Alkasassbeh M. arXiv.org. Published 2019. Accessed October 22, 2024. https://arxiv.org/abs/1906.00863
Example 4
Data:
- same as previous paper
- https://www.kaggle.com/datasets/malkasasbeh/network-anomaly-detection-dataset
Anomalies:
- icmp-echo, tcp-syn, udp-flood, httpFlood, slowloris, slowpost, bruteforce
Classifiers
- Random Forest
- Decision tree
- REP Tree
REP Tree
- Fast decision-tree learner: builds the tree on training data, then prunes it
- REP - reduced-error pruning
- Starting from the bottom of the tree, evaluate whether each subtree can be replaced with a leaf
- Keep the replacement if accuracy on a held-out pruning set does not drop

Data
| Variable Name | Variable Description |
|---|---|
| ipInReceives | The total number of input datagrams that are received from the interfaces, including those received in error. |
| ipInDelivers | The total number of input datagrams that are delivered to the IP user protocols successfully (including ICMP). |
| ipOutRequests | The total number of IP datagrams supplied to IP in requests for transmission, not including ipForwDatagrams. |
| ipOutDiscards | The number of output datagrams that had no errors preventing their transmission but were discarded anyway (e.g., for lack of buffer space). |
| ipInDiscards | The number of input datagrams that had no errors preventing their continued processing but were discarded anyway (e.g., for lack of buffer space). |
| ipForwDatagrams | The number of input datagrams for which this entity was not their final destination and which it therefore attempted to forward. |
| ipOutNoRoutes | The number of datagrams discarded because no route could be found to transmit them to their destination. |
| ipInAddrErrors | The number of input datagrams discarded because the IP address in their destination field was not valid. |
Results
(results figure omitted)
Similar results were obtained with only 5 and 3 variables, selected using ReliefFAttributeEval and InfoGainAttributeEval
Example 5
Evaluation of Machine Learning Algorithms for Anomaly Detection. Elmrabit N, Zhou F, Li F, Zhou H. Published online June 1, 2020. doi:https://doi.org/10.1109/cybersecurity49315.2020.9138871
Example 5
Data:
- UNSW-NB15
- CICIDS2017
- ICS cyber-attack dataset
Anomalies:
- DoS, port scanning, SQL injection, brute force, worms, and other attacks present in the datasets
Algorithms
- Logistic regression
- Gaussian Naive Bayes
- K-nearest neighbors
- Decision tree
- Adaptive boosting
- Random forest
- Convolutional neural network
- Convolutional neural network combined with long short-term memory
- Long short-term memory
- Gated recurrent units
- Simple recurrent neural network
- Deep neural network
Results
(results figures omitted)
Network anomaly detection
By Sasa Trivic