Network anomaly detection

Workflow

  • Gathering data
  • Cleaning data
  • Detection
    • Unsupervised
    • Supervised
  • Evaluation

Data

  • SNMP
  • Fortinet API
  • Packet captures
  • Synthetic data

Example 1

Anomaly Detection in Network Traffic Using Unsupervised Machine Learning Approach. Vikram A, Mohana. Published online June 1, 2020. doi:https://doi.org/10.1109/icces48766.2020.9137987

Example 1

Anomalies

  • DoS
  • R2L - Remote to Local
  • Probe - scanning for open ports
  • U2R - User to Root

Classifiers

  • Isolation Forest
  • One-Class SVM

Example 1

80/20 data split

PCA used for dimensionality reduction
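
A minimal sketch of this preprocessing step with scikit-learn. The 80/20 split ratio is from the paper; the synthetic data and the 95% explained-variance target are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data; the paper uses network traffic features
X, y = make_classification(n_samples=1000, n_features=40, random_state=0)

# 80/20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Scale first so no single feature dominates the principal components
scaler = StandardScaler().fit(X_train)

# Keep enough components to explain 95% of the variance (assumed target)
pca = PCA(n_components=0.95).fit(scaler.transform(X_train))
X_train_red = pca.transform(scaler.transform(X_train))
X_test_red = pca.transform(scaler.transform(X_test))
```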

Example 1

Isolation forest

  • Unsupervised, decision-tree-based method
  • Consists of multiple isolation trees
  • Splits the data until each partition contains a single point
  • Random split on a random feature
    • Feature chosen at random
    • Split point chosen uniformly between the feature's min and max
  • The idea: outliers require fewer splits to isolate (see the sketch below)
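
A short scikit-learn sketch of the idea. The toy data and the contamination rate are assumptions, not the paper's setup:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(500, 2))      # dense cluster of normal points
outliers = rng.uniform(-6, 6, size=(10, 2))   # scattered anomalies
X = np.vstack([normal, outliers])

# contamination = expected fraction of anomalies in the data (assumed)
clf = IsolationForest(n_estimators=100, contamination=0.02, random_state=0)
clf.fit(X)

labels = clf.predict(X)        # +1 = normal, -1 = anomaly
scores = clf.score_samples(X)  # lower = easier to isolate = more anomalous
```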

Anomaly score

s(x, n) = 2^{-\frac{E(h(x))}{c(n)}}
  • E(h(x)) - path length needed to isolate x, averaged across the trees
  • c(n) - normalization factor: the average path length of an unsuccessful search in a binary search tree built from n points
  • s close to 1 indicates an anomaly; s around 0.5 or lower indicates a normal point
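
A small numeric sketch of the score. The path lengths are made-up inputs; c(n) uses the harmonic-number approximation from the original Isolation Forest paper:

```python
import numpy as np

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def c(n):
    """Average path length of an unsuccessful BST search over n points."""
    harmonic = np.log(n - 1) + EULER_GAMMA  # harmonic number approximation
    return 2 * harmonic - 2 * (n - 1) / n

def anomaly_score(avg_path_length, n):
    """s(x, n) = 2^(-E(h(x)) / c(n))"""
    return 2 ** (-avg_path_length / c(n))

# c(256) is roughly 10.2
print(anomaly_score(4.0, 256))   # short path -> ~0.76, leaning anomalous
print(anomaly_score(12.0, 256))  # long path  -> ~0.44, leaning normal
```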

One-class SVM

In general, SVM is a supervised method: it learns a maximum-margin boundary between two labeled classes.

One-class SVM

In one-class SVM there is only one class, so the origin is used as a "fake" second class: the model learns a boundary that separates the training data from the origin in feature space.

Parameters: kernel, ν (an upper bound on the fraction of training points treated as outliers); see the sketch below.
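
A minimal scikit-learn sketch; the data and the ν value are illustrative:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, size=(500, 2))  # assumed mostly-normal traffic
X_test = np.vstack([rng.normal(0, 1, size=(20, 2)),
                    rng.uniform(-6, 6, size=(5, 2))])

# nu: upper bound on the fraction of training points treated as outliers
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)
print(ocsvm.predict(X_test))  # +1 = inlier, -1 = outlier
```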

Example 2

Combining Unsupervised Approaches for Near Real-Time Network Traffic Anomaly Detection. Carrera F, Dentamaro V, Galantucci S, Iannacone A, Impedovo D, Pirlo G. Applied Sciences. 2022;12(3):1759. doi:https://doi.org/10.3390/app12031759

Example 2

Data

Anomalies

  • Zero-day attacks

Classifiers

  • Deep Autoencoding Gaussian Mixture Model with Extended Isolation Forest (DAGMM-EIF)
  • Deep Autoencoder with Extended Isolation Forest (DA-EIF)
  • Memory-augmented Deep Autoencoder with Extended Isolation Forest (MemAE-EIF)
  • .....
  • Common idea: capture the important features in a lower-dimensional representation

Autoencoder

  • When decoding, we end up with some reconstruction error E(x, x')
  • If the reconstruction error exceeds a threshold, the point is flagged as an anomaly (see the sketch below)
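
A compact PyTorch sketch of this thresholding scheme. The architecture, data, and the 3-sigma threshold are illustrative assumptions; the papers use deeper models:

```python
import torch
import torch.nn as nn

X = torch.randn(1000, 30)  # stand-in for normalized traffic features

model = nn.Sequential(
    nn.Linear(30, 8), nn.ReLU(),  # encoder: compress to 8 dimensions
    nn.Linear(8, 30),             # decoder: reconstruct the input
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for _ in range(200):  # train on (assumed) normal traffic only
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X), X)
    loss.backward()
    opt.step()

with torch.no_grad():
    err = ((model(X) - X) ** 2).mean(dim=1)  # E(x, x') per sample

threshold = err.mean() + 3 * err.std()  # one common heuristic (assumed)
anomalies = err > threshold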

Gaussian Mixture Model

  • clustering method: models the data distribution as a mixture of multiple Gaussians
  • in our case the input is the autoencoder's latent features plus the reconstruction error
  • points with low likelihood under the fitted mixture are treated as anomalies (see the sketch below)
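
A scikit-learn sketch of this step, assuming the latent features and reconstruction errors already exist (random stand-ins here; the component count and the 2% cutoff are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

Z = np.random.randn(1000, 8)             # stand-in latent features
err = np.abs(np.random.randn(1000, 1))   # stand-in reconstruction errors
X_gmm = np.hstack([Z, err])              # features + reconstruction error

gmm = GaussianMixture(n_components=4, random_state=0).fit(X_gmm)
log_lik = gmm.score_samples(X_gmm)       # per-sample log-likelihood

# Flag the least likely samples under the mixture as anomalies
anomalies = log_lik < np.percentile(log_lik, 2)
```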

Memory-Augmented Deep Autoencoder

  • Issue: an autoencoder can generalize well enough to also reconstruct anomalies
  • Fix: a memory module stores prototypical normal items during training
  • During reconstruction the decoder draws only on these normal items, so anomalies reconstruct poorly and stand out (see the sketch below)
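
A toy numpy sketch of the memory lookup idea. MemAE actually uses cosine-similarity attention with a sparsity step; plain dot-product attention here is a simplification, and the sizes are made up:

```python
import numpy as np

memory = np.random.randn(50, 8)  # rows: learned prototypes of normal latent codes

def memory_lookup(z):
    """Rebuild latent code z from memory items only, so the decoder
    can reconstruct normal patterns but not novel (anomalous) ones."""
    sim = memory @ z             # similarity to each memory slot
    w = np.exp(sim - sim.max())
    w /= w.sum()                 # softmax attention over memory slots
    return w @ memory            # convex combination of normal prototypes

z_hat = memory_lookup(np.random.randn(8))
```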

Extended Isolation Forest

Like Isolation Forest, but each split uses a random hyperplane (a random slope) rather than an axis-parallel cut on a single feature.
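
A small numpy sketch contrasting the two split rules, following the Extended Isolation Forest idea of random hyperplanes; the function and variable names are mine:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))

def axis_parallel_split(X):
    """Standard Isolation Forest: random feature, random threshold."""
    f = rng.integers(X.shape[1])
    t = rng.uniform(X[:, f].min(), X[:, f].max())
    return X[:, f] < t

def sloped_split(X):
    """Extended Isolation Forest: random hyperplane through a random point."""
    slope = rng.normal(size=X.shape[1])              # random direction
    point = rng.uniform(X.min(axis=0), X.max(axis=0))
    return (X - point) @ slope < 0

left_mask = sloped_split(X)  # which side of the hyperplane each sample falls on
```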

Algorithms tested

  • Extended Isolation Forest (EIF)
  • Ensemble Consensus (EC)
  • Deep Autoencoding Gaussian Mixture Model (DAGMM)
  • Deep Autoencoding Gaussian Mixture Model with Extended Isolation Forest (DAGMM-EIF)
  • Deep Autoencoder (DA)
  • Deep Autoencoder with Extended Isolation Forest (DA-EIF)
  • Deep Structured Energy-Based Models (DSEBM)
  • Memory-Augmented Deep Autoencoder (MemAE)
  • Memory-Augmented Deep Autoencoder with Extended Isolation Forest (MemAE-EIF)

Evaluation metrics

  • Precision
  • Recall
  • F1-Score
  • Accuracy
  • AUC for ROC
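
How these metrics are typically computed with scikit-learn; the labels and scores below are toy values for illustration:

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_true = [0, 0, 1, 1, 0, 1]               # 1 = anomaly
y_pred = [0, 1, 1, 1, 0, 0]               # hard predictions
y_score = [0.1, 0.6, 0.8, 0.9, 0.3, 0.4]  # continuous anomaly scores

print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1-score: ", f1_score(y_true, y_pred))
print("Accuracy: ", accuracy_score(y_true, y_pred))
print("ROC AUC:  ", roc_auc_score(y_true, y_score))  # uses scores, not labels
```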

Receiver-operating characteristic

  • visual representation of model performance (true-positive rate vs. false-positive rate) across all decision thresholds
  • a higher area under the curve (AUC) means a better model

Results

The results show that the MemAE-EIF algorithm achieves the best accuracy and F1-score on all of the datasets examined. High precision corresponds to a low number of false positives, i.e., false alarms that experts in the field must handle.

Results

SHAP values for the EIF model of the MemAE-EIF algorithm (KDDCUP99 dataset)

Example 3

Exploiting SNMP-MIB Data to Detect Network Anomalies Using Machine Learning Techniques. Al-Naymat G, Al-Kasassbeh M, Al-Hawari E. arXiv.org. Published 2018. Accessed October 21, 2024. https://arxiv.org/abs/1809.02611

Example 3

Data

  • SNMP-MIB dataset

Towards Generating Realistic SNMP-MIB Dataset for Network Anomaly Detection. Al-Kasassbeh M, Al-Naymat G, Al-Hawari E. International Journal of Computer Science and Information Security. 2016;14(9):1162.

Anomalies

  • DoS (TCP-SYN flooding, UDP flooding, ICMP-ECHO flooding, HTTP flood, Slowloris, Slowpost)
  • Brute force attack

Classifiers

  • AdaboostM1 with J48
  • Random Forest
  • MLP

AdaboostM1 with J48

J48 (the Weka implementation of C4.5)

  • decision tree algorithm
  • maximizes information gain (normalized to the gain ratio in C4.5) at each split (see the sketch below)
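
A tiny numpy sketch of the information-gain criterion on toy labels; C4.5 additionally divides this by the split's intrinsic information to get the gain ratio:

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(labels, left_mask):
    """Entropy reduction from splitting `labels` by a boolean mask."""
    n = len(labels)
    left, right = labels[left_mask], labels[~left_mask]
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - weighted

y = np.array([0, 0, 0, 1, 1, 1])
split = np.array([True, True, True, False, False, False])
print(information_gain(y, split))  # perfect split -> 1.0 bit
```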

AdaBoost

  • combines multiple weak learners (shallow decision trees)
  • each subsequent learner focuses on the samples the previous ones misclassified
  • the final prediction is a weighted combination of all learners (see the sketch below)
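
A scikit-learn sketch. sklearn's CART tree stands in for J48/C4.5, which sklearn does not provide, and the `estimator=` keyword assumes scikit-learn 1.2 or newer:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Shallow trees as weak learners; each round reweights misclassified samples
clf = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=3),
                         n_estimators=50, random_state=0).fit(X, y)
print(clf.score(X, y))
```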

Random Forest

  • ensemble learning method built from decision trees
  • bootstrap aggregation: each tree trains on a random sample of the data
  • random feature selection at each split
  • prediction by majority vote (see the sketch below)
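
The same bullets expressed in scikit-learn; the data and parameter values are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# bootstrap=True -> bagging; max_features="sqrt" -> random feature subsets;
# predictions are a majority vote over all trees
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            bootstrap=True, random_state=0).fit(X, y)
print(rf.score(X, y))
```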

MLP - Multilayer Perceptron

  • feedforward neural network
  • fully connected
  • nonlinear activation function
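
A minimal scikit-learn sketch; the layer sizes and data are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Fully connected feedforward network with nonlinear (ReLU) activations
mlp = MLPClassifier(hidden_layer_sizes=(64, 32), activation="relu",
                    max_iter=500, random_state=0).fit(X, y)
print(mlp.score(X, y))
```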

Data

  • 4998 records
  • 34 MIB variables
  • But only 8 variables from the interface group are used
  • 70/30 split

Var  Name
1    ifInOctets
2    ifOutOctets
3    ifOutDiscards
4    ifInUcastPkts
5    ifInNUcastPkts
6    ifInDiscards
7    ifOutUcastPkts
8    ifOutNUcastPkts

Results

Example 4

Detecting Network Anomalies Using Machine Learning and SNMP-MIB Dataset with IP Group. Manna A, Alkasassbeh M. arXiv.org. Published 2019. Accessed October 22, 2024. https://arxiv.org/abs/1906.00863

Example 4

Data:

  • same as previous paper
  • https://www.kaggle.com/datasets/malkasasbeh/network-anomaly-detection-dataset

Anomalies:

  • ICMP-ECHO flooding, TCP-SYN flooding, UDP flooding, HTTP flood, Slowloris, Slowpost, brute force

Classifiers

  • Random Forest
  • Decision tree
  • REP Tree

REP Tree

  • decision tree learner; the tree is grown on the training data to maximize accuracy
  • REP - reduced error pruning
    • start from the bottom of the tree and evaluate whether each subtree can be replaced with a leaf
    • keep the replacement if accuracy on a held-out pruning set does not drop (see the sketch below)
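
A from-scratch sketch of the pruning pass (not Weka's REPTree; the tree structure and the precomputed `majority` label argument are simplifications for illustration):

```python
import numpy as np

class Node:
    def __init__(self, feature=None, thresh=None, left=None, right=None, label=None):
        self.feature, self.thresh = feature, thresh
        self.left, self.right, self.label = left, right, label

def predict(node, x):
    if node.label is not None:  # leaf
        return node.label
    child = node.left if x[node.feature] < node.thresh else node.right
    return predict(child, x)

def accuracy(root, X, y):
    return np.mean([predict(root, x) == t for x, t in zip(X, y)])

def rep_prune(node, root, X_val, y_val, majority):
    """Bottom-up reduced-error pruning: collapse a subtree into a leaf
    whenever that does not hurt accuracy on the held-out pruning set."""
    if node.label is not None:
        return
    rep_prune(node.left, root, X_val, y_val, majority)
    rep_prune(node.right, root, X_val, y_val, majority)
    before = accuracy(root, X_val, y_val)
    saved = (node.feature, node.thresh, node.left, node.right)
    node.feature = node.thresh = node.left = node.right = None
    node.label = majority  # tentatively make this node a leaf
    if accuracy(root, X_val, y_val) < before:
        node.feature, node.thresh, node.left, node.right = saved
        node.label = None  # pruning hurt -> restore the subtree
```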

Data

Variable         Description
ipInReceives     Total input datagrams received from the interfaces, including those received in error.
ipInDelivers     Total input datagrams successfully delivered to the IP user protocols (including ICMP).
ipOutRequests    Total IP datagrams supplied to IP in requests for transmission, not including ipForwDatagrams.
ipOutDiscards    Output datagrams with no errors preventing transmission that were nevertheless discarded (e.g., for lack of buffer space).
ipInDiscards     Input datagrams with no errors preventing their continued processing that were nevertheless discarded (e.g., for lack of buffer space).
ipForwDatagrams  Input datagrams for which this entity was not the final destination, so an attempt was made to forward them.
ipOutNoRoutes    Datagrams discarded because no route could be found to transmit them to their destination.
ipInAddrErrors   Input datagrams discarded because the IP address in their destination field was not valid.

Results

Similar results were obtained using only 5 or 3 variables selected by feature selection (ReliefFAttributeEval, InfoGainAttributeEval)

Example 5

Evaluation of Machine Learning Algorithms for Anomaly Detection. Elmrabit N, Zhou F, Li F, Zhou H. Zenodo (CERN European Organization for Nuclear Research). Published online June 1, 2020. doi:https://doi.org/10.1109/cybersecurity49315.2020.9138871

Example 5

Data:

  • UNSW-NB15
  • CICIDS2017
  • ICS cyber-attack dataset

Anomalies:

  • DoS, port scanning, SQL injection, brute force, worms, and other attacks present in these datasets

Algorithms

  • Logistic regression
  • Gaussian Naive Bayes
  • K-nearest neighbors
  • Decision tree
  • Adaptive boosting
  • Random forest
  • Convolutional neural network
  • Convolutional neural network combined with long short-term memory
  • Long short-term memory
  • Gated recurrent units
  • Simple recurrent neural network
  • Deep neural network
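
A sketch of how the classical subset of these algorithms could be compared on one dataset with scikit-learn. The deep models are omitted, and the data and metric choice are illustrative, not the paper's protocol:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Gaussian Naive Bayes": GaussianNB(),
    "K-nearest neighbors": KNeighborsClassifier(),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Adaptive boosting": AdaBoostClassifier(random_state=0),
    "Random forest": RandomForestClassifier(random_state=0),
}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    print(f"{name}: F1 = {f1_score(y_te, pred):.3f}")
```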

Results

Network anomaly detection

By Sasa Trivic
