Identify and block malicious network traffic with Aeternus Argus

Discover XGBoost, the state-of-the-art model for advanced classification training, used across many fields for general-purpose prediction.

Scroll down for more information.

About Me

Dev recruitment

I'm currently recruiting developers to collaborate on exciting projects. If you specialize in any of the following fields, please email me at [email protected]. I'm especially looking for UI/UX design, Minecraft mod development, CS2 external/DMA development, malware reverse engineering, penetration testing, botnet development, or native development.

Personal Bio

Hi, I'm Neil Huang, a software engineer and data scientist. I'm passionate about building software and machine learning models that help people solve real-world problems. I'm currently working on a project called Aeternus Argus, a network security tool that helps you identify and block malicious network traffic. I'm also a big fan of open-source software, and I love to contribute to the community whenever I can. If you have any questions or just want to chat, feel free to reach out to me on GitHub or LinkedIn!

Developing Experience

I have been working as a full-stack developer for over 8 years, gaining extensive experience in various programming languages including Python, C++, Java, JavaScript, TypeScript, Go, Kotlin, C, and even Scratch. My journey in software development started in lower school, when I was 6 years old. I am currently ranked at the Platinum level in the USA Computing Olympiad (USACO); I find these algorithm contests genuinely fun, and they keep my problem-solving sharp. My passion for new algorithms drives me to continuously learn and innovate in the field. My expertise spans both frontend and backend development, enabling me to build comprehensive and efficient software solutions. You can explore some of my projects on my GitHub profile, where I regularly contribute to open-source communities. In addition to my development work, I am a reverse engineer and AI enthusiast: I specialize in cybersecurity, bypassing kernel-level anti-cheats, and analyzing malware behavior, which involves a deep dive into system internals and security mechanisms.

Learn More about XGBoost

Discover XGBoost, the state-of-the-art model for advanced classification training. XGBoost builds an ensemble of decision trees to deliver robust and accurate predictions, making it a favorite among data scientists.

With its efficiency and scalability, XGBoost is widely used across various industries including finance, healthcare, and marketing. This powerful model leverages gradient boosting techniques to combine weak learners into a strong predictive engine.

Explore the mechanics behind gradient boosting, decision tree ensembles, and regularization techniques that prevent overfitting. Learn how parallelization and automatic handling of missing data give XGBoost its speed and robustness.

Whether you’re just getting started or are a seasoned practitioner, the insights provided here will help you harness the full potential of XGBoost for your projects.

Introduction to XGBoost

What is XGBoost?
XGBoost stands for “Extreme Gradient Boosting” and is one of the most popular machine learning algorithms. At its core, it teaches computers to make predictions based on data by combining multiple simpler models (decision trees) iteratively.
Why is XGBoost Popular?
XGBoost is renowned for its exceptional performance in tasks like classification and regression. Its speed, efficiency, and consistent performance in machine learning competitions make it a favorite among data scientists.

How XGBoost Works in Detail

The Gradient Boosting Process
To understand XGBoost, it’s essential to know about gradient boosting. This iterative process builds models step-by-step, each time correcting the errors of previous models using the gradient of a loss function.
Decision Trees and Loss Functions
In XGBoost, decision trees are combined with a loss function that measures prediction errors. Each new tree minimizes this loss, leading to increasingly accurate predictions.
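The residual-fitting loop at the heart of gradient boosting can be sketched from scratch with plain scikit-learn regression trees (synthetic data and squared loss assumed; XGBoost layers second-order gradient information and regularization on top of this basic idea):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)

# Start from the mean prediction, then fit each new tree to the residuals
pred = np.full_like(y, y.mean())
trees, lr = [], 0.1
for _ in range(50):
    residuals = y - pred          # negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    pred += lr * tree.predict(X)  # shrink each tree's contribution
    trees.append(tree)

print(np.mean((y - pred) ** 2))   # training MSE shrinks as trees are added
```

Each weak tree corrects what the ensemble so far got wrong, which is exactly the "new tree minimizes the loss" step described above.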
Regularization and Overfitting
Overfitting occurs when a model learns noise from the training data. XGBoost uses regularization techniques to penalize overly complex models, keeping the trees simple and improving generalization.
Parallelization and Speed
XGBoost supports parallel processing and distributed computing, enabling it to scale with large datasets and deliver fast predictions even in real-time applications.
Handling Missing Data
XGBoost automatically learns how to handle missing data during training, reducing the need for extensive data preprocessing.

Applications of XGBoost

XGBoost in Finance
In finance, XGBoost is used for fraud detection, credit scoring, and stock price prediction by analyzing complex datasets and identifying subtle patterns.
XGBoost in Healthcare
Healthcare applications include predicting disease likelihood, optimizing treatment plans, and analyzing patient data, supporting early intervention and personalized care.
XGBoost in Marketing
Marketing teams leverage XGBoost to predict consumer behavior, personalize recommendations, and optimize campaigns by analyzing customer data from browsing and purchase histories.
XGBoost in Sports Analytics
Sports teams use XGBoost to predict game outcomes, evaluate player performance, and analyze strategies, gaining insights that drive competitive advantage.
XGBoost in E-commerce
E-commerce platforms employ XGBoost for demand forecasting, pricing optimization, and personalized product recommendations, enhancing customer experience and revenue.

Similar Works

Explore state-of-the-art research and methods used for VPN detection and network traffic analysis.

Deep Learning Approaches

Recent studies have explored deep learning methodologies for VPN traffic classification. Sun et al. proposed a method that transforms network traffic into images using a concept called Packet Block, which aggregates continuous packets in the same direction. These images are then processed using Convolutional Neural Networks (CNNs) to identify application types, achieving high accuracy on datasets like OpenVPN and ISCX-Tor.


Graph-Based Models

Graph-based models have also been investigated for their efficacy in traffic classification. Xu et al. developed VT-GAT, a model based on Graph Attention Networks (GAT) that constructs traffic behavior graphs from raw data. This approach extracts behavioral features more effectively than traditional methods.


Machine Learning Techniques

Several studies have applied traditional machine learning algorithms to classify VPN traffic. Mohamed and Kurnaz, for example, used Artificial Neural Networks (ANN) to identify VPN flows, while Nigmatullin et al. combined the Differentiation of Sliding Rescaled Ranges (DSRR) approach with Random Forest classifiers, achieving high precision and recall.


Transformer-Based Models

Transformer-based architectures are emerging as powerful tools for encrypted traffic classification. Lin et al. introduced ET-BERT, which pre-trains deep contextualized datagram-level representations from large-scale unlabeled data, setting new benchmarks across multiple classification tasks.


Comparative Analyses

Comparative studies have evaluated various classifiers for VPN traffic. Draper-Gil et al. assessed different algorithms using time-related features to distinguish between VPN and non-VPN traffic, offering insights into the strengths and limitations of each approach.

Research: VPN Classification & Model Training

Background

This project is led and developed by Neil Huang as part of the Polygence Program, with guidance from Maria Konte. It focuses on developing a machine learning model to classify network traffic as either VPN or Non-VPN. Using features extracted from network flow data, the goal is to distinguish between these two types of traffic. The dataset is sourced from the ISCX VPN and Non-VPN 2016 dataset, and the model uses XGBoost with Optuna for hyperparameter optimization.

Classifying VPN traffic is important for traffic management, security, and network performance analysis. This research builds a robust classifier that differentiates between VPN and Non-VPN traffic using data-driven methods.

Data Collection & Exploration

The dataset includes multiple scenarios (A1, A2, and B) representing different VPN and Non-VPN traffic characteristics. The SetupData.py script automates the process of downloading, extracting, and preprocessing the data.

Features Explanation

VPN traffic often shows more regular, predictable patterns with consistent packet and byte transfer rates. Non-VPN traffic tends to be more sporadic, with shorter bursts and irregular idle periods. By analyzing these time-based features, we can effectively differentiate between VPN and Non-VPN traffic.
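The eCDF comparisons shown below are simple to reproduce; here is a minimal sketch with a hypothetical `ecdf` helper (the sample values stand in for a feature such as flowBytesPerSecond):

```python
import numpy as np

def ecdf(values):
    """Return sorted values and their empirical cumulative probabilities."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# Each sorted value is paired with the fraction of observations at or below it
x, y = ecdf([3.0, 1.0, 2.0, 2.0])
print(x, y)
```

Plotting the two classes' eCDF curves on the same axes makes distributional differences between VPN and Non-VPN flows easy to see.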

eCDF Plot for flowBytesPerSecond
Heatmap for flowBytesPerSecond & flowPktsPerSecond
Scatter Plot for flowBytesPerSecond vs. flowPktsPerSecond
Feature Importance Plot

Model Training & Hyperparameter Tuning

Data Preprocessing

The Train.py script implements several preprocessing steps to prepare the dataset, including handling missing data, converting non-numeric columns, and using Stratified K-Fold Cross-Validation for balanced splits.

# Check for missing values
print("Missing values count:\n", df.isnull().sum())

# Drop columns with more than 50% missing data
missing_threshold = 0.5
df = df.dropna(thresh=int((1 - missing_threshold) * len(df)), axis=1)
print(f"Remaining columns after dropping columns with missing data: {df.columns}")
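The Stratified K-Fold splitting mentioned above can be sketched as follows (the fold count of 5 and the toy data are assumptions; Train.py's actual configuration may differ):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data: 10 samples, perfectly balanced classes
X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in kf.split(X, y):
    # Each test fold preserves the overall class ratio
    print(y[test_idx])
```

Stratification matters here because an unbalanced split could leave a fold with almost no VPN (or Non-VPN) samples, distorting the cross-validated accuracy.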

Hyperparameter Optimization with Optuna

The script uses Optuna to automatically search for the best hyperparameters for the XGBoost model. Key parameters like n_estimators, learning_rate, max_depth, and subsample are tuned.

def objective(trial):
    # Hyperparameter search space explored by Optuna
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 200),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.1, log=True),
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'gamma': trial.suggest_float('gamma', 0, 10),
        'min_child_weight': trial.suggest_float('min_child_weight', 1e-3, 10.0),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-3, 10.0),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-3, 10.0),
        'scale_pos_weight': trial.suggest_float('scale_pos_weight', 1e-3, 10.0),
    }
    model = XGBClassifier(**params)
    # Score each trial with Stratified K-Fold cross-validation
    fold_accuracies = []
    for train_index, test_index in kf.split(df[available_features], y):
        X_train = df.iloc[train_index][available_features]
        y_train = y.iloc[train_index]
        X_test = df.iloc[test_index][available_features]
        y_test = y.iloc[test_index]
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        acc = accuracy_score(y_test, y_pred)
        fold_accuracies.append(acc)
    return np.mean(fold_accuracies)

Threshold Optimization

After training, the model's performance is improved by adjusting the classification threshold based on the ROC Curve to balance false positives and false negatives.

# Compute ROC curve and AUC on the held-out test set
fpr, tpr, thresholds = roc_curve(y_test, model.predict_proba(X_test)[:, 1])
roc_auc = auc(fpr, tpr)

# Pick the threshold maximizing Youden's J statistic (TPR - FPR)
optimal_threshold = thresholds[np.argmax(tpr - fpr)]
print(f"Optimal threshold: {optimal_threshold}")

# Apply optimal threshold
y_pred_optimal = (model.predict_proba(X_test)[:, 1] >= optimal_threshold).astype(int)
ROC Curve

Before & After Optimization

Before Optimization

Best trial accuracy: 0.8983698375392972
Best parameters: {'n_estimators': 137, 'learning_rate': 0.056998467188966166, 'max_depth': 10, 'subsample': 0.6885374473966026, 'colsample_bytree': 0.8316744476524894, 'random_state': 28, 'gamma': 6.846816972984089, 'min_child_weight': 1.9011750589572292, 'reg_alpha': 8.788661748447671, 'reg_lambda': 9.17186993955007, 'scale_pos_weight': 1.5367434543465421, 'max_delta_step': 8, 'colsample_bylevel': 0.8977692687002093, 'colsample_bynode': 0.6884480938017858}

After Optimization

This gain of roughly one percentage point in accuracy (0.8984 to 0.9094) is meaningful, indicating that threshold tuning and hyperparameter optimization help the model generalize better.

Best trial accuracy: 0.9093782145477146
Best parameters: {'n_estimators': 153, 'learning_rate': 0.06057883562221319, 'max_depth': 10, 'subsample': 0.7267015030731455, 'colsample_bytree': 0.6631537892477964, 'random_state': 17, 'gamma': 2.368698710253576, 'min_child_weight': 8.637108660931315, 'reg_alpha': 9.630173474548245, 'reg_lambda': 0.14523433378303974, 'scale_pos_weight': 3.7039133272302873, 'max_delta_step': 10, 'colsample_bylevel': 0.6220516152279684, 'colsample_bynode': 0.5067381656563739, 'threshold': 0.6872133792668326}

Results Explanation

The optimized model now focuses more on key features and yields better performance. Regularization and threshold optimization both contributed to reducing false positives and improving overall accuracy.

Contributing

We welcome contributions to improve this project! If you’d like to contribute, please see the GitHub repository for detailed instructions on forking, creating new branches, making changes, and submitting pull requests.