HyDNN: A Hybrid Deep Learning Approach for Phishing URL Detection

Keywords: Phishing, Feature Selection, Dimensional Reduction, Machine Learning Classifiers, Deep Learning, HyDNN

Phishing is an online attack in which attackers trick victims into disclosing sensitive information, such as credentials, financial portal PINs, and OTPs, for the purpose of identity or financial theft. Such attacks damage reputations and put netizens at risk. Because the stakes are high, attackers invest considerable time and effort in organized crime aimed at stealing valuable user information. This research aims to detect phishing websites using machine learning and deep learning models. The classification models are applied to three phishing website datasets: the Mendeley phishing dataset and the UCI dataset, which are binary classification problems, and one dataset that is a multi-class classification problem. These datasets are publicly available for research. A custom dataset is also prepared from recently available websites to reduce potential bias in the existing datasets. The publicly available datasets are used to validate and compare the results obtained from the custom dataset. To optimize the process, various feature selection techniques and dimensionality reduction methods are applied, and a comparison of all approaches is summarized. Performance metrics for binary and multi-class classification are computed, and the outcomes are summarized. The Random Forest model performs well with most feature selection techniques, achieving a best accuracy of 98.24% on the Mendeley dataset using the embedded feature selection approach, 94.78% on the UCI dataset, and 90.57% on the custom dataset. Hence, with Random Forest as the base model, deep learning approaches, namely CNN and LSTM, are evaluated to check their efficiency. The study shows that the proposed Hybrid Deep Neural Network approach, HyDNN, performs better, giving the best results with an accuracy of 98.87% on the Mendeley dataset, 97.63% on the UCI dataset, and 93.77% on the custom dataset.
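
The abstract reports accuracy for both binary and multi-class datasets. As a minimal sketch of how such metrics can be computed in both settings, the Python below uses only the standard library. The abstract names only accuracy, so the use of precision, recall, and macro-averaged F1 is an assumption. The toy label lists, the class names ("legitimate", "suspicious", "phishing"), and the function names are hypothetical and are not taken from the study.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_prf(y_true, y_pred):
    """Return {label: (precision, recall, f1)} from one-vs-rest counts."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1      # correct prediction for class t
        else:
            fp[p] += 1      # class p was predicted but is wrong
            fn[t] += 1      # class t was missed
    scores = {}
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so every class counts equally."""
    scores = per_class_prf(y_true, y_pred)
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


# Hypothetical binary example: 1 = phishing, 0 = legitimate.
bin_true = [1, 1, 1, 0, 0, 0, 1, 0]
bin_pred = [1, 1, 0, 0, 0, 1, 1, 0]
bin_accuracy = accuracy(bin_true, bin_pred)
bin_scores = per_class_prf(bin_true, bin_pred)

# Hypothetical three-class example.
multi_true = ["legitimate", "suspicious", "phishing",
              "phishing", "legitimate", "suspicious"]
multi_pred = ["legitimate", "phishing", "phishing",
              "phishing", "legitimate", "suspicious"]
multi_accuracy = accuracy(multi_true, multi_pred)
multi_macro_f1 = macro_f1(multi_true, multi_pred)

print("binary accuracy:", bin_accuracy)
print("binary per-class scores:", bin_scores)
print("multi-class accuracy:", multi_accuracy)
print("multi-class macro F1:", multi_macro_f1)
```

Macro averaging is used here because it treats each class equally regardless of how many samples it has. Whether the study used macro, micro, or weighted averaging is not stated in the abstract.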