From Black to White, From Passive to Proactive - Developing an Explainable Reference-Based System for Comprehensive Phishing Monitoring
COM1 Level 3
MR1, COM1-03-19
Abstract:
Phishing, an increasingly prevalent form of social engineering, causes substantial financial and security harm to both individuals and organizations. The ease of creating phishing webpages, coupled with the advanced evasion techniques employed by attackers, poses considerable challenges for designing effective anti-phishing strategies. Furthermore, the ephemeral nature of these phishing sites demands that anti-phishing systems be cost-effective while maintaining a high detection rate.
Despite ongoing research, current phishing detection systems exhibit notable limitations in addressing these challenges. First, these models are not effective enough for real-time application. Second, they rely on manually crafted features, so what they learn is biased towards the phishing training set, potentially leaving them vulnerable as attackers adapt their strategies. Third, their decision-making is a black box that offers no explanation, although explainability is a critical factor for gaining the trust of end users. Finally, existing approaches are reactive rather than proactive, making them incapable of discovering emerging phishing threats.
In response to these critical issues, this thesis introduces an innovative framework for proactive phishing defense. Our multi-layered approach comprises:
A novel reference-based phishing detector that outperforms existing methods. The detector is interpretable, highlighting the phishing target and credential-taking areas, and it generalizes effectively to real-world phishing without training on any collected phishing webpages. Technically, this approach integrates advancements in computer vision, large language models, and web testing techniques to (i) identify the brand impersonated by the phishing webpage through comparison with a reference list of protected brands; (ii) locate and classify the types of credentials the webpage seeks to acquire; and (iii) perform counterfactual testing to detect any suspicious activities on the site.
A significant contribution to the academic community through the compilation of the most comprehensive phishing benchmarks to date. This includes a static benchmark of 29,496 phishing webpages, providing a robust tool for evaluating various phishing detectors, and a dynamic benchmark of 6,344 phishing kits, offering a secure, reproducible environment to analyze phishing tactics.
Building on these elements, we developed a proactive phishing monitoring system that offers an end-to-end solution, ranging from website crawling and incident reporting to trend visualization. This system focuses on examining emerging websites and can generate an average of 50 zero-day phishing alerts daily. Such insights into the evolving phishing landscape and tactics are invaluable in guiding the development of defensive strategies for government agencies, web service providers, business affiliates, and individual users.
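To make the reference-based idea concrete, the following is a minimal sketch of the decision logic only, not the thesis implementation. It assumes that an upstream step (for example, a vision or language model) has already named the brand a page presents and listed its credential fields. The names `REFERENCE_LIST`, `PageObservation`, and `classify`, the example domains, and the rule that a page without credential fields is treated as benign are all illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple
from urllib.parse import urlparse

# Hypothetical reference list: protected brand -> domains it legitimately uses.
REFERENCE_LIST = {
    "paypal": {"paypal.com"},
    "microsoft": {"microsoft.com", "live.com"},
}


@dataclass
class PageObservation:
    url: str
    # Brand reported by an assumed upstream brand-matching step, if any.
    matched_brand: Optional[str] = None
    # Credential types found by an assumed upstream step, e.g. ["password"].
    credential_fields: List[str] = field(default_factory=list)


def on_legitimate_domain(url: str, brand: str) -> bool:
    """True if the URL's host is one of the brand's domains or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in REFERENCE_LIST.get(brand, ())
    )


def classify(obs: PageObservation) -> Tuple[str, str]:
    """Return (label, explanation) so that each verdict carries its reason."""
    if obs.matched_brand not in REFERENCE_LIST:
        return "benign", "no protected brand identified on the page"
    if on_legitimate_domain(obs.url, obs.matched_brand):
        return "benign", f"{obs.matched_brand} page served from its own domain"
    if not obs.credential_fields:
        # Assumption of this sketch: brand use without credential taking is not flagged.
        return "benign", "brand shown on a foreign domain, but no credentials requested"
    return (
        "phishing",
        f"impersonates {obs.matched_brand} on a foreign domain and requests "
        + ", ".join(obs.credential_fields),
    )


suspect = PageObservation(
    url="https://paypal.com.secure-login.example/signin",
    matched_brand="paypal",
    credential_fields=["password"],
)
label, reason = classify(suspect)
print(label, "-", reason)
```

Because the verdict depends only on the reference list and on what the page presents, a rule of this shape needs no phishing training data, and the returned explanation names both the impersonated brand and the credentials requested.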