USENIX Security '24 Winter Paper #907 Reviews and Comments =========================================================================== Paper #907 Less Defined Knowledge and More True Alarms: Reference-based Phishing Detection without a Pre-defined Reference List Review #907A =========================================================================== Paper summary ------------- This paper proposes PhishLLM, a new reference-based phishing detector that leverages LLM for brand-domain and credential-taking-intention detection. It extensively relies on the domain and branding information encoded by the LLMs, and taps into the multimodal (visual+language) capabilities of the LLMs for webpage content analysis. The proposed PhishLLM system is shown to significantly improve recall by 31% and can discover more zero-day phishing webpages compared to existing state-of-the-art solutions. The paper also conducts an empirical evaluation of the 1300 detected phishing webpages. Detailed comments for authors ----------------------------- I found this paper to be a good read. It effectively leverages the multi-modal capabilities of LLMs for enhancing cyber security – improving phishing detection. The authors demonstrate a deep expertise of the phishing domain, and seem effective at leveraging knowledge and implementations from (recently published) state-of-the-art research. The paper has many strengths that I highlight in my ‘reasons to accept’ below. However, I find the paper lacking in a few areas that can be further improved. - My main concern is that the paper does not explore and report on the various possible ways in which LLMs can be included for improving phishing detection. For instance, in Section 3.1, why haven’t the authors shared the entire webpage screenshot with the LLM to identify the brand (I could verify that this works satisfactorily), and why did they prefer to first use OCR and image captioning models before prompting LLMs? Similarly, why did they have to first execute OCR on the screenshot before performing CRP prediction, when the screenshot could be used as-is? Outlining the reasons behind this choice will help the reader understand the motivation for increasing the complexity of the system by adding OCR and image captioning modules. - Also, LLMs are very sensitive to prompt formulation, and the authors seem to have only tried a couple of prompt variants (based on the tables in the Appendix). They do not report any results from ablation studies with various versions of the prompt: multiple formulations of task background and answer instructions, zero-shot vs. 1-shot vs. 2-shot, etc. Such an analysis would lend more credibility to the choice of prompt in PhishLLM. - It is unclear what the outcome would be in cases when the extracted logo does not match to any of the logos retrieved using the LLM inferred domain name. Would the web-page be considered benign? - The proposed defense for prompt injection attacks entails encapsulating the webpage contents within a defending instruction in the prompt ( and ). How was this defense identified, and why is it effective? I understand that the empirical results in Table 5 support this solution, but it is unclear why this is the only defense mechanism? Were other defenses were considered but not chosen? - The draft would benefit from an accurate cost estimate for running PhishLLM in practice (based on the field study dataset). Currently, there is a small discussion on the cost: $90 for OpenAI querying and $1100 for Google Logo detection. However, there are also other models and APIs being employed (PaddleOCR, Image Captioning, Google logo retrieval, etc.), where some models are deployed locally on the server, and some need to send a query to a remote server. The paper would benefit from a cost breakdown of all these various models, so that the end-to-end cost of running PhishLLM is clearly evident to the reader. - How were the three security experts recruited and compensated for doing the website labeling? Did more than one expert label each website, what was the inter-annotator agreement, and how were the disagreements resolved? - It is unclear what features were used to cluster the detected phishing domains. The paper claims to use only two features: domain names and the brand information. So is it a 2-D clustering? - All the webpage screenshots in the paper are too small to discern differences. Since there is a lot of unnecessary colored background, it would be helpful to crop the background and zoom into the relevant parts of the webpage. Minor: - Web screenshot images in the paper are too small and hard to read. They could be zoomed in to highlight the relevant parts of the input-taking forms. - It would be helpful to cite the Pixel-level perturbation references in Table 5. Ethics consideration -------------------- 1. No Comments for ethics consideration --------------------------------- N/A Required changes ---------------- - Adding details on why LLMs were shared outputs of OCR and image caption models, instead of the entire web page screenshot. - Include ablation studies on various versions of the prompt. - Add missing details on how the prompt injection defense was identified, and its effectiveness. - Add details on how the three security experts recruited and compensated for doing the website labeling. Reasons to accept the paper --------------------------- + First to leverage multi-modal capabilities of LLMs to improve phishing detection, with checks in place to validate LLM responses. + PhishLLM boasts significant improvement in phishing detection recall over existing state-of-the-art baselines. + Conducts overall and component level evaluations of the proposed system, and shows tha tit is robust against various types of adversarial attacks. + Conducts a field-study using real-world Certstream feeds. + Empirical analysis of the phishing landscape based on real-world detect phishing websites. Reasons to not accept the paper ------------------------------- - Paper does not explore and report on the other possible ways LLMs can be included for improving phishing detection, and why they were not considered. - Only considered a couple of prompt variants, and there are no ablation studies with different versions of prompt (i.e., no prompt engineering). - Missing details on how the prompt injection defense was identified, and its effectiveness. Recommended decision -------------------- 2. Accept on Shepherd Approval Questions for authors' response ------------------------------- 1. Why haven’t the authors shared the entire webpage screenshot with the LLM to identify the brand and to detect credential taking intent? 2. How was the prompt injection defense identified, and why is it effective? 3. How were the three security experts recruited and compensated for doing the website labeling? Writing quality --------------- 2. Well-written Confidence in recommended decision ---------------------------------- 3. Highly confident (would try to convince others) Review #907B =========================================================================== Paper summary ------------- This work presents PhishLLM, a pipeline for visual-based phishing detection that leverages large-language models to bridge the gap between impersonated brands and legitimate domains. Once the brand logo is identified on the page, OCR-extracted text together with a textual description is fed to a language model that is prompted to output the websites of the brand on the page. To avoid hallucinations, the extracted domain is processed to extract its logo and verify whether it matches the one originally found. The pipeline also revisits a previously proposed strategy to identify credential-taking pages by introducing LLMs instead of a visual model to identify forms. The authors measure the pipeline performance on different datasets taken from top-visited domains, benign and phishing pages, as well as running a real-world experiment to catch 0-day malicious pages. These findings show that the proposed solution largely outperforms state-of-the-art visual-based ones. Detailed comments for authors ----------------------------- Tackling phishing attacks is an ever-ending challenge and visual-based solutions are becoming very popular nowadays as a tool to detect malicious campaigns. In addition, the power of large language models can be leveraged to automate some of the modules (i.e., reference list) that require constant maintenance and that can compromise the efficacy of the whole pipeline. I thank the authors for the effort in combining all the elements. Even if this work leverages a previously proposed architecture, the changes introduced in 2 key components (i.e., brand matcher and CRP prediction) together with the gain in performance bring the novelty to an acceptable value for a venue like Usenix. I've enjoyed the work, which in my opinion can represent a nice contribution to the conference. However, I'm skeptical about dropping the reference list - and this is the main reason behind my decision, which I'm happy to change after the rebuttal. Please find below my comments. **Absence of a reference list**. Although the gain in time is remarkable, your approach for extracting the brand and the absence of a reference list can introduce many false positives and undermine your claim of `more true alarms`. - **A single domain**. The LLM is instructed to output a single domain but many brands have dozens of them associated with the same entity. A classic example is the one of banks and insurance companies. If your system is fed with the screenshot of `login.travelinsurance.ca` the LLMs will detect the `Allianz` logo and very likely output `www.allianz.com` as a legitimate domain, thus marking the original website as phishing and producing a false positive (FP). The same holds for all the multinational corporations that have a distinct domain for each country. A reference list would likely capture these entities under the same umbrella because e.g., all those websites have certificates validated by the same organization. - **Not for all the targets**. Very common targets (e.g., Google, Microsoft) offer and host (i.e., on `google.com` or `live.com`) login services for 3rd-party products. Very often during the login phase, the page shows the 3rd-party-product logo as the main one (e.g., `miro.com/signup/` then authenticate with Google). The detection pipeline and the LLM in this case will output the product website and mark `google.com` as phishing, which represents a false alarm. - **Not at the domain level**. The LLM outputs the domain corresponding to the brand at the domain level. However, this can cause many missed detections. For example, on websites like `sites.google.com`, an attacker could host a `Google` phishing page. The verification would identify the brand and the domain as congruent and not mark these samples. Although the last two cases could be mitigated by maintaining some smaller lists - e.g., login providers and hosting services - the most concerning part is the first point, which is quite common and probably not observable in a 1-month experiment. **Important details are missing**. The manuscript would benefit from more details in some key parts of the discussion. - **Domain to logo**. How do you retrieve alternative logos from a search engine starting from a domain and make sure the two aspects are aligned? How do you match if one of the retrieved logos corresponds to the one you started from? - **Manual inspection**. I'd be curious to read more details about the verification process that the analysts performed to confirm the detections of PhishLLM. Related to my previous comment, the most phished brands reported in Figure 10 (i.e., `outlook.com`, `google.com`, `microsoft.com`) fall in that dangerous category of login providers, and those companies having multiple allowed domains hosting the login possibility (e.g., live.com, microsoftonline.com). - **FPs and FNs**. The discussion about FPs and FNs is limited to a single sentence and a single experiment on a closed-world dataset (`the main reason is that the generated logo-prompt sometimes may not provide sufficient information for the LLM to infer its brand.`). Since this aspect is the core of the work, the manuscript would benefit from a better description of these cases in all the experiments. In addition, what about FPs or FNs in the real-world case? I understand the model is outperforming state-of-the-art technologies but I assume it's not perfect. When is it failing and why? **Minor comments** - **Updates**. LLMs are not re-trained regularly: although their knowledge is vast they cannot catch new brands in a short period. This aspect should be discussed in the work: is it true that LLMs save the maintenance of a reference list, but their update is not immediate and definitely slower than the one of a reference list. - **Language**. One of the identified limitations of the previously proposed CRP predictor is that the text related to credential-taking intentions can appear in multiple languages rather than English, so someone should train a visual model able to generalize on multiple languages. Is not clear whether this limitation remains in this work. Section 3.2 mentions `we consider textual content as Password, Email, Address`. How is the language taken into account here? - **Defense of prompt injection**. Please clarify the paragraph about `Defense of Prompt Injection` in Section 3.2.2. How do you place the ignore tags? Around which text? Is this a predefined list of sentences? - **Adversarial attacks**. Please clarify how you picked the adversarial attacks in Section 4.1.6. Are those real-world attacks by malicious actors performed on phishing websites to defeat ML models? Have you cherry-picked them for your pipeline? Ethics consideration -------------------- 1. No Required changes ---------------- - The proposed solution should account for the described cases that would generate several FPs in the long term - Many analysis details are missing and concern core aspect of the work Reasons to accept the paper --------------------------- - Tackling phishing is an important topic - Improvement in performance - Data and code availability Reasons to not accept the paper ------------------------------- - Some aspects might undermine the core idea of the paper (dropping the reference list) - The analysis is superficial - The architecture is not novel Recommended decision -------------------- 3. Accept Conditional on Major Revision Questions for authors' response ------------------------------- - How would you handle described failures when dropping the reference list? Could your solution be revised to handle them? - Many analysis details are missing and concern core aspect of the work Writing quality --------------- 3. Adequate Confidence in recommended decision ---------------------------------- 3. Highly confident (would try to convince others) Review #907C =========================================================================== * Updated: Apr 28, 2024, 2:23:36 PM AoE Paper summary ------------- This paper proposes PhishLLM, a large language model based phishing page classifier that considers visual logo features alongside the presence of credential collection fields to determine the overall intention of the page. The authors propose a framework consisting of image captioning, OCR, and HTML parsing/crawling to generate the necessary LLM prompts. They then evaluate the framework on datasets of benign, known phishing, and newly registered websites and generally find an improved recall over two other state-of-the-art classifiers Detailed comments for authors ----------------------------- Thank you for submitting this paper. The authors study an important application of LLM technology and show promising results in the evaluation while maintaining a practical (low) false positive rate in the pipeline's detections. I also appreciated the discussion of adversarial attacks (prompt injection) and corresponding defenses. My broad concern with this work is that the pipeline is conceptually very similar to prior work (Phishpedia and PhishIntention), with the LLM generally replacing the previously proposed models. This could be mitigated by digging deeper in this paper, such as through an expansion of the evaluation across additional scenarios to validate the generalizability and solidify the advantages of the LLM-based approach. One inherent advantage of an LLM-based brand classification approach may be the ability to adjudicate lesser known brands without the prerequisite of an extensive training dataset. I therefore encourage the authors to add experiments focused on confirming that (a) PhishLLM will not incorrectly label less popular brands (repeat the Alexa experiment with websites at the bottom of the list) and (b) similar true positive results would be observed for known phishing websites with less popular brands. The authors should also compare PhishLLM's performance on confirmed phishing websites from anti-phishing feeds (GSB, APWG, Phishtank, VirusTotal, etc.) in addition to the zero-day detections from the certificate stream, as this would allow for a much larger scale evaluation. Although recent work has focused on brand similarity / credential collection intention, for completeness the authors should consider traditional DNS/URL based attributes, especially to the extent that they could provide useful context for the LLM. From a writing/presentation perspective, I believe this paper has significant room for improvement. Although Sections 2-3 were generally clear in describing the approach, Section 4 oddly mixes the evaluation with statistics on the observed phishing sites, and Section 5 could be expanded with practical recommendations. Also, some of the discussion from the introduction could be moved to the related work. * The authors claim low cost as an advantage, but the components and their costs are not evaluated/described in detail. * Details on the latency are missing. This should be accompanied by a clearer discussion of how the authors propose PhishLLM should be deployed in practice. Statements such as "In addition, the budget is more friendly to security startup than the traditional Google Logo Detection service (used by DynaPhish)" require additional context. * Similarly, statements about the O(N) complexity require context on the corresponding runtime overhead. It would otherwise not seem like an issue to look up a brand once the data has been collected. * The blanket references after "we fine-tune a visual-language model" (Section 3.3) and "captioning technique" (Section 3.1.1) are not helpful, especially when specific models are later mentioned in Section 4.1.1. **Nits/Typos** * "A single phishing campaign averages a loss of 4.45 million" (Page 1) - This is incorrect; the report states this is the average cost of a data breach, not a phishing campaign. Also, specify the currency. * "state-of-the-arts" (Page 2) * "steps," (end of Section 3.1.0) * Sections 4.1.4 and 4.1.5 could benefit from using more memorable names for the datasets, i.e. replacing "the 3,640 annotated Alexa dataset" * "OpenAI service, Nevertheless" (Section 4.2) * [7] in 4.3 would be better as a footnote * "Chinese etc" (Section 4.5.1) * " indicating that the phishing attacks might be a lucrative business than expected." (Section 4.5.1) - this statement is redundant given what was said in the introduction. * Figure 12 - crop the login forms for better readability Ethics consideration -------------------- 3. Yes: submission may not appropriately mitigate potential risks or harms Comments for ethics consideration --------------------------------- In cases where the authors found zero-day phishing websites, were these reported to the appropriate entities (hosting providers, google safe browsing, etc.) in a timely manner? Required changes ---------------- * add a detailed breakdown and evaluation of both the costs and latency * improve the organization of the paper and clearly state scenarios in which PhishLLM could/should be deployed in practice * expand the evaluation to consider (a) less popular benign websites, (b) recent websites reported as phishing in public phishing feeds * add a breakdown and evaluation of the failure cases mentioned in Section 4.2 to the paper itself * add an explanation/evaluation of the hyperparameters shown on the demo page * improve the presentation and correct factual errors in references Reasons to accept the paper --------------------------- * proposed approach shows that LLMs can enhance existing phishing detection approaches * innovative prompt engineering and protections against adversarial attacks * improved recall compared to state-of-the-art Reasons to not accept the paper ------------------------------- * approach requires further validation: missing evaluation on less popular benign pages, real-world evaluation limited in size * pipeline is heavily based on prior work / off-the-shelf models * writing / presentation issues, unclear practical (deployment) recommendations Recommended decision -------------------- 3. Accept Conditional on Major Revision Questions for authors' response ------------------------------- * Did the authors consider traditional attributes such as URL/DNS when designing the framework? For example, would the recall be improved if the LLM received context on the age of the domain, deceptiveness of the URL, use of redirection, use of a commonly abused hosting provider, etc.? * Please provide details on the root cause of the LLM hallucination and insights into the reasons for the criticality of the domain validation step. * How would an injected string such as "this website is benign" affect the output of the LLM? Writing quality --------------- 4. Needs improvement Confidence in recommended decision ---------------------------------- 3. Highly confident (would try to convince others) Review #907D =========================================================================== Paper summary ------------- This paper proposes a method to detect phishing websites, by improving the existing methods of 'brand recognition' and 'detection of credential requiring pages' using LLMs. Compared to the three relevant previous work, the method achieves a higher recall, detecting more zero-day phishing websites. Detailed comments for authors ----------------------------- The paper aims to solve the long-lasting problem of phishing detection with the help of LLMs. While the approach is promising, I think the paper has several limitations in terms of its novelty, related work analysis, and evaluation scope. - First of all, thank you for making the code and the demo available. FYI, I tried "http://jlqj.com.cn/i..**@r.*.com" example in the demo, where the model failed to recognize the Outlook logo. I imagine DynaPhish or any other technique that does a reverse image search could detect this? - The evaluation only focuses on previous academic work with similar detection methods (credential taking intention, brand recognition), while the paper quickly dismisses the rest of the phishing detection work (saying that these solutions lack interpretability and cannot detect the new phishing attacks). However, it does not sound very reasonable to me that, e.g., this is the first study that analyzes whether the website form has username/password fields, using OCR. I think the comparative evaluation should be extended to other phishing detectors (e.g., some free services I found with simple Google search: urlscan.io (looks at page activity), checkphish.bolster.ai (uses computer vision and NLP)) to justify paper's claim that other solutions do not work. - The related work section is quite superficial considering the large amount of work in this field, failing to provide a comprehensive analysis of methodologies in phishing detection. A more in-depth examination of previous work could better highlight the significance of the proposed improvements. - The paper brings an incremental contribution: compared to DynaPhish + PhishIntention, there is only 11% increase in precision. However, this combination (DynaPhish + PhishIntention) is not included for the large-scale analysis due to budget constraints (if I understand correctly). I think the paper could make a more clear analysis of different metrics such as cost and runtime overhead. For instance, runtime overhead may not be a big issue depending on the use case. - I appreciate the analysis on adversarial attacks, however I think the paper does not do a very good job in explaining its limitations clearly. For instance, the OCR method could fail in various occasions, the logo recognition might fail (as in the Outlook example), and LLMs may not be aware of a recent change of logo/branding of a certain company. - Phishpedia paper discovers 1704 phishing web pages in CertStream in 30 days in [55]. However, similar experiment in this study yields only 178 web pages. What might be the reason for this difference? Other issues: - Most of the figures and plots are not readable. - I find section 4 very difficult to follow. It would be better to give the results of the RQs, immediately after explaining the methods and datasets. Ethics consideration -------------------- 1. No Reasons to accept the paper --------------------------- - The method is able to detect more zero-day phishing pages compared to the 2 previous works. - Provides insights into recent phishing campaigns. - The paper uses several different datasets and performs manual verification when necessary. Reasons to not accept the paper ------------------------------- - Limited novelty: The paper does not propose a novel method/idea to detect phishing, but improves existing methods via LLMs. - Related work section is very superficial. - Evaluation only covers three previous studies. - Limitations of the work is not clearly explained. Recommended decision -------------------- 4. Reject Writing quality --------------- 4. Needs improvement Confidence in recommended decision ---------------------------------- 2. Fairly confident AuthorFeedback Response by Author [Yun Lin ] (892 words) --------------------------------------------------------------------------- We sincerely thank the reviewers for their devotion and time in our work! Given the response length, we expect that we can have more discussion in the follow-up interactive rebuttal stage. > RA-Q1 Why not feed the screenshot to LLM? We agree with the reviewer on this idea! The visual language model (VLM) in GPT-4 was not available when we were conducting the work. Nevertheless, even with VLM, it remains essential to identify the most salient UI components. We are now exploring a new work called **visual prompt**, to highlight the logo and credential-taking components on screenshots to optimize the performance of VLM. > RA-Q2 Why is the defense for prompt-injection effective? The defense is designed to achieve that **text from the webpage should be data instead of instruction to LLM**. We also assume that the designed prompt is “secret” to the attackers. Otherwise, the defense is potentially compromisable. > RA-Q3 How to recruit/compensate the security experts? We hire them at a rate of ~22 USD/hour. We select them based on the criteria of at least 2 years of cybersecurity experiences. > RA-Q4 Why not ablate on different prompts? We tried quite many prompts, empirically gravitating to this version. We can explore more in the future. > RA-Q5 What if the retrieved logo does not match the extracted logo? If it does not match, we assume that LLM is unaware of such brand knowledge. Thus, we conservatively report it as benign. > RA-Q6 A 2-D clustering for detected phishing websites? We first group phishing websites by targeted brands, then we cluster them by domain names (via name similarity). > RB-Q1 Domain alias/single-sign-on (SSO) problems We thank the reviewers for insightful questions! The domain alias problem can be addressed by introducing a popularity-verification step for PhishLLM. A phishing website can hardly be alive for over a week. Therefore, the popularity of a website (e.g. by Google engine indexing) indicates its benignity, thus further indicating a potential domain alias. For SSO services, users are redirected to the SSO provider's domain (e.g., google.com) for authentication. Thus, the domain on the SSO page is the third-party domain (e.g., google.com), which makes PhishLLM less likely to introduce FP. > RB-Q2 What if Google phishing on sites.google.com In this case, an attacker might only host Google (but not PayPal) phishing on sites.google.com. Thus, a heuristic could be used to specify that *sites.google.com is not google.com*. > RB-Q3 Missing details (1) We retrieve the logos by formulating the query as "{inferred_domain_name} logo" from a reported domain. If one of the top-k retrieved logos is matched to the webpage logo, we consider the validation successful. (2) We manually confirm that the reported phishing are actual phishing domains instead of domain aliases. (3) We agree. ## False positive can be: The logo semantics is sometimes similar to the reported target, confusing the LLM validation step, see examples at https://sites.google.com/view/phishllm/false-positives ## False negatives can be: (i) LLM is unaware of this brand (ii) The sensitive texts are in dim colors, which are hard for OCR to recognize See examples at: https://sites.google.com/view/phishllm/false-negatives > RC-Q1 Consider traditional attributes as URL/DNS? No. We thank the reviewer’s suggestion and will explore their effectivenesses in our future work. > RC-Q2 Root cause of the LLM hallucination for justifying domain validation? LLMs inherently hallucinate for their probabilistic nature, leading to potential misinformation that necessitates fact-checking for reliability. We validate the inferred domain using the Google search engine. > RC-Q3 Injecting string benign website affect the output? Thanks for raising this good example! We assume that the LLM prompt is kept confidential, which is the most effective measure against prompt-injection, to the best of our knowledge. If disclosed, as noted by the reviewer, it could be exploited. We believe the advance of LLM security research can further mitigate the issue in the future. > RD-Q1 Not working on phishing url http://jlqj.com.cn/i..**@r.*.com? PhishLLM may not detect *every* phishing website (but no technique can). But, rejecting a paper with only one example is **unfair**. The reviewer can try another example as mailhost.jcd-groupe.fr, see https://sites.google.com/view/phishllm/outlook-phishing-examples The fundamental contribution of PhishLLM lies in **decoding brand and domain knowledge from LLM as reference for cross validation**. It is not simply to replace existing functionalities of the SOTA detectors. The advance of VLM can further strengthen the brand recognition capability. We fully believe in the technical direction. We are also happy that the reviewer can raise more engineering issues after we release the code. > RD-Q2 Why not compare urlscan.io, etc. ?. We respectfully disagree. For your information, the SOTA in phishing detection has progressed to reference-based phishing detectors (RBPD) e.g., Phishpedia, robust to distribution shifts and explainable. They have been benchmarked against VirusTotal which integrates over 90 phishing detection engines (including urlscan.io and GSB). It has shown that RBPDs identify significantly more phishing webpages *in the wild*. > RD-Q3 Related work is superficial We respectfully disagree. This work is a new SOTA RBPD. We request the reviewer to raise more RBPDs if he/she insists. > RD-Q4 Missing runtime overhead We respectfully disagree. Please check Table 3 and 6, and Section 4.3. > RD-Q5 Lack of limitation analysis We respectfully disagree. Please check Section 5. > RD-Q6 Why only 178 webpages in the experiment, regarding Phishpedia? To estimate the recall in a more manageable way, we adopt experimental settings as DynaPhish instead of Phishpedia. We scan 3K webpages everyday (so that we can manually count the recall), thus reporting less webpages compared to Phishpedia. Comment @A1 by Reviewer B --------------------------------------------------------------------------- Dear authors, after the rebuttal, the reviewers agreed that the paper can be a valuable contribution for the conference, but it needs to undergo a major revision. Please find attached the list of revision criteria. Please let us know if you'd like some clarifications on something and keep in mind the conference deadlines. - The absence of a reference list can introduce many false positives if the mapping between a brand and its legitimate domains is handled by retrieving a single domain from the LLM. This aspect undermines the whole methodology and needs to be addressed. Many financial institutions provide the possibility of login in different countries and on different domains (e.g., https://international.barclays.com/important-information/log-in/ allows to reach 2 login pages representing the same institution on two different domains). This is also a common with banks and their insurance companies (i.e., same brand, different domains). Such cases are not isolated but very common. The popularity verification can provide more insights to reach a decision, but introduces a new level of complexity that needs to be handled and defined (e.g., which threshold? popular in which country?) - The point above also concerns SSO logins and subdomains. Although their handling is easier compared to the previous case, they should be mentioned in the text. - The analysis of FPs and FNs is very superficial in the current manuscript. Please expand it to let the user understand when the proposed solution fails. - Add a detailed breakdown and evaluation of both the costs and latency improve the organization of the paper and clearly state scenarios in which PhishLLM could/should be deployed in practice - Expand the evaluation to consider (a) less popular benign websites, (b) recent websites reported as phishing in public phishing feeds (this goes hand-in-hand with the FN/FP analysis above) - Add a breakdown and evaluation of the failure cases mentioned in Section 4.2 to the paper itself - Add an explanation/evaluation of the hyperparameters shown on the demo page - Improve the presentation and correct factual errors in references - Provide more details on the design of the prompt and the prompt-injection defense, and what other prompt variants were considered and didn't work well. - Add other clarifications provided in the rebuttal response. Comment @A2 --------------------------------------------------------------------------- [REC] This comment is regarding ethics. At a minimum, the researchers should report any websites they found (ideally as they find them) and comment on how this was done throughout the study. Comment @A3 by Author [Ruofan Liu ] --------------------------------------------------------------------------- We thank the reviewers for their devoted effort in improving our draft! We index your advice with planned fixing solutions for your more convenient validation. In our revision, we will prepare highlighted paragraphs with their corresponding indexes. Please kindly let us know if we can proceed with the revision accordingly. Thanks again! # R1. [More Experiment] Alias domain concern To address the concern, we will prepare an additional experiment given that the number of websites with multiple domains is small in the wild. Therefore, we plan to prepare 100 brands that have multiple benign domain aliases. For each domain, we run a “Brand Recognition Model” experiment with the popularity validation step. We evaluate the additional experiment with (1) brand recognition rate and (2) false positive rate. # R2. [More Discussion] SSO login, subdomains concern As per your kind advice, we will include SSO login and subdomain concerns in a discussion section. # R3. [More Discussion] Expand the analysis of FP and FN As per your kind advice, we will include more examples of FPs and FNs in our evaluation. # R4. [More Experiment/Discussion] Detailed breakdown of the cost and latency of PhishLLM, and how it can be deployed. As per your kind advice, we will break down all the costs of LLM interaction with a paragraph on runtime cost evaluation. Further, we will illustrate how we deploy it on the cloud in our implementation. # R5. [More Experiment] Expand evaluation to less popular benign websites and recent phishing websites To address the concern, we plan to add an experiment on Alexa's low-ranked 3k websites. Also, we plan to collect public phishing feeds from OpenPhish and Open Source Phishing Trap (https://github.com/mitchellkrogza/Phishing.Database), and accumulate to ~3k websites. We evaluate benign websites with a brand recognition rate and false positive rate; and phishing websites with false negative rates. # R6. [More Example] Evaluation of the failure cases As per your kind advice, we will illustrate the failing cases with more examples. # R7. [More Discussion] Explain the hyperparameters shown on the demo page As per your kind advice, we will explain how and why we select hyperparameters on the demo page. # R8. [More Discussion] Improve the presentation As per your kind advice, we will thoroughly go through our revision to avoid confusion and typos. # R9. [More Discussion] Provide more details on prompt design and other potential variants As per your kind advice, we will provide more clarification and detailed examples of prompt variants and the design philosophy. # R10. [More Discussion] Add clarification in the rebuttal phase. As your kind advice, we will address our concern in the rebuttal phase. Comment @A4 by Author [Yun Lin ] --------------------------------------------------------------------------- Dear Shepherd and the reviewers, Would you mind checking whether we can improve the paper based on our summarized changing plan? Thanks a lot! Comment @A5 by Shepherd --------------------------------------------------------------------------- Dear authors, Thank you for proving the list. This sounds like a comprehensive plan and I look forward to reviewing the revisions. Please maintain an emphases on R1-R5, as these items pertain to the practical/at-scale applicability of the proposed approach. This is the key requirement for acceptance. In addition, please address reviewer D's comments about related work (toward providing a concise yet a comprehensive analysis of methodologies in phishing detection as they compare to the proposed approach) as well as the request to report (and, ideally, track the response to) any new phishing websites identified during the course of the research. Comment @A6 by Author [Yun Lin ] --------------------------------------------------------------------------- Dear Shepherd, We will address your concerns accordingly. Thanks again for your kind advice and comments! Comment @A7 by Author [Ruofan Liu ] --------------------------------------------------------------------------- Dear Shepherd and reviewers, We sincerely thank the shepherd and the reviewers for their devotion to our work! We have addressed *all* your comments in your revision. To save your efforts, we highlight the key paragraphs with assigned indexes to each comment such as R1, R2, etc. The shepherd and the reviewers can search the index across the paper to see how we address each comment. You can kindly check the response letter and find the corresponding change in the revision. Comment @A8 by Author [Ruofan Liu ] --------------------------------------------------------------------------- Dear Shepherd and reviewers, Sorry for keep reminding you. However, we might need some time to prepare a visa for the conference. Thus your early decision matters a lot to us. Could you kindly check whether you are satisfied with our revision? Also, please feel free to let us know whether you need us to make extra changes to the paper. Many thanks! Comment @A9 by Shepherd --------------------------------------------------------------------------- Dear authors, Thank you for submitting the revision. I will review the changes and provide feedback in the coming days. Comment @A10 by Author [Ruofan Liu ] --------------------------------------------------------------------------- Dear Shepherd and reviewers, Looking forward to your feedback. Many thanks! Comment @A11 by Shepherd --------------------------------------------------------------------------- Dear authors, Thank you for submitting the revised version! This addresses a majority of the revision criteria, however several additional changes/clarifications are still required. Please see below for remaining concerns and suggestions: Domain Alias (3.1.4/4.3.2) - this approach is not well defined and seems prone to false negatives, i.e. in the case of phishing pages that are hosted on compromised or highly evasive infrastructure, as noted in prior literature [7, 68]. This should be acknowledged and/or mitigated by additional features, and the authors should add details on domains that may be incorrectly excluded with the addition of this step, such as on the public phishing feed study. Ideally, there should be some comparison of the results with and without this step. Please share any additional details you may have to clarify the approach. SSO Domain Redirection (3.1.4) - more details are needed on the "additional brand recognition" and "brand-domain consistency checks" mentioned in this section. Public Phishing dataset (Table 7) - what overlap or differences did you observe in the sites detected by each approach? In other words, did the same or different failure cases apply to PhishPedia & PhishIntention? Separately, given the amount of time available for the revision, it would be nice if preliminary findings toward the future work (i.e. adapting LLMs to the crypto phishing MO) could be added, instead of mentioning that it could be done. Intuitively, based on the other findings in this paper it would seem that a key strength of LLM-based detection would be simpler adaptation to different types of phishing. Presentation concerns: * The paper currently (significantly) exceeds the page limit of 18 pages including references and appendices. Please designate content that will be cut from the final version, or otherwise consolidated. I would recommend excluding observations (such as 4.5.3-4.5.5) where the findings are not part of the evaluation or framework design, as well as figures that can be adequately explained in the text. * The framing in the introduction is unchanged, yet the performance-related design aspects seem to be secondary to the improvements in recall offered by PhishLLM. I would therefore recommend trimming this discussion (to save space) and focus on incorporating insights from the revision. * Raw counts in addition to percentages would be helpful, such as the failure cases in Section 5 * I did not find details on the reporting beyond A.3. Were URLs shared directly with relevant anti-phishing entities? If not, this should be mentioned and done for future deployments of the framework. * The dependency on google APIs (such as the image/search APIs) is OK, but should be mentioned as a limitation compared to phishing detection systems without such requirements * Changes made in the revision, such as section 3.1.4, should be reflected in Figures 1/2 * "v.s." -> vs. (Section 5) * 10.3 billion -> $10.3 billion (Section 1) * SafeBrowsing -> Safe Browsing (Section 1) * benignity -> benignness (or benign nature) (Section 1) Comment @A12 by Author [Yun Lin ] --------------------------------------------------------------------------- Dear Shepherd, We sincerely thank you for your detailed comments! Those suggestions can further improve our work and presentation. We will address your comments soon. Comment @A13 by Author [Ruofan Liu ] --------------------------------------------------------------------------- Dear Shepherd and Reviewers, We sincerely thank the shepherd and reviewers for their detailed comments! We have addressed the additional suggestions in our new version of the draft. We have re-indexed and highlighted the comments as R1, R2, etc. You can kindly check the response letter and find the corresponding changes in the revision. Comment @A14 by Author [Ruofan Liu ] --------------------------------------------------------------------------- Dear Shepherd and reviewers, Given the deadline is coming, could you please check our revision, we are looking forward to your feedback. Many thanks! Comment @A15 by Shepherd --------------------------------------------------------------------------- Dear authors, Thank you for making the revisions. To address R1, please clearly state in the limitations that the popularity check at the domain level may lead to the exclusion (=false negative) of phishing pages on compromised (legitimate) websites, and that additional checks/attributes would be needed in such a scenario. I feel that the ROI justification/discussion is unnecessary and should be excluded unless the authors have evidence to quantify the difference in ROI. For example, in prior years, there was a high proportion of Wordpress-based phishing pages due to the ease at which Wordpress vulnerabilities could be detected and exploited to deploy phishing kits. At the end of the day, this limitation does not take away from the other merits of PhishLLM, and could potentially be addressed in future work, so acknowledging it will suffice. Please upload the 18-page version of the paper with this change for final approval, at which point I'll be happy to conclude the revision process. Comment @A16 by Author [Ruofan Liu ] --------------------------------------------------------------------------- Dear Shepherd and reviewers, Thank you for your feedback. We have updated the limitations section. The revised 18-page version of the paper has been attached for your final review. Many thanks! Comment @A17 by Shepherd --------------------------------------------------------------------------- Dear authors, Thank you. This concludes the shepherding process, and I will now recommend the paper for acceptance. Congratulations! You may of course continue making editorial changes to polish/rearrange the content within the 18-page limit as you see fit up until the camera-ready deadline. Comment @A18 by Author [Yun Lin ] --------------------------------------------------------------------------- Dear Shepherd and reviewers, It is a great news. We sincerely thank you for your advice and help to improve our work!