Phishing is a major problem for computer security. In a phishing attack, messages are designed to trick people into taking actions that benefit the attacker. Identifying and avoiding these messages requires a clear and deep understanding of the messages themselves.
Cui and a team of industry and academic researchers set out to better understand this type of attack by focusing on phishing websites. The researchers collected nearly 20,000 websites to study by downloading pages each day for 10 months. Rather than comparing phishing websites to legitimate websites or looking at the characteristics intrinsic to phishing messages, the researchers compared the phishing websites with each other to see whether the malicious webpages shared common elements. They compared properties of the tag elements in the webpage code to find structurally similar pages, then grouped the similar websites together.
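The grouping step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the researchers' actual method: the source does not say which tag properties were compared or how similarity was scored. The sketch assumes each page is summarized by how often each HTML tag occurs, that two pages are "structurally similar" when the share of differing tag occurrences is at most a hypothetical `threshold`, and that pages are grouped greedily. The names `tag_vector`, `tag_distance`, and `group_pages` are made up for this example.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how often each opening tag appears in a page."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def tag_vector(html):
    """Summarize a page as tag -> number of occurrences (text is ignored)."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def tag_distance(a, b):
    """Share of tag occurrences that differ between two pages (0.0 = identical)."""
    tags = set(a) | set(b)
    differing = sum(abs(a[t] - b[t]) for t in tags)
    total = sum(max(a[t], b[t]) for t in tags)
    return differing / total if total else 0.0


def group_pages(pages, threshold=0.2):
    """Greedy grouping: a page joins the first group whose representative
    (the group's first page) is within `threshold`; otherwise it starts
    a new group. The threshold value is an assumption for illustration."""
    groups = []  # list of (representative_vector, [page names])
    for name, html in pages.items():
        vec = tag_vector(html)
        for rep, members in groups:
            if tag_distance(rep, vec) <= threshold:
                members.append(name)
                break
        else:
            groups.append((vec, [name]))
    return [members for _, members in groups]
```

Because only tags are counted, two pages with the same markup but different visible text or hosting domain end up in the same group, which matches the kind of repeat the study was looking for. A greedy pass is order-dependent; a real analysis would likely use a proper clustering method.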
The research team compared a total of 19,066 phishing sites. As expected, most of the websites had a short lifetime. Surprisingly, 20% of the studied websites lasted for more than a month. Comparing the content of phishing websites provided useful insights. The changes attackers make to phishing websites to avoid detection are often subtle, such as moving the page to a different domain or subdomain. Attackers are less likely to change the page content itself. In fact, the researchers found that 90% of the pages were repeats of an earlier attack.
Grouping pages into classes by content shows that although individual instances of phishing change very rapidly, the classes they belong to remain in use over a much longer period. It is likely that recreating the websites or messages is more demanding and costly than creating new instances at varied domains. Detecting phishing based on attack classes could therefore shift the burden onto attackers, making the changes needed to avoid detection more costly for them.
Detecting phishing pages based on the structure of their content could provide an additional screening tool. This method could identify a large proportion of phishing pages, and it overcomes shortcomings of filtering based on a URL or domain name. Requiring extensive modification of phishing page content increases the difficulty for attackers and could slow their rate of adaptation to filtering technologies.
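As a rough illustration of such a screening tool, the sketch below checks a newly seen page against reference pages from already known attack classes. It rests on assumptions not found in the source: structure is approximated by the ordered sequence of opening tags, similarity is scored with Python's standard `difflib.SequenceMatcher` (a stand-in technique, not the one used in the study), and the `min_similarity` cutoff, the `screen_page` function, and the class names are hypothetical.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagSequence(HTMLParser):
    """Records opening tags in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    """Return the page's opening tags as a list; text and URLs are ignored."""
    parser = TagSequence()
    parser.feed(html)
    parser.close()
    return parser.tags


def screen_page(html, known_classes, min_similarity=0.9):
    """Return the name of the best-matching known class, or None if no
    class reaches `min_similarity` (an assumed, illustrative cutoff)."""
    candidate = tag_sequence(html)
    best_name, best_score = None, 0.0
    for name, reference_tags in known_classes.items():
        matcher = SequenceMatcher(None, candidate, reference_tags, autojunk=False)
        score = matcher.ratio()
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= min_similarity else None
```

The check never looks at the URL or domain, so moving an unchanged page to a new domain does not help it pass. To evade a filter of this kind, the page's markup itself would have to be reworked, which is the costlier change discussed above.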
Detection methods that focus on page content could shift the burden onto criminals by making it more costly to avoid detection.