Insider threats are potentially harmful acts carried out by authorized personnel, whether intentional (sabotage, intellectual property theft) or unintentional (careless use of computing resources). Recent reports indicate that 53 percent of organizations and 42 percent of U.S. federal agencies suffer insider threat incidents every year. Because malicious insiders' activity makes up only a small portion of the user activity recorded across domains such as web browsing, file access, and email, it is difficult to detect and to distinguish from legitimate behavior.
Le and Zincir-Heywood developed and tested insider threat detection systems based on supervised machine learning algorithms. The systems were trained on an initial set of malicious and normal users' data, labeled according to the manual findings of security analysts. Each system analyzed streams of activity log data, including web history, email logs, and file access records, together with organization structure and user information. This data was prepared for analysis by extracting features describing activity frequency and file statistics, such as the number of emails sent and the sizes of files accessed.
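The feature-extraction step described above can be sketched as follows. This is a minimal illustration with made-up log records and field names; the actual schema and feature set used in the study (drawn from the CERT dataset) differ.

```python
# Hypothetical raw activity log: one record per event, per user.
# Field names ("user", "type", "size") are illustrative only.
events = [
    {"user": "u1", "type": "email", "size": 2048},
    {"user": "u1", "type": "email", "size": 1024},
    {"user": "u1", "type": "file",  "size": 50000},
    {"user": "u2", "type": "file",  "size": 1200},
]

def extract_features(events):
    """Aggregate per-user frequency and file-size statistics,
    in the spirit of the features the study describes (e.g. number
    of emails sent, file sizes)."""
    features = {}
    for e in events:
        f = features.setdefault(
            e["user"],
            {"n_emails": 0, "n_files": 0, "total_file_bytes": 0},
        )
        if e["type"] == "email":
            f["n_emails"] += 1
        elif e["type"] == "file":
            f["n_files"] += 1
            f["total_file_bytes"] += e["size"]
    return features

print(extract_features(events))
```

Each user's raw event stream is thereby reduced to a fixed-length numeric summary that a standard classifier can consume.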
Systems using Logistic Regression, Random Forest, and Artificial Neural Network algorithms were compared on the publicly available insider threat dataset from CERT, which simulates an organization of 2,000 employees over a period of 18 months. The training data was limited to at most 400 identified "normal" and "malicious" users from only the first 37 weeks, to simulate real-world environments where labeled data is available from only a restricted set of sources.
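A comparison of the three algorithm families on a small labeled set can be sketched with scikit-learn. The data below is synthetic and the hyperparameters are placeholders, not those of the study; the point is only the shape of the setup: few labeled users, a heavy class imbalance, and three interchangeable classifiers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for per-user feature vectors; malicious users
# are deliberately rare, mirroring the class imbalance in the study.
X_normal = rng.normal(0.0, 1.0, size=(380, 5))
X_malicious = rng.normal(1.5, 1.0, size=(20, 5))
X = np.vstack([X_normal, X_malicious])
y = np.array([0] * 380 + [1] * 20)  # 0 = normal, 1 = malicious

# The three algorithm families compared in the study.
models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
    "NeuralNetwork": MLPClassifier(hidden_layer_sizes=(16,),
                                   max_iter=2000, random_state=0),
}
for name, model in models.items():
    model.fit(X, y)
    print(name, "training accuracy:", round(model.score(X, y), 3))
```

In a realistic evaluation, of course, the models would be scored on held-out later weeks of data rather than on the training set, as the study does by restricting labels to the first 37 weeks.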
The results reveal a trade-off between the detection rate (the fraction of all malicious threats that are caught) and precision (the fraction of flagged users who are actually malicious, i.e., results free of false alarms). Logistic Regression achieved high detection rates but suffered from low precision. The Random Forest algorithm showed very good precision, while the Artificial Neural Network gave better malicious insider detection rates in most of the cases tested. One of the four tested insider threat scenarios proved the hardest to detect in all cases: company insiders surfing job websites and soliciting employment from a competitor, then using a thumb drive to steal company data before leaving. These malicious actions were spread over a period of two months on average, making them hard to detect.
This study shows that a machine learning detection system can detect malicious insiders with only limited training data. Different algorithms have different strengths: the high-precision Random Forest algorithm can be employed where monitoring resources are limited, whereas the Artificial Neural Network provides higher detection rates, albeit with more false alarms.
It is possible for machine learning algorithms to detect malicious insider threat activity.