Reverse engineering is a time-consuming process that involves inspecting low-level program code (called assembly code) and determining its function. Analyzing this assembly code is critical for detecting software plagiarism and software patent infringements, identifying vulnerabilities and determining the purpose of programs when the source code is unavailable. Reusing program code is common and unregulated. Pieces of code are often reused as they are, or modified slightly, and then combined to make new programs. Using code in this way generates similar or identical pieces of assembly code, known as fragments of cloned code. Automating the detection and classification of cloned code is a great help to reverse engineers, as it saves them time and limits the number of functions they have to decipher manually. Existing approaches to clone searching have focused only on search accuracy. However, in practical applications other factors also matter for a search tool, such as the speed and responsiveness of the tool.
Ding, Fung and Charland identified a number of challenges for a practical clone search tool. These included interpretability and usability, efficiency and scalability, the ability to update the code library incrementally, and the quality of the clone search results. They designed a tool that meets these challenges. The tool makes the search results easier to understand and use by providing subgraph clones as results. These show small sections of code that are similar to, or the same as, pieces of code with understood functions. This helps reverse engineers analyze assembly code because they can quickly identify known functions and concentrate on the unknown sections.
The researchers implemented a search method, known as adaptive locality-sensitive hashing (ALSH), that is efficient in searching through assembly code. They also applied the first approach that integrates an inexact assembly code search with a subgraph search to provide high-quality search results. They implemented the search tool on the ‘Big Data’ technologies of MapReduce and the Apache Spark computational framework. To test the system, they constructed a labeled one-to-many assembly code clone dataset to allow for benchmarking. This dataset is also available to the research community.
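To give a feel for how hashing can support an inexact search, the sketch below uses plain random-hyperplane locality-sensitive hashing. It is not the adaptive scheme the researchers built, and every name in it (VOCAB, to_vector, LshIndex) is hypothetical and chosen only for illustration. It assumes each block of assembly code is summarized by counting a handful of instruction mnemonics, so that blocks with a similar instruction mix tend to land in the same hash bucket and only that bucket needs to be inspected.

```python
import random
from collections import Counter, defaultdict

# Hypothetical feature vocabulary: instruction mnemonics counted per code block.
VOCAB = ["mov", "add", "sub", "cmp", "jmp", "call", "push", "pop", "xor", "ret"]


def to_vector(block):
    """Count how often each mnemonic in VOCAB occurs in a list of assembly lines."""
    counts = Counter(line.split()[0] for line in block if line.strip())
    return [counts.get(mnemonic, 0) for mnemonic in VOCAB]


def make_hyperplanes(n_bits, dim, seed=0):
    """Draw n_bits random hyperplanes; a fixed seed keeps signatures reproducible."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vec, planes):
    """One bit per hyperplane, recording which side of it the vector falls on."""
    return tuple(
        int(sum(p * v for p, v in zip(plane, vec)) >= 0) for plane in planes
    )


class LshIndex:
    """Toy index: blocks whose signatures match share a bucket."""

    def __init__(self, n_bits=8, seed=0):
        self.planes = make_hyperplanes(n_bits, len(VOCAB), seed)
        self.buckets = defaultdict(list)

    def add(self, name, block):
        # New blocks can be added at any time without rebuilding the index.
        self.buckets[signature(to_vector(block), self.planes)].append(name)

    def query(self, block):
        # Return the names of indexed blocks in the same bucket as the query.
        return list(self.buckets.get(signature(to_vector(block), self.planes), []))


if __name__ == "__main__":
    index = LshIndex(n_bits=8, seed=1)
    known = ["push ebp", "mov ebp, esp", "sub esp, 8", "call foo", "ret"]
    index.add("known_func", known)
    print(index.query(known))
```

A query only looks at the one bucket that matches its signature, which is what keeps lookups fast as the library grows. Similar blocks can still fall into different buckets, so a practical tool would use several hash tables or, as in the researchers' adaptive method, adjust how finely the space is divided.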
The researchers have demonstrated a solution to help reverse engineers analyze assembly code. They created a clone search engine that is accurate, practical and scalable. It can help reduce the amount of time and effort required to analyze assembly code and allow the engineers to focus on deciphering new elements of code. They have made the clone search engine and the benchmarking dataset available as open-source tools.
Open-source, Big Data search tools can save time and increase reverse engineer performance.