Typical machine learning approaches in this problem domain use raw bytes, a graph representation, or raw disassembler output. INL’s research found that each of these representations fails to encode all the relevant information contained within executable code, which leads to shortcomings when attempting more sophisticated analysis tasks via machine learning. Using supervised learning to analyze compiled software requires an assembly-like language that is platform neutral and works well for machine learning.
INL’s approach disregards address spaces and instead focuses on the individual operations performed within each function and block. The technology uses various assembler lifters and their respective intermediate languages to capture the underlying instruction statements. It counts how many times each instruction appears within a block of code, producing a vector that fingerprints each function. This vectorized representation is in the form needed by various machine learning models. It can also be stored as text and is suitable for use in a graph database.
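The counting step described above can be illustrated with a minimal Python sketch. It is not INL’s implementation: the operation names (`LOAD`, `ADD`, and so on), the function names, and the helper functions are all hypothetical, and the input is assumed to be a list of intermediate-language operations already produced by some lifter.

```python
from collections import Counter

# Assumption: each function has already been lifted to a flat list of
# intermediate-language operation names. The names below are illustrative
# and are not taken from any particular lifter.

def build_vocabulary(functions):
    """Return a sorted list of every operation seen across all functions."""
    return sorted({op for ops in functions.values() for op in ops})

def count_vector(ops, vocabulary):
    """Count how often each vocabulary operation occurs in one function."""
    counts = Counter(ops)
    return [counts[op] for op in vocabulary]

def fingerprint_functions(functions):
    """Map each function name to its operation-count vector."""
    vocabulary = build_vocabulary(functions)
    vectors = {
        name: count_vector(ops, vocabulary)
        for name, ops in functions.items()
    }
    return vocabulary, vectors

# Hypothetical lifted output for two functions.
lifted = {
    "func_a": ["LOAD", "ADD", "STORE", "ADD", "RET"],
    "func_b": ["LOAD", "CMP", "JUMP_IF", "RET"],
}

vocabulary, vectors = fingerprint_functions(lifted)
print(vocabulary)         # ['ADD', 'CMP', 'JUMP_IF', 'LOAD', 'RET', 'STORE']
print(vectors["func_a"])  # [2, 0, 0, 1, 1, 1]
print(vectors["func_b"])  # [0, 1, 1, 1, 1, 0]
```

Because the vectors ignore addresses and depend only on operation counts, every function yields a vector of the same length over a shared vocabulary, which is the fixed-width input that many machine learning models expect.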
Applications and Industries
• Binary analysis
• Many platforms can be supported.
• It is possible to add support for new platforms as needed.
• Allows the use of machine learning techniques and models that have not been successfully used before.