%IDA_DIR%\idaw.exe -A -Sidb2jsin.py bad.(exe|dll|sys|etc)

idb2jsin performs the following steps on the IDB. For each function in the IDB we walk through each line of code, normalize it, count the occurrences of each normalized line, and save the counts as a dictionary in the jsin. The output is saved in the working directory as MD5.jsin, where MD5 is the hash of the original analyzed file. Note: if we know the file belongs to a particular family of malware we should rename MD5.jsin to "malware-family.jsin". This will help with identification. Later we will use the normalized lines and their counts as vectors and compare them using cosine similarity. A nice side effect of the normalization is that we do not have to worry about rebuilding the import tables, because the addresses are removed. This is useful when working with dumped executables. The "jsin" files are JSON, but the extension has been changed to give them a unique name. With a unique file extension we can scan whole directories for files to compare.
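To make the counting step concrete, here is a minimal sketch of how one function's normalized lines turn into the dictionary stored in a jsin file. It runs outside IDA and is written for Python 3, whereas the scripts in this post target the Python 2 that ships with IDAPython. The function name sub_401000, its instructions, the all-zero MD5 and the helpers normalize and build_jsin are made up for illustration; only the "mnemonic op1 op2" string layout and the top-level "MD5" key follow idb2jsin.py.

```python
import json
from collections import Counter


def normalize(mnemonic, op1="", op2=""):
    # Mirrors the "mnemonic op1 op2" string layout built by idb2jsin.py.
    # The operands passed in below are assumed to be generalized already
    # (for example o_near instead of a concrete call target).
    return mnemonic + " " + op1 + " " + op2


def build_jsin(md5, functions):
    """Return the dictionary that would be dumped to <md5>.jsin."""
    jsin = {"MD5": md5}
    for name, lines in functions.items():
        # Counter tallies how often each normalized line occurs.
        jsin[name] = dict(Counter(lines))
    return jsin


# Hypothetical function body; in the real workflow these lines come from IDA.
functions = {
    "sub_401000": [
        normalize("push", "ebp"),
        normalize("mov", "ebp", "esp"),
        normalize("call", "o_near"),
        normalize("mov", "eax", "o_mem"),
        normalize("call", "o_near"),
        normalize("retn"),
    ],
}

PLACEHOLDER_MD5 = "0" * 32  # stand-in for the hash of the analyzed file
jsin = build_jsin(PLACEHOLDER_MD5, functions)
print(json.dumps(jsin, indent=2, sort_keys=True))
```

Because the two calls collapse to the same normalized line, the function contributes a count of 2 for that line no matter where the calls point. That is why relocated or dumped code can still line up.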
Disclaimer
Using cosine similarity to compare executable code is nothing new or original on my part. There are academic papers dating back to 2004 discussing this technique. Also, Halvar Flake was presenting similar techniques during this time frame for BinDiff.
The second script is cospare.py. The name is a play on the words "cosine" and "compare". The script compares two ".jsin" files to see how well they match. It does not rely on IDAPython and is executed from the command line. The script compares each function in one jsin file to each function in the other jsin file. There are four modes. The first, with no options, compares two jsin files and prints a summary. The second is -s or --simple. It prints "yes" or "no" depending on whether the files match. The third is -v or --verbose. It displays all functions that match. The last mode, -m or --multiple, scans a .jsin against a directory. The script searches the directory recursively for all files with the jsin extension and compares each one to the given file. Below we can see an example of each option and its output.
C:\Users\___\_____\Projects\compare\cospare.py a.jsin b.jsin
Total Function count in a.jsin: 286
Total Function count in b.jsin: 328
Total Matches found 225
Overall function matches 78.67%

C:\Users\___\_____\Projects\compare\cospare.py -s a.jsin b.jsin
yes

C:\Users\___\_____\Projects\compare\cospare.py -v a.jsin b.jsin
sub_100022B4 matches sub_1000275A 99.08% with a size difference 0.00%
sub_10005575 matches sub_10005391 100.00% with a size difference 0.00%
sub_10006BDC matches sub_100069E7 99.38% with a size difference 0.00%
sub_1000E5E4 matches sub_1000E510 100.00% with a size difference 0.00%
sub_10007770 matches sub_1000757B 99.11% with a size difference 0.00%
sub_1000CC41 matches sub_1000CB6D 100.00% with a size difference 0.00%
sub_10009DE6 matches sub_10009D12 99.71% with a size difference 0.00%
sub_1000D213 matches sub_1000D13F 98.57% with a size difference 0.00%
sub_10005DBF matches sub_10005BD9 100.00% with a size difference 0.00%
sub_1000C588 matches sub_1000C4B4 100.00% with a size difference 0.00%
sub_1000CA28 matches sub_1000C954 100.00% with a size difference 0.00%
sub_1000864B matches sub_100081BB 99.05% with a size difference 2.90%
sub_1000864B matches sub_100083B5 99.76% with a size difference 0.00%
sub_10009A50 matches sub_1000997C 100.00% with a size difference 0.00%
sub_1000A556 matches sub_1000A482 99.28% with a size difference 0.00%
sub_1000C44E matches sub_1000C37A 100.00% with a size difference 0.00%
sub_100077EB matches sub_100075F6 99.42% with a size difference 0.00%
sub_10003D67 matches sub_1000420D 98.97% with a size difference 0.00%
.......
sub_1000E503 matches sub_1000E42F 100.00% with a size difference 0.00%
sub_1000543B matches sub_1000525A 98.45% with a size difference 0.00%
sub_1000A78C matches sub_1000A6B8 100.00% with a size difference 0.00%
sub_100089F8 matches sub_10008766 98.04% with a size difference 6.67%
sub_1000DC86 matches sub_1000DBB2 97.67% with a size difference 0.00%
Total Function count in a.jsin: 286
Total Function count in b.jsin: 328
Total Matches found 225
Overall function matches 78.67%

C:\Users\___\_____\Projects\compare\tree
Folder PATH listing
Volume serial number is _____-______
C:.
+---New folder
    +---a
    +---b
    +---c

C:\Users\___\_____\Projects\compare\cospare.py -m a.jsin .
a.jsin matches .\b.jsin
a.jsin matches .\New folder\a\b.jsin
a.jsin matches .\New folder\c\b.jsin
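The percentages in the output above come from a per-function cosine similarity followed by a file-level cutoff. The sketch below shows that arithmetic on two hand-made count dictionaries. It is written for Python 3 and is not the script itself: the function contents are invented, and the helper names cosine_similarity, size_difference, functions_match and files_match are my own for this example. The thresholds (at least 5 unique lines, at most 15% size difference, more than 95% similarity, more than 65% of functions matched) are the values used in cospare.py.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine of the angle between two {normalized_line: count} vectors."""
    dot = sum(count * b[line] for line, count in a.items() if line in b)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty function cannot resemble anything
    return dot / (norm_a * norm_b)


def size_difference(a, b):
    """Relative difference in the number of unique normalized lines."""
    longer, shorter = max(len(a), len(b)), min(len(a), len(b))
    return longer / shorter - 1


def functions_match(a, b, min_lines=5, max_size_diff=0.15, min_sim=0.95):
    # Very small functions are skipped because they are prone to false positives.
    if len(a) < min_lines or len(b) < min_lines:
        return False
    if size_difference(a, b) > max_size_diff:
        return False
    return cosine_similarity(a, b) > min_sim


def files_match(match_count, function_count, threshold=0.65):
    """File-level verdict: matches found relative to functions in the first file."""
    return match_count / function_count > threshold


# Hypothetical functions: f2 is f1 with one call removed, f3 is unrelated.
f1 = {"push ebp ": 1, "mov ebp esp": 1, "call o_near ": 3, "mov eax o_mem": 2, "retn  ": 1}
f2 = {"push ebp ": 1, "mov ebp esp": 1, "call o_near ": 2, "mov eax o_mem": 2, "retn  ": 1}
f3 = {"xor eax eax": 4, "inc ecx ": 1, "jmp o_near ": 1, "nop  ": 1, "retn  ": 1}

print("f1 vs f2: {0:.2%}".format(cosine_similarity(f1, f2)))
print("f1 vs f3: {0:.2%}".format(cosine_similarity(f1, f3)))
print(functions_match(f1, f2), functions_match(f1, f3))
print(files_match(225, 286))
```

With the numbers from the sample run, 225 matches against the 286 functions in a.jsin is 78.67%, which clears the 65% cutoff and explains the "yes" from -s. The verbose listing shows sub_1000864B matching two different functions, so the match count appears to tally matching pairs rather than unique functions.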
2nd Disclaimer
As always, use at your own risk. I do not have a background in mathematics or statistics. Some of the values were chosen because they 'felt' right rather than actually being right. Odds are there are some minor bugs, but overall the code works. I have had good success with my sample set. I will add all updates to the BitBucket repo. The code is free to use for non-commercial purposes. For commercial use you will need to buy me a book from my Amazon Wish List. Seems like a fair trade to me ;)
For any questions, concerns or thoughts, please leave a comment or shoot me an email.
BitBucket Repo - LINK
idb2jsin.py
########################################################################
# Created by Alexander Hanel <alexander.hanel<at>gmail<dot>com>
# Version: 1.0
# Date: November Something 2012
# This file is part of cospare.py. A tool that is used for comparing
# microsoft executable functions using normalization of x86 assembly and
# cosine similarity. The script relies on IDA to create the ".jsin"
# output. The output file is then compared to another output file. If
# there are matches in functions they will be added to the count. This
# script (idb2jsin.py) is to be passed to IDA via the command line to
# create the output MD5.jsin file. The extension '.jsin' is a json file.
# The extension is unique so it can be used when scanning a directory.
# The scanned executable does not need to have the import table rebuilt.
# To create the output run the following command line on the PE file.
# Command line option
# %IDA_DIR%\idaw.exe -A -Sidb2jsin.py
########################################################################

import idautils
import idc
import idaapi
from itertools import izip
import json

class Parse():
    def __init__(self):
        self.ea = ScreenEA()
        # operand types that are replaced by a generic label
        self.opTypes = { 0:'', 2:'o_mem', 3:'o_phrase', 4:'o_displ', 6:'o_far', 7:'o_near'}
        self.function_eas = []
        self.getFunctions()

    def instructionCount(self, instructionList):
        'gets the unique count of each instruction line in a function'
        count = {}
        for mnem in instructionList:
            if mnem in count:
                count[mnem] += 1
            else:
                count[mnem] = 1
        # returns dictionary { sub_func { unique_norm_instruction1: count_value, unique_norm_instruction2: count_value2}}
        return count

    def getFunctions(self):
        'get a list of function addresses'
        for func in idautils.Functions():
            # Ignore Library Code
            flags = GetFunctionFlags(func)
            if flags & FUNC_LIB:
                continue
            self.function_eas.append(func)

    def getInstructions(self, function):
        'get all instructions in a function'
        buff = []
        for x in FuncItems(function):
            buff.append(self.normalize(x))
        return buff

    def normalize(self, i_ea):
        'Normalize the instructions'
        line = ''
        op1 = GetOpType(i_ea, 0)
        op2 = GetOpType(i_ea, 1)
        # use the generic label if there is one, else the operand text
        if self.opTypes.get(op1):
            op1 = self.opTypes.get(op1)
        else:
            op1 = GetOpnd(i_ea, 0)
        if self.opTypes.get(op2):
            op2 = self.opTypes.get(op2)
        else:
            op2 = GetOpnd(i_ea, 1)
        return GetMnem(i_ea) + ' ' + op1 + ' ' + op2

    def run(self):
        'start'
        funcBuffer = []
        jsonDict = []
        md5 = GetInputFileMD5()
        jsonDict.append('MD5')
        jsonDict.append(md5)
        for func in self.function_eas:
            jsonDict.append(GetFunctionName(func))
            fun = idaapi.get_func(func)
            # get instructions of a function
            funcBuffer = self.getInstructions(fun.startEA)
            funcCount = self.instructionCount(funcBuffer)
            jsonDict.append(funcCount)
        # convert list to dict
        # source: http://stackoverflow.com/a/4576128
        tmp = iter(jsonDict)
        jsonDict = dict(izip(tmp,tmp))
        out = open(md5 + '.jsin', 'wb')
        # dump dict to json
        json.dump(jsonDict,out)
        out.close()

if __name__ == '__main__':
    idaapi.autoWait()
    x = Parse()
    x.run()
    idc.Exit(0)
cospare.py
#!/usr/bin/python
########################################################################
# Created by Alexander Hanel <alexander.hanel<at>gmail<dot>com>
# Version: 1.0
# Date: November Something 2012
# This file is part of cospare.py. A tool that is used for comparing
# microsoft executable functions using normalization of x86 assembly and
# cosine similarity. This script relies on the output from idb2jsin.py
# For usage information please execute the script.
########################################################################

from math import sqrt
import sys
import json
from optparse import OptionParser
import os
import fnmatch

class coSim():
    def __init__(self, ajson, bjson):
        # shhhhh...
        # http://www.youtube.com/watch?v=U6dxYka2tRk
        self.a = self.loadJsons(ajson)
        self.b = self.loadJsons(bjson)
        self.matches = []
        self.count = 0
        self.findMatches()

    def loadJsons(self, _json):
        'load the json file into memory'
        with open(_json, 'rb') as f:
            try:
                loadedJson = json.load(f)
            except ValueError:
                # stop here, the comparison cannot run without both files
                print 'Error: JSON load failed for %s' % _json
                sys.exit(1)
        return loadedJson

    def scalar(self, collection):
        # Source https://gist.github.com/288282
        total = 0
        for coin, count in collection.items():
            total += count * count
        return sqrt(total)

    def similarity(self, A,B): # A and B are coin collections
        # Source https://gist.github.com/288282
        total = 0
        for kind in A: # kind of coin
            if kind in B:
                total += A[kind] * B[kind]
        return float(total) / (self.scalar(A) * self.scalar(B))

    def differenceSize(self,A,B):
        'relative difference in the number of unique normalized lines'
        aLen = float(len(A))
        bLen = float(len(B))
        di = 0.0
        if aLen < bLen:
            di = bLen/aLen - 1
        else:
            di = aLen/bLen - 1
        return di

    def findMatches(self):
        'finds matching functions'
        # the md5 is present if needed, must be deleted though
        del self.a['MD5']
        del self.b['MD5']
        for k, v in self.a.iteritems():
            # functions with less than five instructions are prone to false positives
            if len(v) < 5:
                continue
            for key, value in self.b.iteritems():
                if len(value) < 5:
                    continue
                diff = float(self.differenceSize(v, value))
                if diff > .15:
                    continue
                o = self.similarity(v, value)
                # similarity percent can be adjusted
                if o > 0.95:
                    if k != key:
                        formatted = "{0:.2%}".format(diff)
                        simm = "{0:.2%}".format(o)
                        self.matches.append('%s matches %s %s with a size difference %s' % (k,key,simm,formatted))
                        self.count += 1

class dirdir():
    def __init__(self, dirArg):
        self.paths = []
        self.directory = dirArg
        self.findJsin()

    def findJsin(self):
        'get the path of all files'
        for root, dirs, files in os.walk(self.directory):
            for basename in files:
                if fnmatch.fnmatch(basename, '*.jsin'):
                    self.paths.append(os.path.join(root,basename))

if __name__ == '__main__':
    # yeah this area is a little ugly. I wasted more time thinking about the flow
    # than I did coding the whole program. After a certain point I just gave up
    # and started coding. Comments are welcome.
    parser = OptionParser()
    # setup command line options
    parser.add_option('-s', '--simple', action='store_true', dest='simple', help='Displays if the files match with yes or no output')
    parser.add_option('-v', '--verbose', action='store_true', dest='verbose', help='will display all the functions that match')
    parser.add_option('-m', '--multiple', action='store_true', dest='multiple', help='Attempts to recursively compare all jsin in a dir, -m single.json <path>')
    (options, args) = parser.parse_args()

    # check to make sure we have correct arguments
    if len(args) == 0:
        parser.print_help()
        sys.exit()

    # check args and variables for -m or --multiple
    if options.multiple != None:
        if len(args) == 2:
            # Get a list of paths that contain *.jsin in the file name
            d = dirdir(args[1])
            jsinPaths = d.paths
            for jsin in jsinPaths:
                sim = coSim(args[0], jsin)
                if float(sim.count)/float(len(sim.a)) > .65:
                    print "%s matches %s" % (args[0], jsin)
            sys.exit()
        else:
            parser.print_help()
            sys.exit()

    # validate that each argument is an accessible object
    for file in args:
        f = 0
        try:
            with open(file) as f: pass
        except IOError as e:
            print "%s" % e
            parser.print_help()
            sys.exit()

    # validate args and compare two files
    if len(args) == 2:
        sim = coSim(args[0], args[1])
    else:
        parser.print_help()
        sys.exit()

    # if verbose print all the matches
    if options.verbose != None:
        for match in sim.matches:
            print match

    # validate args and compare that 65% of file1 is similar to file2
    if options.simple != None:
        if float(sim.count)/float(len(sim.a)) > .65:
            print 'yes'
            sys.exit()
        else:
            print 'no'
            sys.exit()
    else:
        # default print about the files
        print "Total Function count in %s: %s" % (args[0],len(sim.a))
        print "Total Function count in %s: %s" % (args[1], len(sim.b))
        print "Total Matches found %s" % sim.count
        print "Overall function matches %s" % "{0:.2%}".format(float(sim.count)/float(len(sim.a)))
Hi,
Nice post! Only one comment. You wrote "does not seem to be many tools in the public realm". You may check a Pyew script I wrote (gcluster.py [1]) which is distributed with Pyew by default [2].
[1] https://code.google.com/p/pyew/source/browse/gcluster.py
[2] http://pyew.googlecode.com