DISCLAIMER
The script has been made kind of ad-hoc using different samples that I have come across. It wasn't created for mass use but for experimenting and exploring. Odds are it will return a false positives or the output won't be of value. The script contains an analysis class that can be used to parse out bad results or apply logic if needed.
In the above image we can see that IDA found the function sub_481678 but it didn't identify the code below as a function. If we were to right click, Create function... or press 'p' at 0x04816D4 IDA would mark this block of code as a function. If we then right click Jump to xref to operand... or press X we would see the following.
If we kept up the process of creating a function, then check the cross-reference, create another function, etc, we would hit a C++ jump table. Since the initial analysis did not identify the parent function the child function was marked as undefined. This is a good example of how delicate recursive traversal can be. One function reference is missed and a whole block of functions are undefined. Typically IDA does a good job of identifying all the functions and the functions ranges. Since the initial analysis populates a core set of the functions it's easy to piggy back off the function set an verify the boundaries. Functions and perilogue have characteristics such as a perilogue has start and end address, another function resides below or above (unless first or last) and some compilers have bytes that can be used to identify boundaries of a perilogue. We can use these characteristics to identify undefined functions algorithmically or code that IDA doesn't correctly disassemble. Let's explore some of the output
addr:405748 from:down type:code Func End:405749 addr:405749 from:up type:align Func End:Unknown addr:405995 from:down type:code Func End:405996 addr:405996 from:up type:align Func End:Unknown addr:405f04 from:down type:data Func End:Unknown addr:405f25 from:up type:align Func End:Unknown addr:406158 from:down type:align-c Func End:Unknown addr:406194 from:up type:data Func End:Unknown addr:406380 from:down type:align-c Func End:40638a addr:40640d from:up type:code Func End:406410 addr:406844 from:down type:align-c Func End:406897 addr:40690b from:up type:align Func End:Unknown addr:406a14 from:down type:align-c Func End:406a42 addr:406a4b from:up type:code Func End:406a4c addr:406a8c from:down type:code Func End:406aba addr:406ac3 from:up type:code Func End:406ac4 addr:406fd4 from:down type:align-c Func End:406fda addr:406fda from:up type:align Func End:Unknown addr:4075d4 from:down type:align-c Func End:Unknown addr:408548 from:up type:data Func End:Unknown addr:408a58 from:down type:align-c Func End:Unknown
- "addr": is the address of the data/code/etc that is not in a function.
- "from": identifies the reference point of how the function was found.
- "type": identities the type of data.
- "Func End": attempts to find the end of the function by calling FindFuncEnd
The output is addr:405748 from:down type:code Func End:405749. The addr is the location of our undefined function. from is down because we used the end address 0x0405747 of the known function sub_40572C to check if the following address belongs to another known function. Since the address is marked as red we can see the code does not belong to a function. The type shows that IDA has identified the address as code. The Func End gives the estimated end of the function. If we look at the next line we can see addr:405749 from:up type:align Func End:Unknown. Once again we have the address, the from is up because we found the undefined data by checking if a function exists directly above the function sub_40574C. The type is an align byte. The Func End is Unknown because IDA did not identify the ending. In this example we could use the from:down and from:up addresses to estimate the start and end of the function.
The key word in that last sentence is estimate. This is actually where things get very interesting from a compiler exploration standpoint. Estimating the range of a function is not as easy as it sounds. The first problem that I encountered was alignment bytes. The best description on the World Wide Web on why alignment bytes are used was written by Agner Fog in Optimizing subroutines in assembly language [PDF].
"Most microprocessors fetch code in aligned 16-byte or 32-byte blocks. If an important subroutine entry or jump label happens to be near the end of a 16-byte block then the microprocessor will only get a few useful bytes of code when fetching that block of code. It may have to fetch the next 16 bytes too before it can decode the first instructions after the label. This can be avoided by aligning important subroutine entries and loop entries by 16. Aligning by 8 will assure that at least 8 bytes of code can be loaded with the first instruction fetch, which may be sufficient if the instructions are small."In the image above at address 00405749 we can see what IDA has marked as align. The three bytes are a three byte NOP. There are three patterns of alignment bytes that I have found. The most common are the bytes "90 90 90" and "CC CC CC". The byte count/length will vary depending on the size needed for alignment. These two bytes seem to be the favorite of Microsoft compilers. The third is a multi-byte no operation instruction. If the bytes "8D 40 00" from 00405749 were disassembled they would equal the instruction "lea eax, [eax+0]". Below is a list of multi-byte no operations that were found in a executable compiled with MINGW.
Size 14 8D B4 26 00 00 00 00 8D BC 27 00 00 00 00 8D B4 26 00 00 00 00 lea esi, [esi+0] 8D BC 27 00 00 00 00 lea edi, [edi+0] ---------------------------------------------------------------------- Size 12 8D B6 00 00 00 00 8D BF 00 00 00 00 8D B6 00 00 00 00 lea esi, [esi+0] 8D BF 00 00 00 00 lea edi, [edi+0] ---------------------------------------------------------------------- Size 11 8D 74 26 00 8D BC 27 00 00 00 00 8D 74 26 00 lea esi, [esi+0] 8D BC 27 00 00 00 00 lea edi, [edi+0] ---------------------------------------------------------------------- Size 10 8D 76 00 8D BC 27 00 00 00 00 8D 76 00 lea esi, [esi+0] 8D BC 27 00 00 00 00 lea edi, [edi+0] ---------------------------------------------------------------------- Size 9 89 F6 8D BC 27 00 00 00 00 89 F6 mov esi, esi 8D BC 27 00 00 00 00 lea edi, [edi+0] ---------------------------------------------------------------------- Size 8 90 8D B4 26 00 00 00 00 90 nop 8D B4 26 00 00 00 00 lea esi, [esi+0] ---------------------------------------------------------------------- Size 7 8D B4 26 00 00 00 00 90 nop 8D B4 26 00 00 00 00 lea esi, [esi+0] ---------------------------------------------------------------------- Size 6 8D B6 00 00 00 00 8D B6 00 00 00 00 lea esi, [esi+0] ---------------------------------------------------------------------- Size 5 90 8D 74 26 00 90 nop 8D 74 26 00 lea esi, [esi+0] ---------------------------------------------------------------------- Size 4 8D 74 26 00 8D 74 26 00 lea esi, [esi+0] ---------------------------------------------------------------------- Size 3 8D 76 00 8D 76 00 lea esi, [esi+0] ---------------------------------------------------------------------- Size 2 66 90 66 90 xchg ax, ax ---------------------------------------------------------------------- Size 1 90 nop
The issue with alignment bytes is it can be difficult to estimate the size. IDA only marks the first byte as align byte. All further bytes will be marked as data. It's easy to estimate the size if all the bytes are 90. Note: If IDA doesn't mark the byte as align this approach gets even more tricky. We can scan until the alignment byte is no longer found but what about multi-byte nops? The start of the function can be estimated if we calculate the optimal byte alignment address.
def calc_dist(self,addr): "calcuates the distane of the next aligned address" dist = int(addr % 16) if dist < 4: return 4 - dist if dist < 8: return 8 - dist if dist < 12: return 12 - dist if dist < 16: return 16 - dist else: return None
This would work if it's 16 byte align which is very common if not default when compiled with Visual Studio. undefinder.py would have align-c in the output if the address was calculated. In the MINGW sample the calculation would fail because the author passed -falign-functions=32 as an argument. This wouldn't be hard to modify or add some more code to check for this but it is a good example of all the little nuances that exist between different compilers.
Along with byte alignment, estimating the start and end of a function adds a nice challenge. It's even more challenging when you didn't realize there was a function named FindFuncEnd. There is a notable flaw in using the boundaries of two known functions to define the boundaries of another one. How would we know there is not more than one undefined functions in the range? Starting from the bottom of a known function and working down is the easiest path.
I tried three approaches that varied on success depending on the compilers. The first approach was to search for a return instruction followed by an align byte.
We can see the last address of the function printed in the output.
def find_align_ret(self, start, end): cur = start while(True): if 'ret' in GetMnem(cur) : if isAlign(idaapi.getFlags(NextAddr(cur))): return cur cur = NextAddr(cur) if cur >= end: return None
The second search was to calculate the inverse of the first step in the function prologue. In theory, if some register is pushed on to the stack first, it's going to be popped last.
We would then loop through each instruction until we found pop esi and then find the return. This works in most situations but the formatting and tracking of positive and negative values can be annoying.
def inverse(self,addr): inverse = "" if GetMnem(addr) == "push" and GetOpnd(addr,0) == "esp": inverse = "pop " + GetOpnd(addr,0) if GetMnem(addr) == "push" and GetOpType(addr, 0) == 1 and GetOpnd(addr,0) != "esp": inverse = "pop " + GetOpnd(addr,0) if GetMnem(addr) == "sub"and GetOpnd(addr,0) == "esp": inverse = "add " + GetOpnd(addr,0) + ", " + hex(GetOperandValue(addr,1))[2:] + 'h' if GetMnem(addr) == "add" and GetOpnd(addr,0) == "esp": inverse = "sub " + GetOpnd(addr,0) + ", " + GetDisasm(addr).split(",")[-1] if GetMnem(addr) == "mov" and GetOpnd(addr, 0) == "ebp" and GetOpnd(addr,1) == "esp": # can be used to find the end of the function prologue inverse = "mov ebp, esp" return inverse
The last search was to simply search for the hot patching bytes mov edi, edi to find the start of a function.
def test_mov_edi2_find(self, addr): 'return the address of the next mov edi, edi. Distance needs to be validated' if GetMnem(addr) == "mov" and GetOpnd(addr, 0) == "edi" and GetOpnd(addr, 1) == "edi": mov_addr = FindBinary(addr +1, SEARCH_DOWN, "8B FF") if mov_addr == BADADDR: return None else: return mov_addr else: return None
These type of searches work with specific compilers but bounds and error checking would still need to be done in order to determine if it safe range. If the code within the calculated range is simple, it might be okay to mark it as a function automatically. The cyclomatic complexity algorithm could be used to determine the simplicity of a function. The same concepts can be applied for search the start of a function or from the bottom up.
One great thing about this script is it allows you to find code that wasn't disassembled properly. It's will find small blocks of code between functions that shouldn't be there. Odds are these techniques could be used after the recursive disassembly algorithm has executed to verify the results or to add undefined functions to a post-analysis queue.
Please feel free to email (source code below), leave a comment or ping me on twitter with any questions or thoughts. Cheers.
Code
Please download the source code from it's repo. The below code is for this post only and will not be updated.
''' Name: undefinder.py Version: 0.2 Description: IDAPython script that searches for data between known functions Author: alexander<dot>hanel<at>gmail<dot>com License: undefinder.py is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>. ''' class undefinder: def __init__(self): # self.firstList contains a dictionary of # { "id":function_name, "start":function_start_addr, "end":function_end_addr self.firstList = [] self.modList = [] self.populate_mods() def getKnownFunctions(self): "self.firstList is populated with each item in a list with a dictionary of { 'id':function_name, 'start':function_start_addr, 'end':function_end_addr" for funcea in Functions( SegStart( here() ), SegEnd( here() ) ): self.firstList.append({"id":GetFunctionName(funcea), "start":GetFunctionAttr(funcea, FUNCATTR_START), "end":GetFunctionAttr(funcea, FUNCATTR_END)}) return def calc_dist(self,addr): "calcuates the distane of the next aligned address" dist = int(addr % 16) if dist < 4: return 4 - dist if dist < 8: return 8 - dist if dist < 12: return 12 - dist if dist < 16: return 16 - dist else: return None def check_down(self, func_addr): "checks if the data below is a function or other type of data" cur_func = idaapi.get_func(func_addr) cur_end = cur_func.endEA next_func = idaapi.get_next_func(func_addr) next_start = None # if next_func == None the last function has been found # the try/except is used because IDAPython was crashing on # the next_func == None comparison if the value was not None try: if next_func == None: next_start = NextAddr(cur_end) except: next_start = next_func.startEA # get start of the next function if cur_end == next_start: return None, None # check if the block is already part of a function if GetFunctionName(cur_end) != "": return None, None if isAlign(idaapi.getFlags(cur_end)): # need to check for two types of alignment. The first is the 90 90.. or other recurring pattern # if the first two bytes don't match we can calculate the size of the alignment. abyte_check = idaapi.get_many_bytes(cur_end,2) # check if bytes match and if not 0. This could be changed to only check for 90 or CC if abyte_check[0] == abyte_check[1] and abyte_check[0] != "\x00": alignByte = idaapi.get_many_bytes(cur_end,1) # loop through align bytes till end while idaapi.get_many_bytes(cur_end,1) == alignByte: cur_end += 1 if GetFunctionName(cur_end + 1) != "": return None, None else: return cur_end, "align" else: distance = self.calc_dist(cur_end) # we can calculat the distance cur_end += distance if GetFunctionName(cur_end) != "": return None, None else: return cur_end, "align-c" # this is a sketchy check based off of abyte_check = idaapi.get_many_bytes(cur_end,2) # check if bytes match and if not 0. This could be changed to only check for 90 or CC if abyte_check[0] == abyte_check[1] and abyte_check[0] != "\x00": alignByte = abyte_check[0] while idaapi.get_many_bytes(cur_end,1) == alignByte: cur_end += 1 # prev addr equals start of the next function if GetFunctionName(cur_end + 1) != "": return None, None return cur_end, "align" # check if byte is code if isCode(idaapi.getFlags(cur_end)) and GetFunctionName(cur_end) == "": return cur_end, "code" # check if byte is data if isData(idaapi.getFlags(cur_end)): return cur_end, "data" # check if address is valid if cur_end == BADADDR: return None, None else: return cur_end, "unknown" def check_up(self, func_addr): cur_func = idaapi.get_func(func_addr) cur_start = cur_func.startEA prev_addr = PrevHead(cur_start) if prev_addr == BADADDR: return None, None # get start of the next function if GetFunctionName(prev_addr) != "": return None, None # quick check to see if from the end of the function # is an align byte tmp_prev = prev_addr for dist in range(0, 15): tmp_prev -= dist if isAlign(idaapi.getFlags(tmp_prev)): if GetFunctionName(tmp_prev-1) != "": return None, None else: return tmp_prev, "align" # check for align bytes that are static such as CC CC or 90 90 abyte_check = idaapi.get_many_bytes(prev_addr,2) if abyte_check != None: if abyte_check[0] == abyte_check[1] and abyte_check[0] != "\x00": alignByte = abyte_check[0] while idaapi.get_many_bytes(prev_addr,1) == alignByte: prev_addr -= 1 # prev_addr equals end of the previous function if GetFunctionName(PrevHead(prev_addr)) != "": return None, None else: return prev_addr, "align" # check if byte is code if isCode(idaapi.getFlags(prev_addr)) and GetFunctionName(prev_addr) == "": return prev_addr, "code" # check if byte is data if isData(idaapi.getFlags(prev_addr)): return prev_addr, "data" # check if byte is ascii if isASCII(idaapi.getFlags(prev_addr)): return prev_addr, "ascii" # check if address is valid if prev_addr == BADADDR: return None, None else: # Warning: replace "prev_addr" with "None" if large amounts of FPs # Calculating the beginning of align bytes isn't the easiet thing. return prev_addr, "unknown" def populate_mods(self): self.getKnownFunctions() for addr in self.firstList: up = self.check_up(addr['start']) if None not in up: self.modList.append(["up", up[0], up[1]]) down = self.check_down(addr['start']) if None not in down: self.modList.append(["down", down[0], down[1]]) def print_all(self): # { "id":function_name, "start":function_start_addr, "end":function_end_addr for block in self.modList: end = FindFuncEnd(block[1]) if end == BADADDR: print "addr:%-8x from:%-4s type:%-5s Func End:Unknown" % (block[1],block[0], block[2]) else: print "addr:%-8x from:%-4s type:%-5s Func End:%x" % (block[1],block[0], block[2], end) class analyze: # func = idaapi.get_next_func(here()) # block[0] is the direction; block[1] is the address, block[2] is the type def analysis_group(self, modList): # TO CALL # k = f.analyze # k.analysis_group(f.modlist) # align, code, data, ascii, align-c, unknown for block in modList: if block[0] == "up" and block[2] == "align": print 'called' if block[0] == "down" and block[2] == "align": pass if block[0] == "up" and block[2] == "align-c": pass if block[0] == "down" and block[2] == "align-c": # if down align-c = pointer will be at the start of the code pass if block[0] == "up" and block[2] == "code": pass if block[0] == "down" and block[2] == "code": pass if block[0] == "up" and block[2] == "data": pass if block[0] == "down" and block[2] == "data": pass if block[0] == "up" and block[2] == "ascii": pass if block[0] == "down" and block[2] == "ascii": pass if block[0] == "up" and block[2] == "unknown": pass if block[0] == "down" and block[2] == "unknown": pass def test_mov_edi2_find(self, addr): 'return the address of the next mov edi, edi. Distance needs to be validated' if GetMnem(addr) == "mov" and GetOpnd(addr, 0) == "edi" and GetOpnd(addr, 1) == "edi": mov_addr = FindBinary(addr +1, SEARCH_DOWN, "8B FF") if mov_addr == BADADDR: return None else: return mov_addr else: return None def inverse(self,addr): inverse = "" if GetMnem(addr) == "push" and GetOpnd(addr,0) == "esp": inverse = "pop " + GetOpnd(addr,0) if GetMnem(addr) == "push" and GetOpType(addr, 0) == 1 and GetOpnd(addr,0) != "esp": inverse = "pop " + GetOpnd(addr,0) if GetMnem(addr) == "sub"and GetOpnd(addr,0) == "esp": inverse = "add " + GetOpnd(addr,0) + ", " + hex(GetOperandValue(addr,1))[2:] + 'h' if GetMnem(addr) == "add" and GetOpnd(addr,0) == "esp": inverse = "sub " + GetOpnd(addr,0) + ", " + GetDisasm(addr).split(",")[-1] if GetMnem(addr) == "mov" and GetOpnd(addr, 0) == "ebp" and GetOpnd(addr,1) == "esp": # can be used to find the end of the function prologue inverse = "mov ebp, esp" return inverse def find_align_ret(self, start, end): cur = start while(True): if 'ret' in GetMnem(cur) : if isAlign(idaapi.getFlags(NextAddr(cur))): return cur cur = NextAddr(cur) if cur >= end: return None if __name__ == '__main__': f = undefinder() f.populate_mods() f.print_all()
References or good reads.
http://reverseengineering.stackexchange.com/questions/2347/what-is-the-algorithm-used-in-recursive-traversal-disassembly
https://jdebp.eu./FGA/function-perilogues.html
http://www.agner.org/optimize/optimizing_assembly.pdf
http://www.phrack.com/issues.html?issue=66&id=14
Hey again Alexander,
ReplyDeleteI did somewhat similar solve with my "ExtraPass" plug-in:
http://www.macromonkey.com/bb/viewtopic.php?f=65&t=703
I thought at first to use patterns too as a lot of epilog and prolog code appears to be consistently paternable.
But then most of that goes out the door if the target is compiled with some settings that omit call frames, and, or, profile guided link option, etc.
And then problems with Borland compiled executables (that you see less and less of now of days anyhow) that have a lot of data inlined with
code.
I ended up with something a lot more sloppier (yet still works well) that errors on the side code since I assumed it would be better to have more intact functions (vs some fouled up switch statement table data, etc).
After all what good is data from a missing function if you can't even see the function in the first place?
You can always fix up that data manually for a function once you find it (the significant ones).
I look at each gap between existing functions and if doesn't follow a data pattern I assume it's code and try to convert it as such then see if I can place a function there too.
And with alignment, also in those gaps I just look for a run of 0x90 or 0xCC's.
The biggest IDA function problem pattern appear to be those that have embedded exception handlers . These functions that don't return, etc.
IMHO something you are missing form your script here is to handle these big blocks of falsely declared data blocks (in the ".text" segments) that are actually code.
Another occasional IDA hiccup you will see these big "DB" or "DD" declared blocks out of the blue that typically contain one or even many functions in them.
In particular a problem in a lot of targets that have disconnected functions (like those derived from C++). They have no apparent entry point ref so IDA has a harder time seeing entry points.
One of the default optional passes in my plug-in fixes these by attempting to turn these block into code.
I like your more precise pattern matching. I could use this to get more accuracy on function boundaries, etc.
Thanks for the comments.
ReplyDelete> And with alignment, also in those gaps I just look for a run of 0x90 or 0xCC's.
I'd recommend checking for a JMP before the 0x90 or 0xCCs. Alignment bytes can be used in the middle of a perilogue to align JMPs.
> IMHO something you are missing form your script here is to handle these big blocks of falsely declared data blocks (in the ".text" segments) that are actually code.
Just a note, relying on the ".text" segment will not work on position independent code such as shellcode or code allocated on the heap without a PE header.
The capability to find the data is there. The hit would look something like this "addr:408548 from:up type:data Func End:Unknown". There are scenarios when the data wouldn't be found such as if no known functions sharing a boundaries with the data. It would be up to whoever is executing the script to come up with the logic to determine if it's valid code or not. In my testing determining if data is actually valid code is not trivial. There are the low hanging fruits of searching for byte codes or instructions but that only works on a small set. A full X86 assembler front end compiler would need to be written in order to determine if the instructions are valid or not.
> disconnected functions (like those derived from C++).
Yes, I came across those also. If you have a large set of those in a small range of addresses you can flag those as jump table or something to come back to.
As I stated this code is more of framework for exploring. There are far too many compiler and compiler options to be able to automatically find every undefined functions. I only needed a "good enough" approach.