Exploring Functions with Undefinder.py

undefinder.py is an IDAPython script that can be used to find data, code, alignment bytes or ASCII in undefined functions.  My original motivation was to create a script that would automatically create functions of undefined functions. After experimenting with executables compiled with different compilers I realized an accurate "click and I'm done" approach was not feasible.  The following is a random post of patterns, notes, code and observations that I observed. 

DISCLAIMER
The script has been made kind of ad-hoc using different samples that I have come across. It wasn't created for mass use but for experimenting and exploring. Odds are it will return a false positives or the output won't be of value. The script contains an analysis class that can be used to parse out bad results or apply logic if needed.

In the above image we can see that IDA found the function sub_481678 but it didn't identify the code below as a function.  If we were to right click, Create function... or press 'p' at 0x04816D4 IDA would mark this block of code as a function. If we then right click Jump to xref to operand... or press X we would see the following.
If we kept up the process of creating a function, then check the cross-reference, create another function, etc, we would hit a C++ jump table. Since the initial analysis did not identify the parent function the child function was marked as undefined. This is a good example of how delicate recursive traversal can be. One function reference is missed and a whole block of functions are undefined. Typically IDA does a good job of identifying all the functions and the functions ranges. Since the initial analysis populates a core set of the functions it's easy to piggy back off the function set an verify the boundaries. Functions and perilogue have characteristics such as a perilogue has start and end address, another function resides below or above (unless first or last) and some compilers have bytes that can be used to identify boundaries of a perilogue. We can use these characteristics to identify undefined functions algorithmically or code that IDA doesn't correctly disassemble. Let's explore some of the output

addr:405748   from:down type:code  Func End:405749
addr:405749   from:up   type:align Func End:Unknown
addr:405995   from:down type:code  Func End:405996
addr:405996   from:up   type:align Func End:Unknown
addr:405f04   from:down type:data  Func End:Unknown
addr:405f25   from:up   type:align Func End:Unknown
addr:406158   from:down type:align-c Func End:Unknown
addr:406194   from:up   type:data  Func End:Unknown
addr:406380   from:down type:align-c Func End:40638a
addr:40640d   from:up   type:code  Func End:406410
addr:406844   from:down type:align-c Func End:406897
addr:40690b   from:up   type:align Func End:Unknown
addr:406a14   from:down type:align-c Func End:406a42
addr:406a4b   from:up   type:code  Func End:406a4c
addr:406a8c   from:down type:code  Func End:406aba
addr:406ac3   from:up   type:code  Func End:406ac4
addr:406fd4   from:down type:align-c Func End:406fda
addr:406fda   from:up   type:align Func End:Unknown
addr:4075d4   from:down type:align-c Func End:Unknown
addr:408548   from:up   type:data  Func End:Unknown
addr:408a58   from:down type:align-c Func End:Unknown  
  • "addr": is the address of the data/code/etc that is not in a function.
  • "from": identifies the reference point of how the function was found. 
  • "type":  identities the type of data.
  • "Func End": attempts to find the end of the function by calling FindFuncEnd 
If we were to look at the code at "addr" 0x0405748 (first line in output) we would see the following.

The output is addr:405748   from:down type:code  Func End:405749. The addr is the location of our undefined function.  from is down because we used the end address 0x0405747 of the known function sub_40572C to check if the following address belongs to another known function. Since the address is marked as red we can see the code does not belong to a function. The type shows that IDA has identified the address as code. The Func End gives the estimated end of the function. If we look at the next line we can see addr:405749   from:up   type:align Func End:Unknown. Once again we have the address, the from is up because we found the undefined data by checking if a function exists directly above the function sub_40574C. The type is an align byte. The Func End is Unknown because IDA did not identify the ending. In this example we could use the from:down and from:up addresses to estimate the start and end of the function.

The key word in that last sentence is estimate. This is actually where things get very interesting from a compiler exploration standpoint. Estimating the range of a function is not as easy as it sounds. The first problem that I encountered was alignment bytes. The best description on the World Wide Web on why alignment bytes are used was written by Agner Fog in Optimizing subroutines in assembly language [PDF].
"Most microprocessors fetch code in aligned 16-byte or 32-byte blocks. If an important subroutine entry or jump label happens to be near the end of a 16-byte block then the microprocessor will only get a few useful bytes of code when fetching that block of code. It may have to fetch the next 16 bytes too before it can decode the first instructions after the label. This can be avoided by aligning important subroutine entries and loop entries by 16. Aligning by 8 will assure that at least 8 bytes of code can be loaded with the first instruction fetch, which may be sufficient if the instructions are small."
In the image above at address 00405749 we can see what IDA has marked as align. The three bytes are a three byte NOP. There are three patterns of alignment bytes that I have found. The most common are the bytes "90 90 90" and "CC CC CC". The byte count/length will vary depending on the size needed for alignment. These two bytes seem to be the favorite of Microsoft compilers. The third is a multi-byte no operation instruction. If the bytes "8D 40 00" from 00405749 were disassembled they would equal the instruction "lea eax, [eax+0]".  Below is a list of multi-byte no operations that were found in a executable compiled with MINGW.


Size 14
8D B4 26 00 00 00 00 8D BC 27 00 00 00 00

8D B4 26 00 00 00 00 lea esi, [esi+0]
8D BC 27 00 00 00 00 lea edi, [edi+0]
----------------------------------------------------------------------
Size 12
8D B6 00 00 00 00 8D BF 00 00 00 00

8D B6 00 00 00 00 lea esi, [esi+0]
8D BF 00 00 00 00 lea edi, [edi+0]
----------------------------------------------------------------------
Size 11
8D 74 26 00 8D BC 27 00 00 00 00

8D 74 26 00 lea esi, [esi+0]
8D BC 27 00 00 00 00 lea edi, [edi+0]
----------------------------------------------------------------------
Size 10
8D 76 00 8D BC 27 00 00 00 00

8D 76 00 lea esi, [esi+0]
8D BC 27 00 00 00 00 lea edi, [edi+0]
----------------------------------------------------------------------
Size 9
89 F6 8D BC 27 00 00 00 00

89 F6 mov esi, esi
8D BC 27 00 00 00 00 lea edi, [edi+0]
----------------------------------------------------------------------
Size 8
90 8D B4 26 00 00 00 00

90 nop
8D B4 26 00 00 00 00 lea esi, [esi+0]
----------------------------------------------------------------------
Size 7
8D B4 26 00 00 00 00

90 nop
8D B4 26 00 00 00 00 lea esi, [esi+0]
----------------------------------------------------------------------
Size 6
8D B6 00 00 00 00

8D B6 00 00 00 00 lea esi, [esi+0]
----------------------------------------------------------------------
Size 5
90 8D 74 26 00

90 nop
8D 74 26 00 lea esi, [esi+0]
----------------------------------------------------------------------
Size 4
8D 74 26 00

8D 74 26 00 lea esi, [esi+0]
----------------------------------------------------------------------
Size 3
8D 76 00

8D 76 00 lea esi, [esi+0]
----------------------------------------------------------------------
Size 2
66 90

66 90 xchg ax, ax
----------------------------------------------------------------------
Size 1
90 nop

The issue with alignment bytes is it can be difficult to estimate the size. IDA only marks the first byte as align byte. All further bytes will be marked as data. It's easy to estimate the size if all the bytes are 90. Note: If IDA doesn't mark the byte as align this approach gets even more tricky. We can scan until the alignment byte is no longer found but what about multi-byte nops? The start of the function can be estimated if we calculate the optimal byte alignment address.

 def calc_dist(self,addr):
        "calcuates the distane of the next aligned address" 
        dist = int(addr % 16)
        if dist < 4:
            return 4 - dist
        if dist < 8:
            return 8 - dist
        if dist < 12:
            return 12 - dist
        if dist < 16:
            return 16 - dist
        else:
            return None

This would work if it's 16 byte align which is very common if not default when compiled with Visual Studio. undefinder.py would have align-c  in the output if the address was calculated. In the MINGW sample the calculation would fail because the author passed  -falign-functions=32 as an argument. This wouldn't be hard to modify or add some more code to check for this but it is a good example of all the little nuances that exist between different compilers.

Along with byte alignment, estimating the start and end of a function adds a nice challenge.  It's even more challenging when you didn't realize there was a function named FindFuncEnd. There is a notable flaw in using the boundaries of two known functions to define the boundaries of another one. How would we know there is not more than one undefined functions in the range? Starting from the bottom of a known function and working down is the easiest path.

I tried three approaches that varied on success depending on the compilers. The first approach was to search for a return instruction followed by an align byte.


We can see the last address of the function printed in the output.

def find_align_ret(self, start, end):
            cur = start
            while(True):
                    if 'ret' in GetMnem(cur) :
                            if isAlign(idaapi.getFlags(NextAddr(cur))):
                                   return cur    
                    cur = NextAddr(cur)
                    if cur >= end:
                            return None 

The second search was to calculate the inverse of the first step in the function prologue. In theory, if some register is pushed on to the stack first, it's going to be popped last.


We would then loop through each instruction until we found pop esi and then find the return. This works in most situations but the formatting and tracking of positive and negative values can be annoying.

def inverse(self,addr):
            inverse = ""
            if GetMnem(addr) == "push" and GetOpnd(addr,0) == "esp":
                    inverse = "pop " + GetOpnd(addr,0)
            if GetMnem(addr) == "push" and GetOpType(addr, 0) == 1 and GetOpnd(addr,0) != "esp":
                    inverse = "pop " + GetOpnd(addr,0)
            if GetMnem(addr) == "sub"and GetOpnd(addr,0) == "esp":
                    inverse = "add " +  GetOpnd(addr,0) + ", " + hex(GetOperandValue(addr,1))[2:] + 'h'
            if  GetMnem(addr) == "add" and GetOpnd(addr,0) == "esp":
                    inverse = "sub " +  GetOpnd(addr,0) + ", " + GetDisasm(addr).split(",")[-1]
            if GetMnem(addr) == "mov" and GetOpnd(addr, 0) == "ebp" and GetOpnd(addr,1) == "esp":
                    # can be used to find the end of the function prologue
                    inverse = "mov ebp, esp"
            return inverse

The last search was to simply search for the hot patching bytes mov edi, edi to find the start of a function.



def test_mov_edi2_find(self, addr):
            'return the address of the next mov edi, edi. Distance needs to be validated' 
            if GetMnem(addr) == "mov" and GetOpnd(addr, 0) == "edi" and GetOpnd(addr, 1) == "edi":
                    mov_addr = FindBinary(addr +1, SEARCH_DOWN, "8B FF")
                    if mov_addr == BADADDR:
                            return None
                    else:
                            return mov_addr
            else:
                    return None

These type of searches work with specific compilers but bounds and error checking would still need to be done in order to determine if it safe range. If the code within the calculated range is simple, it might be okay to mark it as a function automatically. The cyclomatic complexity algorithm could be used to determine the simplicity of a function.  The same concepts can be applied for search the start of a function or from the bottom up. 

One great thing about this script is it allows you to find code that wasn't disassembled properly. It's will find small blocks of code between functions that shouldn't be there. Odds are these techniques could be used after the recursive disassembly algorithm has executed to verify the results or to add undefined functions to a post-analysis queue.

Please feel free to email (source code below), leave a comment or ping me on twitter with any questions or thoughts. Cheers.

Code
 Please download the source code from it's repo. The below code is for this post only and will not be updated.


'''
Name:
        undefinder.py

Version:
        0.2
        
Description:
  IDAPython script that searches for data between known functions

Author:
        alexander<dot>hanel<at>gmail<dot>com

License:
undefinder.py is free software: you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program.  If not, see
<http://www.gnu.org/licenses/>.

'''

class undefinder:
    def __init__(self):
        # self.firstList contains a dictionary of
        # { "id":function_name, "start":function_start_addr, "end":function_end_addr 
        self.firstList = []  
        self.modList = []
        self.populate_mods()
 
    def getKnownFunctions(self):
        "self.firstList is populated with each item in a list with a dictionary of { 'id':function_name, 'start':function_start_addr, 'end':function_end_addr" 
        for funcea in Functions( SegStart( here() ), SegEnd( here() ) ):
            self.firstList.append({"id":GetFunctionName(funcea), "start":GetFunctionAttr(funcea, FUNCATTR_START), "end":GetFunctionAttr(funcea, FUNCATTR_END)})
        return
    
    def calc_dist(self,addr):
        "calcuates the distane of the next aligned address" 
        dist = int(addr % 16)
        if dist < 4:
            return 4 - dist
        if dist < 8:
            return 8 - dist
        if dist < 12:
            return 12 - dist
        if dist < 16:
            return 16 - dist
        else:
            return None

    def check_down(self, func_addr):
        "checks if the data below is a function or other type of data"
        cur_func = idaapi.get_func(func_addr)
        cur_end = cur_func.endEA
        next_func = idaapi.get_next_func(func_addr)
        next_start = None
        # if next_func == None the last function has been found
        # the try/except is used because IDAPython was crashing on
        # the next_func == None comparison if the value was not None
        try:
            if next_func == None:
                next_start = NextAddr(cur_end)
        except:
            next_start = next_func.startEA
        # get start of the next function
        if cur_end == next_start:
            return None, None
        # check if the block is already part of a function
        if GetFunctionName(cur_end) != "":
            return None, None
        if isAlign(idaapi.getFlags(cur_end)):
            # need to check for two types of alignment. The first is the 90 90.. or other recurring pattern
            # if the first two bytes don't match we can calculate the size of the alignment.
            abyte_check = idaapi.get_many_bytes(cur_end,2)
            # check if bytes match and if not 0. This could be changed to only check for 90 or CC
            if abyte_check[0] == abyte_check[1] and abyte_check[0] != "\x00":
                alignByte = idaapi.get_many_bytes(cur_end,1) 
                # loop through align bytes till end
                while idaapi.get_many_bytes(cur_end,1) == alignByte:
                    cur_end += 1
                if GetFunctionName(cur_end + 1) != "":
                    return None, None  
                else:
                    return cur_end, "align"
            else:
                distance = self.calc_dist(cur_end)
                # we can calculat the distance 
                cur_end += distance
                if GetFunctionName(cur_end) != "":
                    return None, None  
                else:
                    return cur_end, "align-c"
        # this is a sketchy check based off of 
        abyte_check = idaapi.get_many_bytes(cur_end,2)
        # check if bytes match and if not 0. This could be changed to only check for 90 or CC
        if abyte_check[0] == abyte_check[1] and  abyte_check[0] != "\x00":
            alignByte = abyte_check[0]
            while idaapi.get_many_bytes(cur_end,1) == alignByte:
                cur_end += 1
            # prev addr equals start of the next function     
            if GetFunctionName(cur_end + 1) != "":
                return None, None
            return cur_end, "align"
        # check if byte is code
        if isCode(idaapi.getFlags(cur_end)) and GetFunctionName(cur_end) == "":
            return cur_end, "code"
        # check if byte is data
        if isData(idaapi.getFlags(cur_end)):
           return cur_end, "data"
        # check if address is valid
        if cur_end == BADADDR:
            return None, None
        else:
            return cur_end, "unknown" 
            
    def check_up(self, func_addr):
        cur_func = idaapi.get_func(func_addr)
        cur_start = cur_func.startEA
        prev_addr = PrevHead(cur_start)        
        if prev_addr == BADADDR:
            return None, None
        # get start of the next function
        if GetFunctionName(prev_addr) != "":
            return None, None
        # quick check to see if from the end of the function
        # is an align byte
        tmp_prev = prev_addr
        for dist in range(0, 15):
            tmp_prev -= dist
            if isAlign(idaapi.getFlags(tmp_prev)):
                if GetFunctionName(tmp_prev-1) != "":
                    return None, None
                else:
                    return tmp_prev, "align"
        # check for align bytes that are static such as CC CC or 90 90
        abyte_check = idaapi.get_many_bytes(prev_addr,2)
        if abyte_check != None:
            if abyte_check[0] == abyte_check[1] and abyte_check[0] != "\x00":
                alignByte = abyte_check[0]
                while idaapi.get_many_bytes(prev_addr,1) == alignByte:
                    prev_addr -= 1
                # prev_addr equals end of the previous function
                if GetFunctionName(PrevHead(prev_addr)) != "":
                        return None, None
                else:
                    return prev_addr, "align"
        # check if byte is code
        if isCode(idaapi.getFlags(prev_addr)) and GetFunctionName(prev_addr) == "":
            return prev_addr, "code"
        # check if byte is data
        if isData(idaapi.getFlags(prev_addr)):
           return prev_addr, "data"
        # check if byte is ascii
        if isASCII(idaapi.getFlags(prev_addr)):
            return prev_addr, "ascii"
        # check if address is valid
        if prev_addr == BADADDR:
            return None, None
        else:
            # Warning: replace "prev_addr" with "None" if large amounts of FPs
            # Calculating the beginning of align bytes isn't the easiet thing. 
            return prev_addr, "unknown"      

    def populate_mods(self):
        self.getKnownFunctions()
        for addr in self.firstList:
            up = self.check_up(addr['start'])
            if None not in up:
                self.modList.append(["up", up[0], up[1]])
            down = self.check_down(addr['start'])
            if None not in down:
                self.modList.append(["down", down[0], down[1]])

    def print_all(self):
    # { "id":function_name, "start":function_start_addr, "end":function_end_addr
        for block in self.modList:
                end = FindFuncEnd(block[1])
                if end == BADADDR:
                    print "addr:%-8x from:%-4s type:%-5s Func End:Unknown" % (block[1],block[0], block[2])
                else:
                    print "addr:%-8x from:%-4s type:%-5s Func End:%x" % (block[1],block[0], block[2], end)                    

class analyze:
 # func = idaapi.get_next_func(here())
 # block[0] is the  direction; block[1] is the address, block[2] is the type
    def analysis_group(self, modList):
  # TO CALL 
  # k = f.analyze
  # k.analysis_group(f.modlist)
        # align, code, data, ascii, align-c, unknown
        for block in modList:
            if block[0] == "up" and block[2] == "align":
                print 'called'
            if block[0] == "down" and block[2] == "align":
                pass
            if block[0] == "up" and block[2] == "align-c":
                pass
            if block[0] == "down" and block[2] == "align-c":
                # if down align-c = pointer will be at the start of the code
                pass
            if block[0] == "up" and block[2] == "code":
                pass
            if block[0] == "down" and block[2] == "code":
                pass
            if block[0] == "up" and block[2] == "data":
                pass
            if block[0] == "down" and block[2] == "data":
                pass
            if block[0] == "up" and block[2] == "ascii":
                pass
            if block[0] == "down" and block[2] == "ascii":
                pass
            if block[0] == "up" and block[2] == "unknown":
                pass
            if block[0] == "down" and block[2] == "unknown":
                pass


    def test_mov_edi2_find(self, addr):
            'return the address of the next mov edi, edi. Distance needs to be validated' 
            if GetMnem(addr) == "mov" and GetOpnd(addr, 0) == "edi" and GetOpnd(addr, 1) == "edi":
                    mov_addr = FindBinary(addr +1, SEARCH_DOWN, "8B FF")
                    if mov_addr == BADADDR:
                            return None
                    else:
                            return mov_addr
            else:
                    return None
                    
    def inverse(self,addr):
            inverse = ""
            if GetMnem(addr) == "push" and GetOpnd(addr,0) == "esp":
                    inverse = "pop " + GetOpnd(addr,0)
            if GetMnem(addr) == "push" and GetOpType(addr, 0) == 1 and GetOpnd(addr,0) != "esp":
                    inverse = "pop " + GetOpnd(addr,0)
            if GetMnem(addr) == "sub"and GetOpnd(addr,0) == "esp":
                    inverse = "add " +  GetOpnd(addr,0) + ", " + hex(GetOperandValue(addr,1))[2:] + 'h'
            if  GetMnem(addr) == "add" and GetOpnd(addr,0) == "esp":
                    inverse = "sub " +  GetOpnd(addr,0) + ", " + GetDisasm(addr).split(",")[-1]
            if GetMnem(addr) == "mov" and GetOpnd(addr, 0) == "ebp" and GetOpnd(addr,1) == "esp":
                    # can be used to find the end of the function prologue
                    inverse = "mov ebp, esp"
            return inverse
   
    
    def find_align_ret(self, start, end):
            cur = start
            while(True):
                    if 'ret' in GetMnem(cur) :
                            if isAlign(idaapi.getFlags(NextAddr(cur))):
                                   return cur    
                    cur = NextAddr(cur)
                    if cur >= end:
                            return None           
        
if __name__ == '__main__':
   f = undefinder()
   f.populate_mods()
   f.print_all()

References or good reads.
http://reverseengineering.stackexchange.com/questions/2347/what-is-the-algorithm-used-in-recursive-traversal-disassembly
https://jdebp.eu./FGA/function-perilogues.html
http://www.agner.org/optimize/optimizing_assembly.pdf
http://www.phrack.com/issues.html?issue=66&id=14

2 comments:

  1. Hey again Alexander,

    I did somewhat similar solve with my "ExtraPass" plug-in:
    http://www.macromonkey.com/bb/viewtopic.php?f=65&t=703

    I thought at first to use patterns too as a lot of epilog and prolog code appears to be consistently paternable.
    But then most of that goes out the door if the target is compiled with some settings that omit call frames, and, or, profile guided link option, etc.
    And then problems with Borland compiled executables (that you see less and less of now of days anyhow) that have a lot of data inlined with
    code.

    I ended up with something a lot more sloppier (yet still works well) that errors on the side code since I assumed it would be better to have more intact functions (vs some fouled up switch statement table data, etc).
    After all what good is data from a missing function if you can't even see the function in the first place?
    You can always fix up that data manually for a function once you find it (the significant ones).

    I look at each gap between existing functions and if doesn't follow a data pattern I assume it's code and try to convert it as such then see if I can place a function there too.

    And with alignment, also in those gaps I just look for a run of 0x90 or 0xCC's.

    The biggest IDA function problem pattern appear to be those that have embedded exception handlers . These functions that don't return, etc.

    IMHO something you are missing form your script here is to handle these big blocks of falsely declared data blocks (in the ".text" segments) that are actually code.
    Another occasional IDA hiccup you will see these big "DB" or "DD" declared blocks out of the blue that typically contain one or even many functions in them.
    In particular a problem in a lot of targets that have disconnected functions (like those derived from C++). They have no apparent entry point ref so IDA has a harder time seeing entry points.
    One of the default optional passes in my plug-in fixes these by attempting to turn these block into code.

    I like your more precise pattern matching. I could use this to get more accuracy on function boundaries, etc.

    ReplyDelete
  2. Thanks for the comments.

    > And with alignment, also in those gaps I just look for a run of 0x90 or 0xCC's.

    I'd recommend checking for a JMP before the 0x90 or 0xCCs. Alignment bytes can be used in the middle of a perilogue to align JMPs.

    > IMHO something you are missing form your script here is to handle these big blocks of falsely declared data blocks (in the ".text" segments) that are actually code.

    Just a note, relying on the ".text" segment will not work on position independent code such as shellcode or code allocated on the heap without a PE header.

    The capability to find the data is there. The hit would look something like this "addr:408548 from:up type:data Func End:Unknown". There are scenarios when the data wouldn't be found such as if no known functions sharing a boundaries with the data. It would be up to whoever is executing the script to come up with the logic to determine if it's valid code or not. In my testing determining if data is actually valid code is not trivial. There are the low hanging fruits of searching for byte codes or instructions but that only works on a small set. A full X86 assembler front end compiler would need to be written in order to determine if the instructions are valid or not.

    > disconnected functions (like those derived from C++).

    Yes, I came across those also. If you have a large set of those in a small range of addresses you can flag those as jump table or something to come back to.

    As I stated this code is more of framework for exploring. There are far too many compiler and compiler options to be able to automatically find every undefined functions. I only needed a "good enough" approach.

    ReplyDelete