reiat.py - Using Data-Flow to Track Dynamically Loaded APIs

Run-time dynamic linking is commonly seen when reverse engineering malware. It is useful when the programmer wants to load libraries on demand rather than have the Windows loader resolve them from the import table at startup. This technique can be used to save memory. A side effect of this approach is that the libraries and API names will not appear in the import table of the portable executable (PE) file. There are three steps in run-time dynamic linking. The first step is loading the library into memory, done via LoadLibrary. The second step is getting the address of the exported function via GetProcAddress, and the last is calling it.

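#include <stdio.h>
#include <windows.h>

/* Pointer type matching MessageBoxA, which returns int */
typedef int (WINAPI *PGNSI)(HWND, LPCSTR, LPCSTR, UINT);

int main(void)
{
    PGNSI pGNSI;
    /* Step 1: load the library into memory */
    HMODULE hdll = LoadLibraryA("User32.dll");
    if (NULL != hdll)
    {
        printf("User32 loaded at %p\n", (void *)hdll);
        /* Step 2: get the address of the exported function */
        pGNSI = (PGNSI)GetProcAddress(hdll, "MessageBoxA");
        /* Step 3: call it through the pointer */
        if (NULL != pGNSI)
            pGNSI(NULL, "MessageBoxA Was Called", "Yep", 0);
    }
    return 0;
}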
 
 
The above C code loads the library User32.dll, gets the address of MessageBoxA using GetProcAddress and then calls the function. If we were to compile the above C code in Visual Studio and view the assembly, we would get the following code.

Calling Locally Allocated Variables
In this simple example we can see that the return value (eax) of GetProcAddress is saved into [ebp+var_8] and then called. All the code is contained in a single function and the API is called via a local variable. In most situations the API address is stored in a global variable. Take for example the following code. The main thing to notice is that the API address is saved to a dword.

Calling Global Variables
In the xrefs to dword_1000F1E8 window we can see that we have no context for what the dword is. To find out what the dword holds, we would need to trace back to where the variable was populated (mov dword_1000F1E8, eax). Any malware analyst who has spent time reversing has come across this problem before. The first couple of times we rename all the dwords manually; later we realize this sucks and we should probably learn scripting in IDA. After that we make a bunch of one-off scripts to populate the values for us. This works well if the pattern is consistent and we want to rename the dwords. What if we didn't know the pattern? What if, rather than calling a dword, the code was calling a local variable and we wanted to comment it with the API name? Or better yet, what if we didn't want to write another script for renaming functions resolved at run time? I have my doubts the latter can be completed but we might as well try.

APIs loaded via run-time dynamic linking can typically be recovered in one of three ways. The first method deals with global variables. We first locate where the dword being called is the destination of a mov (mov dword_00XXXX, eax). From there we trace the source register (eax) back to where it was populated. If the register we are interested in is itself the destination of an earlier mov, we start tracing that instruction's source instead. This continues until our source register is eax and the previous call is GetProcAddress. Once we have found GetProcAddress we read the second argument (lpProcName), and now we can label or comment the call with the name of the API. An example of this can be seen in the Calling Global Variables image above. The second method is similar, except the call is to a local variable rather than a dword. An example of this can be seen in the Calling Locally Allocated Variables image from the MessageBoxA example. The third method is when LoadLibrary and GetProcAddress are handled within a function and the address is passed or returned back to the caller. This method does not follow a fixed pattern; each programmer can implement it in their own way. The function could return the address, save it in a buffer passed as an argument, and so on. This method gets more complicated because we have to reverse the function and understand how the address is being returned or saved off.
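To make the first method concrete, here is a minimal sketch of the backward trace that runs outside IDA on a toy instruction list. The `Insn` record, the helper names and the sample listing are all made up for illustration; they are not part of reiat.py or the IDA API. It assumes stdcall-style pushes and that only eax is affected by calls.

```python
from collections import namedtuple

# Hypothetical, simplified instruction record (not an IDA type).
Insn = namedtuple("Insn", ["mnem", "op0", "op1"])


def find_lp_proc_name(insns, call_index):
    """Return the operand of the second push found walking back from the call."""
    pushes = 0
    for insn in reversed(insns[:call_index]):
        if insn.mnem == "push":
            pushes += 1
            if pushes == 2:  # stdcall: lpProcName is pushed before hModule
                return insn.op0
    return None


def resolve_backward(insns, store_index):
    """Trace the source of `mov dword, reg` back to a GetProcAddress call."""
    tracked = insns[store_index].op1
    for i in range(store_index - 1, -1, -1):
        insn = insns[i]
        if insn.mnem == "call":
            if tracked != "eax":
                continue  # simplification: assume calls only change eax
            if insn.op0 == "GetProcAddress":
                return find_lp_proc_name(insns, i)
            return None  # eax was produced by some other call
        if insn.mnem == "mov" and insn.op0 == tracked:
            tracked = insn.op1  # follow the value to where it came from
    return None


# Toy listing: the string operand stands in for "offset aInternetopena".
listing = [
    Insn("push", "InternetOpenA", None),       # lpProcName
    Insn("push", "hModule", None),             # hModule
    Insn("call", "GetProcAddress", None),
    Insn("mov", "esi", "eax"),
    Insn("mov", "dword_1000F1E8", "esi"),
]

print(resolve_backward(listing, len(listing) - 1))
```

Starting from the final `mov dword_1000F1E8, esi`, the trace swaps esi for eax at the earlier mov, reaches the GetProcAddress call, and then reads the second push to get the API name.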

With these methods we only care about two things: the second argument of GetProcAddress and where the return value of GetProcAddress is called. Since the code does not calculate or manipulate the returned address (xor, mul, add, etc.), all we need to do is track when the data is moved around. The process of tracking how data is used is called data-flow analysis and is a fundamental concept in compilers. Note: I learned about this concept last week. I found it by asking myself that great question of "What would a person smarter than myself call this?". Wikipedia for the win. Data-flow analysis is an extremely powerful tool in static analysis. We can use it to estimate/infer how data will be used at run time and to trace where data originated. Okay, time for some examples and code.
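The idea of only following moves can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not the script itself: instructions are hypothetical `(mnemonic, operand0, operand1)` tuples, a single location is tracked at a time, and the check for the tracked location being overwritten is an extra that the script below does not perform.

```python
def track_forward(insns, call_index):
    """Follow the value returned in eax by the call at call_index.

    Returns ("global", name) when the value is stored in a dword,
    ("call", index) when the tracked location is called, or None.
    """
    tracked = "eax"
    for i in range(call_index + 1, len(insns)):
        mnem, op0, op1 = insns[i]
        if mnem == "mov" and op1 == tracked:
            if op0.startswith("dword_"):
                return ("global", op0)  # candidate for renaming
            tracked = op0  # the value now lives in a new location
        elif mnem == "mov" and op0 == tracked:
            return None  # tracked location was overwritten
        elif mnem == "call":
            if op0 == tracked:
                return ("call", i)  # candidate for a comment
            if tracked == "eax":
                return None  # another call replaced eax
    return None


# Hypothetical listings modelled on the two images described earlier.
local_listing = [
    ("call", "GetProcAddress", None),
    ("mov", "[ebp+var_8]", "eax"),
    ("push", "0", None),
    ("call", "[ebp+var_8]", None),
]
global_listing = [
    ("call", "GetProcAddress", None),
    ("mov", "dword_1000F1E8", "eax"),
]

print(track_forward(local_listing, 0))
print(track_forward(global_listing, 0))
```

Tracking one location keeps the sketch close to how the script swaps its tracked variable on each mov; a fuller analysis would keep a set of every location that currently holds the value.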

reiat.py is a script that uses data-flow to rename or add comments to calls to APIs that were loaded via run-time dynamic linking. If we were to run reiat.py on the MessageBoxA example, we would see the following comment where the function is called.
MessageBoxA Comment Added by reiat.py
It will track the use of the variable all the way to the end of the function. One example tracked the API after it had been moved five times and called a hundred bytes away. Personally I think this is pretty cool. Here is the change on the global variables.

dwords renamed to the APIs
One tricky example is when the second argument is pushed before a call to another function such as GetModuleHandleA. This technique is common in Zeus.
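A rough sketch of how this case can be handled when counting pushes backwards from GetProcAddress: every push is counted, and the arguments consumed by a nested call are subtracted again. The instruction tuples, operand names and the `NESTED_ARG_COUNTS` table are assumptions for illustration only, and the sketch assumes stdcall-style argument passing.

```python
# Assumed argument counts for calls that may sit between the pushes.
NESTED_ARG_COUNTS = {"GetModuleHandleA": 1}


def find_nth_push(insns, call_index, arg_position=2, max_insns=12,
                  nested_arg_counts=None):
    """Walk back from a call and return the operand of its Nth argument push."""
    if nested_arg_counts is None:
        nested_arg_counts = NESTED_ARG_COUNTS
    pushes = 0
    window = insns[max(0, call_index - max_insns):call_index]
    for mnem, operand in reversed(window):
        if mnem == "push":
            pushes += 1
            if pushes == arg_position:
                return operand
        elif mnem == "call":
            # Discount the pushes that belong to the nested call.
            pushes -= nested_arg_counts.get(operand, 0)
    return None


# Hypothetical listing: lpProcName is pushed before GetModuleHandleA runs.
zeus_style = [
    ("push", "aProcName"),         # lpProcName for GetProcAddress
    ("push", "aKernel32"),         # lpModuleName for GetModuleHandleA
    ("call", "GetModuleHandleA"),
    ("push", "eax"),               # hModule for GetProcAddress
    ("call", "GetProcAddress"),
]

print(find_nth_push(zeus_style, 4))
```

Without the discount, the walk would stop at the module name pushed for GetModuleHandleA and mistake it for lpProcName.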

Time for some examples of where the script fails. If the script fails, the output will contain an Error message with the API name and location. This should help with tracking it down. Here are two examples.

Odds are I will continue working on data-flow analysis in IDA. If you know of any good papers or code, or have some comments, please send me an email (address in the source code below), ping me on twitter or leave a comment. I'm still working on parts of the code but I'm hopeful it will be useful to others.

BitBucket Repository - LINK

Please download the code from the repo link. The copy below will not be updated with new versions.

'''
Name:
        reiat.py

Version:
        0.2

Description:
        Renames and adds comments to APIs that are called via run-time dynamic linking in IDA.
        To execute the script, just run it in IDA.

Author:
        alexander<dot>hanel<at>gmail<dot>com

License:
reiat.py is free software: you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program.  If not, see
<http://www.gnu.org/licenses/>.

'''

from idaapi import *
import idautils
import idc

class getProcAddresser():
    def __init__(self):
        self.getProcAddressRefs = []
        self.registers = ['eax', 'ebx', 'ecx', 'edx', 'esi', 'edi', 'esp', 'ebp']

    def getRefs(self):
        'get all addresses of GetProcAddress'
        for addr in CodeRefsTo(LocByName("GetProcAddress"), 0):
            self.getProcAddressRefs.append(addr)

    def getlpProcName(self, GetProcAddress):
        'returns the address of the 2nd argument to GetProcAddress'
        pushcount = 0
        argPlacement = 2
        instructionMax = 10 + argPlacement
        currAddress = PrevHead(GetProcAddress,minea=0)
        while pushcount <= argPlacement and instructionMax != 0:
            if 'push' in GetDisasm(currAddress):
                pushcount += 1
                if pushcount == argPlacement:
                    return currAddress
            if 'GetModuleHandle' in GetDisasm(currAddress):
                    pushcount -= 1
            instructionMax -= 1
            currAddress = PrevHead(currAddress,minea=0)
        return None

    def getString(self, address):
        'reads the string value that is the second push'
        # note it will be useful to include back tracing code for variable reference. 
        # operand 0 of the push holds the string address; ASCSTR_C is the string type
        api = GetString(GetOperandValue(address, 0), -1, ASCSTR_C)
        if api == None:
            return None
        else:
            return api  

    def traceBack(self, address):
        funcStart = GetFunctionAttr(address, FUNCATTR_START)
        var = GetOpnd(address, 0)
        # return if digit is being pushed, likely an error on parsing
        if var.isdigit():
            return None
        # return, value is not being passed as a register. already checked
        # if offset in calling function
        if var not in self.registers:
            return None
        # get next address
        currentAddress = PrevHead(address)
        # get dism 
        dism = GetDisasm(currentAddress)
        # walk backwards until the start of the function
        # Example:
        # mov ebp, offset aInternetconnec ; "InternetConnectA"
        # push    ebp
        while(currentAddress >= funcStart):
            # var = 'ebp', 
            if var in dism:
                # if operand == ebp, our tracked var is the destination
                if GetOpnd(currentAddress,0) == var:
                    mnem = GetMnem(currentAddress)
                    # if our tracked var is having something moved into it
                    if 'mov' in mnem or 'lea' in mnem:
                        # 4 scenarios on mov: string, digit, register unknown..
                        # 1. Check if destination is a string
                        # read operand 1 value, get address of "offset aInternetconnec"
                        value = GetOperandValue(currentAddress,1)
                        if value != None:
                            api = GetString(value, -1)
                            if api != None:
                                return api
                        # 2. Check if register
                        var = GetOpnd(currentAddress,1)
                        # 3. Check if digit
                        if var.isdigit() == True:
                            return None
                        # 4. Unknown
                        if var == None:
                            return None
                        
            currentAddress = PrevHead(currentAddress)
            dism = GetDisasm(currentAddress)
        return None

    def traceForwardRename(self, address, apiString):
        'address is call GetProcAddress, apiString is the API name'
        currentAddress = NextHead(address)
        funcEnd = GetFunctionAttr(address,  FUNCATTR_END)
        var = 'eax'
        lastref = ''
        lastrefAddress = None
        while currentAddress < funcEnd:
            dism = GetDisasm(currentAddress)
            # if we are not referencing the return from GetProcAddress
            # continue to the next instruction
            if var not in dism:
                currentAddress = NextHead(currentAddress)
                continue
            #   mov     dword_1000F224, eax
            #   call    esi ; GetProcAddress
            #   push    offset aHttpaddreque_0 ; "HttpAddRequestHeadersW"
            #   push    dword_1000FD08  ; hModule
            #   mov     dword_1000F228, eax 
            # if we have the above instructions after GetProcAddress the code
            # is saving off the address of HttpAddRequestHeadersW.  
            if GetMnem(currentAddress) == 'mov' and GetOpnd(currentAddress,1) == var and GetOpType(currentAddress,0) == 2:
                # rename dword address
                status = MakeNameEx(GetOperandValue(currentAddress,0), apiString, SN_NOWARN)
                if status == False:
                    # some api names are already in use. Will need to be renamed to something generic.
                    # IDA will typically add a number to the function or api name. GetProcAddress_0
                    status = MakeNameEx(GetOperandValue(currentAddress,0), str("__" + apiString), SN_NOWARN)
                    if status == False:
                        return None
                return currentAddress
            # tracked data is being moved into another destination
            if GetMnem(currentAddress) == 'mov' and GetOpnd(currentAddress,1) == var:
                lastref = var
                lastrefAddress = currentAddress
                var = GetOpnd(currentAddress,0)
            # add comments for call var
            # example:
            # call    ds:GetProcAddress
            # ...
            # call    eax
            if GetMnem(currentAddress) == 'call' and GetOpnd(currentAddress,0) == var:
                # read the existing comment on this instruction (empty if none)
                cmt = Comment(currentAddress) or ''
                if apiString not in cmt:
                    cmt = cmt + ' ' + apiString
                    MakeComm(currentAddress, cmt)
                    return currentAddress
            
            # eax is usually overwritten by the return value of the next call
            if GetMnem(currentAddress) == 'call' and var == 'eax':
                return None
            currentAddress = NextHead(currentAddress)
        return None
    
    def rename(self):
        self.getRefs()
        for addr in self.getProcAddressRefs:
            lpProcNameAddr = self.getlpProcName(addr)
            if lpProcNameAddr == None:
                print "ERROR: Address of lpProcName at %s was not found" % hex(addr)
                continue
            lpProcName =  self.getString(lpProcNameAddr)
            if lpProcName == None:
                lpProcName = self.traceBack(lpProcNameAddr)
            if lpProcName == None:
                print "ERROR: String of lpProcName at %s was not found" % hex(addr)
                continue
            status = self.traceForwardRename(addr, lpProcName)
            if status == None:
                print "ERROR: Could not rename address at %s " % hex(addr)
                continue
            else:
                print "RENAMED %s at %s" % ( lpProcName, hex(status))
 

if __name__ == "__main__":
    ok = getProcAddresser()
    ok.rename()
        

2 comments:

  1. Thank you for this useful script.
    Feature request: Oftentimes, some DLLs stash away pointers using EncodePointer() and later (at time of use) call DecodePointer() to get the real pointer. It may be possible to enhance your script to "see through" this encode-decode step and annotate the final decoded pointer with the real function name.

  2. Nicely done, very useful, thanks for the script ;)
