This post focuses on analyzing Microsoft Portable Executable (PE) files using Python, pefile, PEiD signatures and a command-line anti-virus scanner. It shows the Python code that produces the PE information displayed beneath the anti-virus scanner results on a VirusTotal results page. It also contains the code to call a command-line anti-virus scanner and display its results.
Python still feels new to me, so please feel free to email me or leave a comment if you have any recommendations or advice on the code. The full source code can be found here.
Let's start at the top of the code with the imports.
The only non-standard modules used are pefile and peutils. Both were created by Ero Carrera and can be found here. For anyone new to Python, modules are files containing statements and definitions that other code can import. Please see this article for more details.
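Before running the script it can help to confirm that the modules are installed. The following is a minimal sketch in Python 3 syntax, not part of the original script. The list of standard-library modules is inferred from the names used in the listings in this post, and the helper name `missing_modules` is hypothetical.

```python
import importlib.util

# Standard-library modules the listings in this post appear to use
# (inferred from the code, not copied from the original import block).
STDLIB_MODULES = ["math", "os", "sys", "subprocess", "datetime"]
# Third-party modules named in the text.
THIRD_PARTY_MODULES = ["pefile", "peutils"]


def missing_modules(names):
    """Return the entries of `names` that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(STDLIB_MODULES + THIRD_PARTY_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All modules are available.")
```

If pefile or peutils is reported as missing, it has to be installed before the rest of the code will run.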
Below is the first function that will be called. The function "attributes" extracts basic information from the Portable Executable. The Portable Executable is a file format that the Windows operating system uses to encapsulate data and code. The Windows OS loader uses this data structure to manage the wrapped executable code.
## Print PE file attributes
def attributes(pe):
    # pe is the pefile.PE object created in main
    print "Image Base:", hex(pe.OPTIONAL_HEADER.ImageBase)
    print "Address Of Entry Point:", hex(pe.OPTIONAL_HEADER.AddressOfEntryPoint)
    machine = pe.FILE_HEADER.Machine
    print "Required CPU type:", pefile.MACHINE_TYPE[machine]
    # pefile exposes each Characteristics flag as a boolean attribute
    dll = pe.FILE_HEADER.IMAGE_FILE_DLL
    print "DLL:", dll
    print "Subsystem:", pefile.SUBSYSTEM_TYPE[pe.OPTIONAL_HEADER.Subsystem]
    print "Compile Time:", datetime.datetime.fromtimestamp(pe.FILE_HEADER.TimeDateStamp)
    print "Number of RVA and Sizes:", pe.OPTIONAL_HEADER.NumberOfRvaAndSizes
Above we are using pefile to read and display fields from the PE headers. Before pefile can be used, the executable needs to be loaded into memory. This will be covered when Main() is discussed.
Portable Executable Information
Image Base: 0x400000
Address Of Entry Point: 0x78020
Required CPU type: IMAGE_FILE_MACHINE_I386
Compile Time: 2007-04-29 05:43:12
Number of RVA and Sizes: 16
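To give a feel for what pefile does when it reads these fields, here is a minimal sketch (Python 3, standard library only) that locates the PE signature and unpacks the COFF file header by hand. The offsets and field order follow the PE/COFF layout. The function names and the synthetic header built for the demo are hypothetical and are not taken from the original script.

```python
import struct

COFF_FIELDS = ("Machine", "NumberOfSections", "TimeDateStamp",
               "PointerToSymbolTable", "NumberOfSymbols",
               "SizeOfOptionalHeader", "Characteristics")


def read_coff_header(data):
    """Locate the PE signature and unpack the 20-byte COFF file header."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # e_lfanew, at offset 0x3C of the DOS header, points to the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    # The COFF file header starts right after the 4-byte signature.
    values = struct.unpack_from("<HHIIIHH", data, pe_offset + 4)
    return dict(zip(COFF_FIELDS, values))


def build_sample_header():
    """Build a tiny synthetic header for the demo (not a loadable executable)."""
    dos = bytearray(0x40)
    dos[0:2] = b"MZ"
    struct.pack_into("<I", dos, 0x3C, 0x40)  # PE signature follows the DOS header
    coff = struct.pack("<HHIIIHH", 0x014C, 3, 0, 0, 0, 0xE0, 0x010F)
    return bytes(dos) + b"PE\x00\x00" + coff


header = read_coff_header(build_sample_header())
print(hex(header["Machine"]), header["NumberOfSections"])
```

In practice pefile is the better choice because it also parses the optional header, the sections and the data directories. The sketch only shows that the values are plain integers stored at fixed positions.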
Attributes from the PE headers can be valuable when analyzing an unknown executable. The first output line is the Image Base of the executable. The default value is 0x00400000 for an application and 0x10000000 for a DLL. A non-standard value is a possible red flag and should be noted.
The second output is the address of the entry point. The entry point is the starting address for an executable and the address of the initialization function for a device driver; for a DLL it is optional. The value is stored relative to the image base. In an unpacked executable the entry point usually lies inside the first code section, which commonly starts at 0x1000. If the value falls outside that section it should be noted, because the entry point may have been changed by a packer or obfuscation tool.
The third output is the Machine Type. This can be used to identify whether the file is 32-bit or 64-bit. If the Machine Type value is IMAGE_FILE_MACHINE_I386, the executable is 32-bit. For a 64-bit executable the Machine Type value would be IMAGE_FILE_MACHINE_AMD64 or IMAGE_FILE_MACHINE_IA64.
The fourth output simply identifies whether the executable is a DLL.
The fifth output identifies the subsystem type. In the example above, IMAGE_SUBSYSTEM_WINDOWS_GUI identifies that the executable has a Windows GUI. If the subsystem were IMAGE_SUBSYSTEM_WINDOWS_CUI, the executable would be a console application. The third common subsystem type is IMAGE_SUBSYSTEM_NATIVE, which is reserved for drivers and native system processes. As a side note, an Xbox executable has the subsystem type IMAGE_SUBSYSTEM_XBOX.
The sixth line of output is the time the executable was compiled.
The last line of output displays NumberOfRvaAndSizes. The default value for this attribute is 0x10, or 16 in decimal. Modifying this value can be used to crash OllyDbg. If the value is non-standard and OllyDbg gives an error, patching this field might be needed. All of these values are helpful because they give clues about whether anything is non-standard in the executable. Non-standard values could come from a packer, a code obfuscation tool, or compiler flags passed by the programmer.
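The checks described above can be collected into a small helper. This is a minimal sketch in Python 3, not part of the original script. The function name `review_attributes` and its parameters are hypothetical, the default values are the ones quoted in this post (they apply to 32-bit images), and the machine constants are the numeric values behind the names pefile prints.

```python
# Numeric machine types from the PE/COFF specification, mapped to word size.
MACHINE_BITNESS = {
    0x014C: 32,  # IMAGE_FILE_MACHINE_I386
    0x8664: 64,  # IMAGE_FILE_MACHINE_AMD64
    0x0200: 64,  # IMAGE_FILE_MACHINE_IA64
}

DEFAULT_IMAGE_BASE_EXE = 0x00400000
DEFAULT_IMAGE_BASE_DLL = 0x10000000
DEFAULT_NUMBER_OF_RVA_AND_SIZES = 16


def review_attributes(image_base, is_dll, machine, number_of_rva_and_sizes):
    """Return a list of notes about values that differ from the usual defaults."""
    notes = []
    expected_base = DEFAULT_IMAGE_BASE_DLL if is_dll else DEFAULT_IMAGE_BASE_EXE
    if image_base != expected_base:
        notes.append("non-default image base %#x" % image_base)
    if machine not in MACHINE_BITNESS:
        notes.append("unrecognized machine type %#x" % machine)
    if number_of_rva_and_sizes != DEFAULT_NUMBER_OF_RVA_AND_SIZES:
        notes.append("NumberOfRvaAndSizes is %d, not %d"
                     % (number_of_rva_and_sizes, DEFAULT_NUMBER_OF_RVA_AND_SIZES))
    return notes


# Values taken from the sample output in this post: nothing is flagged.
print(review_attributes(0x400000, False, 0x014C, 16))
```

An empty list only means that these particular fields look ordinary. As the sample shows, a packed file can still have default header values, so the section and import checks remain necessary.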
def sections_analysis(pe):
    print "Number of Sections:", pe.FILE_HEADER.NumberOfSections
    print "Section  VirtualAddress VirtualSize SizeofRawData Entropy"
    for section in pe.sections:
        # section.data holds the raw bytes of the section; E() is the entropy function defined later
        print "%-8s" % section.Name, "%-14s" % hex(section.VirtualAddress), "%-11s" % hex(section.Misc_VirtualSize),\
            "%-13s" % section.SizeOfRawData, "%.2f" % E(section.data)
The next function called is "sections_analysis". It outputs each section's name, virtual address, virtual size, size of raw data and entropy. This information is valuable because packers will sometimes rename sections, create new sections, or modify other section attributes. For a detailed analysis of how packers work, please see Websense's "The History of Packing Technology".
Number of Sections: 3
Section  VirtualAddress VirtualSize SizeofRawData Entropy
UPX0     0x1000         0x42000     0             0.00
UPX1     0x43000        0x36000     217600        7.93
.rsrc    0x79000        0x2000      7680          4.00
The above output hints at the type of packer used just by reviewing the section names. In the case of the test executable, Putty.exe was compressed using UPX, and the string UPX can be observed in the section names. Two other attributes that should be mentioned are the zero raw-data size of UPX0 and the high entropy of UPX1. During execution of a packed executable, UPX needs to write the uncompressed data to a section. The section UPX0, which takes up no space on disk but reserves 0x42000 bytes of virtual memory, is used for storing the uncompressed data. Not all packers use this technique, but it is another characteristic that should be noted.
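The two observations above can be turned into a simple check. This is a minimal sketch in Python 3 that works on plain tuples rather than pefile objects. The function name `suspicious_sections` and the 7.0 entropy cut-off are assumptions chosen for illustration, and the sample data is copied from the output above.

```python
HIGH_ENTROPY = 7.0  # assumed cut-off, for illustration only


def suspicious_sections(sections, entropy_threshold=HIGH_ENTROPY):
    """Return (name, reason) pairs for sections that match packer-like traits.

    `sections` is an iterable of (name, virtual_size, raw_size, entropy).
    """
    findings = []
    for name, virtual_size, raw_size, entropy in sections:
        if raw_size == 0 and virtual_size > 0:
            findings.append((name, "no raw data but reserves virtual memory"))
        if entropy >= entropy_threshold:
            findings.append((name, "high entropy"))
    return findings


# Values from the Putty.exe sample shown in this post.
sections = [
    ("UPX0", 0x42000, 0, 0.00),
    ("UPX1", 0x36000, 217600, 7.93),
    (".rsrc", 0x2000, 7680, 4.00),
]

for name, reason in suspicious_sections(sections):
    print(name, "->", reason)
```

A section with no raw data is not proof of packing on its own, since sections for uninitialized data can look the same, so the result is best read as a hint alongside the section names.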
## Entropy calculation from Ero Carrera's blog ###############
def E(data):
    entropy = 0
    if not data:
        return 0
    for x in range(256):
        # fraction of the data made up of byte value x
        p_x = float(data.count(chr(x)))/len(data)
        if p_x > 0:
            entropy += - p_x*math.log(p_x, 2)
    return entropy
Entropy is a measurement of how organized or disorganized data is. The more random the data, the higher the entropy. Packers apply an algorithm to either compress or obfuscate the data, so packed data has a higher entropy and is more random than the original compiled executable. The "E" function returns a value between 0.0 and 8.0. The closer the entropy is to 8.0, the higher the chance that the section is packed or obfuscated. Please see here for more details.
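For readers on Python 3, where section data is a bytes object, the same Shannon entropy calculation (the sum of -p * log2(p) over all byte values that occur) can be sketched as follows. The function name `shannon_entropy` is hypothetical. The sketch counts bytes in a single pass with `collections.Counter` instead of calling `count` 256 times.

```python
import math
from collections import Counter


def shannon_entropy(data):
    """Return the entropy of a bytes object in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    entropy = 0.0
    for count in Counter(data).values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy


print("%.2f" % shannon_entropy(b"\x00" * 1024))          # a single repeated byte
print("%.2f" % shannon_entropy(bytes(range(256)) * 4))   # every byte value equally often
```

The first call shows the lower bound (one repeated byte carries no information) and the second the upper bound (all 256 values equally likely).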
## Load PEID userdb.txt database and scan file
signatures = peutils.SignatureDatabase('userdb.txt')
matches = signatures.match_all(pe,ep_only = True)
print "PEID Signature Match(es): ", matches
PEiD is a tool for detecting common packers, crypters and compilers for PE files. Identifying the algorithm used to compress or obfuscate a PE file can help speed up analysis. Searching Google for the name of the packer plus the string "tutorial" will usually return an analysis of how to unpack the file. pefile can use PEiD's user database (userdb.txt) to scan a PE file. The main PEiD user database was created by BobSoft and can be downloaded here. Many individuals use the Panda Anti-Virus userdb.txt. A warning to users of Panda's userdb.txt: pefile will throw exceptions and will not work because the file contains some non-standard characters. It's recommended to use BobSoft's userdb.txt.
PEID Signature Match(es): [['UPX 2.90 [LZMA] -> Markus Oberhumer, Laszlo Molnar & John Reiser']]
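A userdb.txt entry pairs a name in square brackets with a line of hex bytes, where ?? stands for a byte that may have any value, and an ep_only setting that says whether the pattern is only compared at the entry point. The sketch below (Python 3) shows the idea behind that comparison. It is a simplified illustration and not how peutils is implemented: the helper names are hypothetical, the pattern is a short made-up example rather than an entry copied from the database, and only whole-byte wildcards are handled.

```python
def parse_signature(text):
    """Convert hex text such as '60 BE ?? 8D' into a list; None marks a wildcard."""
    return [None if token == "??" else int(token, 16) for token in text.split()]


def matches_at(data, offset, pattern):
    """Return True if `pattern` matches the bytes of `data` starting at `offset`."""
    window = data[offset:offset + len(pattern)]
    if len(window) < len(pattern):
        return False  # not enough bytes left to compare
    return all(expected is None or expected == actual
               for expected, actual in zip(pattern, window))


# Illustrative pattern and bytes, made up for this example.
pattern = parse_signature("60 BE ?? ?? ?? ?? 8D BE")
entry_point_bytes = bytes([0x60, 0xBE, 0x00, 0x30, 0x44, 0x00, 0x8D, 0xBE, 0x00])

print(matches_at(entry_point_bytes, 0, pattern))
```

With ep_only set, the comparison is made once at the entry point, as in `matches_at(data, entry_point_offset, pattern)`. Without it, every offset in the file would have to be tried, which is slower and more prone to false positives.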
The last function is called IAT(). This function displays all imported DLLs and the API names imported from each.
## Dump Imports
def IAT(pe):
    print "Imported DLLS:"
    i = 1
    for entry in pe.DIRECTORY_ENTRY_IMPORT:
        first = True  ## For formatting
        print "%2s" % [i], "%-17s" % entry.dll
        for imp in entry.imports:
            if first:
                print "%-1s" % imp.name,
                first = False
            else:
                sys.stdout.write("%s%s" % (", ", imp.name))  # Python print adds a blank space
        print  # end the line of API names for this DLL
        i += 1
Reviewing all API calls can help give a high-level view of the behavior of the executable. If the API list is sparse, this might be a sign that the executable has had its import table removed. The standard "Hello, World" compiled in C using LCC would contain 19 API names. If only three APIs are listed, such as LoadLibrary, GetProcAddress and ExitProcess, odds are the file is packed or obfuscated in some manner.
LoadLibraryA, GetProcAddress, ExitProcess
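The sparse-import heuristic described above can be expressed as a small check. This is a minimal sketch in Python 3 that takes the API names as plain strings. The function name `looks_stripped` and the limit of five imports are assumptions chosen for illustration, not values from the original script.

```python
def looks_stripped(import_names, max_imports=5):
    """Guess whether an import list has been reduced to a loader stub.

    Returns True when there are very few imports and they include the two
    APIs needed to resolve everything else at run time.
    """
    names = {name for name in import_names if name}
    has_loader = any(name.startswith("LoadLibrary") for name in names)
    has_resolver = "GetProcAddress" in names
    return len(names) <= max_imports and has_loader and has_resolver


# The three names from the sample output above.
print(looks_stripped(["LoadLibraryA", "GetProcAddress", "ExitProcess"]))
```

A True result is a hint rather than a verdict, since a very small hand-written program could import the same functions.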
The final function calls a local command line anti-virus scanner.
## Print Sophos
print "Sophos Scan in progress.."
path = os.path.abspath(filetmp)
pwd = os.getcwd()
# subprocess.call returns the scanner's exit code; the report itself is written straight to the console
output = subprocess.call([os.path.join(pwd, 'cmd_scan', 'Sophos', 'SAV32CLI.EXE'), path])
Sophos Scan in progress..
Version 4.52.0 [Win32/Intel]
Virus data version 4.52E, April 2010
Includes detection for 1542982 viruses, trojans and worms
Copyright (c) 1989-2010 Sophos Plc. All rights reserved.
System time 21:36:32, System date 03 May 2010
1 file swept in 3 seconds.
No viruses were discovered.
Ending Sophos Anti-Virus.
This function will need to be copied and modified for each scanner. There are a few free anti-virus scanners. The function above uses Sophos's free command-line scanner, which can be downloaded here. Other free scanners include ClamAV, Panda and a couple of others. ClamAV can be used as a portable application and does not need to be fully installed. The portable ClamAV can be downloaded here. Most commercial anti-virus applications also have a command-line interface.
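Rather than copying the function for each scanner, the scanner's command line can be passed in as a parameter. The following is a minimal sketch in Python 3, not the original script's approach: the function name `run_scanner` is hypothetical, and the demo uses the Python interpreter as a stand-in for a scanner so that no scanner-specific options have to be assumed.

```python
import subprocess
import sys


def run_scanner(command, target, timeout=300):
    """Run a command-line scanner on `target` and return (exit_code, report).

    `command` is the scanner executable plus any options, as a list.
    """
    completed = subprocess.run(
        list(command) + [target],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,   # keep error messages with the report
        universal_newlines=True,    # decode the output to text
        timeout=timeout,
    )
    return completed.returncode, completed.stdout


# Stand-in "scanner": a Python one-liner that echoes the file it was given.
fake_scanner = [sys.executable, "-c", "import sys; print('scanned', sys.argv[1])"]
code, report = run_scanner(fake_scanner, "sample.exe")
print(code, report.strip())
```

Capturing the report instead of letting it go to the console makes it possible to search the text for detections. The meaning of the exit code differs between scanners, so it is worth checking each scanner's documentation before relying on it.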
if len(sys.argv) < 2:
    print "Usage: python %s <PE file>" % sys.argv[0]
    sys.exit(1)
exename = sys.argv[1]  # sys.argv[0] is the script name
pe = pefile.PE(exename)
print "\nPortable Executable Information"
This script needs to have a Portable Executable file passed to it. Once a file is passed, pefile will load the executable and the script will then call all the functions.
In closing, this script is a quick example of using Python to analyze an unknown executable file. The current version is 0.01. Over time I'll keep updating the script. As previously stated, if there are any questions or ideas please leave a comment.
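The argument check can also be left to the standard library. This is a minimal sketch in Python 3 using argparse, which prints a usage message and exits on its own when the file argument is missing. The function name `parse_args` and the description text are hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; `argv` defaults to sys.argv[1:] when None."""
    parser = argparse.ArgumentParser(
        description="Print basic information about a PE file.")
    parser.add_argument("exename", help="path to the PE file to analyze")
    return parser.parse_args(argv)


args = parse_args(["putty.exe"])
print("File to analyze:", args.exename)
```

In the real script, `args.exename` would be handed to `pefile.PE()` in place of the value read from `sys.argv`.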