Creating Your Own Virustotal..Well Kind Of..Ok, Not Really

Virustotal is a free website that allows individuals to upload files to be scanned by 41 anti-virus engines and file identification tools. Virustotal was created in 2004 by Hispasec Sistemas. It is a very valuable tool for system admins, technical support engineers or anyone else who is curious whether a file is malicious. The main downside to Virustotal is that you have to submit your samples through a web interface or email. Sometimes it's nicer to have a quick report from the command line, and in certain situations you can't submit the samples at all.

This post focuses on analyzing Microsoft Portable Executable (PE) files using Python, Pefile, PEID and a command line anti-virus scanner. It shows the Python code for producing the PE information displayed under the anti-virus scanner results on the Virustotal results page. It also contains the code to call a command line anti-virus scanner and display its results.

Python still feels new to me so please feel free to email me or leave a comment if you have any recommendations or advice on the code. The full source code can be found here.

Let's start at the top of the code with the imports.

import sys
import os
import pefile
import peutils
import math
import time
import datetime
import subprocess

The only non-standard modules listed above are Pefile and Peutils. Both were created by Ero Carrera and can be found here. For anyone new to Python, modules are files containing statements and definitions that can be imported to reuse their functionality. Please see this article for more details.

Below is the first function that will be called. The function "attributes" extracts basic information from the Portable Executable. The Portable Executable is a file format that the Windows operating system uses to encapsulate data and code. The Windows OS loader uses this data structure to manage the wrapped executable code.

## Print PE file attributes
def attributes():
    print "Image Base:", hex(pe.OPTIONAL_HEADER.ImageBase)
    print "Address Of Entry Point:", hex(pe.OPTIONAL_HEADER.AddressOfEntryPoint)
    machine = pe.FILE_HEADER.Machine
    print "Required CPU type:", pefile.MACHINE_TYPE[machine]
    print "DLL:", dll
    print "Subsystem:", pefile.SUBSYSTEM_TYPE[pe.OPTIONAL_HEADER.Subsystem]
    print "Compile Time:", datetime.datetime.fromtimestamp(pe.FILE_HEADER.TimeDateStamp)
    print "Number of RVA and Sizes:", pe.OPTIONAL_HEADER.NumberOfRvaAndSizes

Above we are using PeFile to read and display data from the PE format. Before using Pefile the executable will need to be loaded into memory. This will be covered when Main() is discussed.


Portable Executable Information
Image Base: 0x400000
Address Of Entry Point: 0x78020
Required CPU type: IMAGE_FILE_MACHINE_I386
DLL: False
Subsystem: IMAGE_SUBSYSTEM_WINDOWS_GUI
Compile Time: 2007-04-29 05:43:12
Number of RVA and Sizes: 16
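
For anyone curious where Pefile finds these values, here is a minimal sketch (Python 3, standard library only) that reads the same fields straight from the raw bytes with struct. It is not how Pefile is implemented. The names read_pe_attributes and build_fake_pe are hypothetical, and the synthetic header exists only so the example runs without a real sample. The offsets follow the published PE/COFF header layout.

```python
import struct

IMAGE_FILE_DLL = 0x2000  # Characteristics bit that marks a DLL


def read_pe_attributes(data):
    """Return a dict of header fields parsed from raw PE bytes."""
    if data[:2] != b"MZ":
        raise ValueError("not an MZ executable")
    # e_lfanew (offset 0x3C) holds the file offset of the 'PE\0\0' signature
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("PE signature not found")
    # The 20-byte COFF file header follows the 4-byte signature
    (machine, n_sections, timestamp, _symtab, _nsyms,
     _opt_size, characteristics) = struct.unpack_from("<HHIIIHH", data, pe_offset + 4)
    opt = pe_offset + 24  # start of the optional header
    (magic,) = struct.unpack_from("<H", data, opt)
    if magic == 0x10B:      # PE32
        (image_base,) = struct.unpack_from("<I", data, opt + 28)
        rva_count_offset = opt + 92
    elif magic == 0x20B:    # PE32+
        (image_base,) = struct.unpack_from("<Q", data, opt + 24)
        rva_count_offset = opt + 108
    else:
        raise ValueError("unknown optional header magic")
    (entry_point,) = struct.unpack_from("<I", data, opt + 16)
    (subsystem,) = struct.unpack_from("<H", data, opt + 68)
    (rva_count,) = struct.unpack_from("<I", data, rva_count_offset)
    return {
        "image_base": image_base,
        "entry_point": entry_point,
        "machine": machine,
        "is_dll": bool(characteristics & IMAGE_FILE_DLL),
        "subsystem": subsystem,
        "timestamp": timestamp,
        "number_of_sections": n_sections,
        "number_of_rva_and_sizes": rva_count,
    }


def build_fake_pe():
    """Build a header-only PE32 image for demonstration (not a runnable file)."""
    buf = bytearray(0x200)
    buf[0:2] = b"MZ"
    struct.pack_into("<I", buf, 0x3C, 0x80)           # e_lfanew
    buf[0x80:0x84] = b"PE\0\0"
    struct.pack_into("<HHIIIHH", buf, 0x84,
                     0x14C, 3, 1177825392, 0, 0, 0xE0, 0x010F)
    opt = 0x80 + 24
    struct.pack_into("<H", buf, opt, 0x10B)           # PE32 magic
    struct.pack_into("<I", buf, opt + 16, 0x78020)    # AddressOfEntryPoint
    struct.pack_into("<I", buf, opt + 28, 0x400000)   # ImageBase
    struct.pack_into("<H", buf, opt + 68, 2)          # Subsystem (GUI)
    struct.pack_into("<I", buf, opt + 92, 16)         # NumberOfRvaAndSizes
    return bytes(buf)


attrs = read_pe_attributes(build_fake_pe())
print(hex(attrs["image_base"]), hex(attrs["entry_point"]), attrs["is_dll"])
```

For real work Pefile remains the better choice, since it also handles sections, imports and malformed headers.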

Attributes from the PE can give valuable information when analyzing an unknown executable. The first output line is the Image Base of the executable. The default value is 0x00400000 for an application and 0x10000000 for a DLL. A non-standard value is a possible red flag and should be noted.

The second output is the address of the entry point. For an executable this is the starting address, for a device driver it is the address of the initialization function, and for a DLL the entry point is optional. In a typical unpacked executable the entry point falls inside the code section, which usually starts at 0x01000. An entry point well outside that section should be noted because it may have been changed by a packer or obfuscation tool.

The third output is the Machine Type. This can be used to identify whether the file is 32-bit or 64-bit. If the Machine Type value is IMAGE_FILE_MACHINE_I386 then the executable is 32-bit. For a 64-bit executable the Machine Type value would be IMAGE_FILE_MACHINE_AMD64 or IMAGE_FILE_MACHINE_IA64.
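
As a small illustration of that mapping (Python 3, standard library only), the numeric Machine values can be turned into a name and a bitness with a lookup table. The function name describe_machine is hypothetical; the three constants are the values the PE/COFF specification assigns to these machine types.

```python
# Machine values as defined in the PE/COFF specification
MACHINE_BITS = {
    0x014C: ("IMAGE_FILE_MACHINE_I386", 32),
    0x8664: ("IMAGE_FILE_MACHINE_AMD64", 64),
    0x0200: ("IMAGE_FILE_MACHINE_IA64", 64),
}


def describe_machine(machine):
    """Return (name, bits); bits is None for values not in the table."""
    return MACHINE_BITS.get(machine, ("unknown (0x%04X)" % machine, None))


name, bits = describe_machine(0x014C)
print(name, bits)
```

The table is deliberately short. Other machine types exist, which is why unknown values are reported rather than treated as errors.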

The fourth output simply identifies whether the executable is a DLL.

The fifth output identifies the subsystem type. In the example above IMAGE_SUBSYSTEM_WINDOWS_GUI identifies that the executable has a Windows GUI. If the subsystem were IMAGE_SUBSYSTEM_WINDOWS_CUI it would mean the executable is a console application. The third common subsystem type is IMAGE_SUBSYSTEM_NATIVE, which is reserved for drivers and native system processes. As a side note, an Xbox executable would have the subsystem type IMAGE_SUBSYSTEM_XBOX.
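
The subsystem values mentioned above can be sketched the same way (Python 3, standard library only). The function name describe_subsystem and the short descriptions are my own wording; the numeric values are the ones the PE/COFF specification gives for these four subsystems.

```python
# Subsystem values from the PE/COFF specification (only the ones discussed here)
SUBSYSTEMS = {
    1: ("IMAGE_SUBSYSTEM_NATIVE", "driver or native system process"),
    2: ("IMAGE_SUBSYSTEM_WINDOWS_GUI", "Windows GUI application"),
    3: ("IMAGE_SUBSYSTEM_WINDOWS_CUI", "console application"),
    14: ("IMAGE_SUBSYSTEM_XBOX", "Xbox executable"),
}


def describe_subsystem(value):
    """Return (name, description) for a Subsystem value."""
    return SUBSYSTEMS.get(value, ("unknown (%d)" % value, "not in this table"))


print("%s: %s" % describe_subsystem(2))
```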

The sixth line of output is the time the executable was compiled.
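
One detail worth knowing about the compile time: TimeDateStamp is stored as seconds since 1970-01-01 00:00 UTC, and datetime.fromtimestamp() without a timezone converts it to the local time of the analysis machine. A minimal sketch (Python 3, standard library only; the function name compile_time_utc is hypothetical) that always reports UTC instead:

```python
from datetime import datetime, timezone


def compile_time_utc(timestamp):
    """Convert a PE TimeDateStamp (seconds since the Unix epoch) to a UTC datetime."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


print(compile_time_utc(0).isoformat())
```

Reporting UTC makes results comparable between analysts in different time zones. Keep in mind that the field is written by the linker and can be changed afterwards, so it is a hint rather than proof.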

The last line of output displays the NumberOfRvaAndSizes. The default value for this attribute is 0x10, or 16 in decimal. Modifying this value can be used to crash Ollydbg. If the value is non-standard and Ollydbg gives an error, patching this field might be needed. All of these values are helpful because they give clues as to whether anything about the executable is non-standard. Non-standard values could come from a packer, a code obfuscation tool or the programmer passing compiler flags.
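
The defaults discussed above can be collected into a quick sanity check. This is only a sketch (Python 3, standard library only): the function name header_flags is hypothetical, and the defaults are the ones quoted in this post, so a flag means "look closer", not "malicious".

```python
DEFAULT_IMAGE_BASE_EXE = 0x00400000
DEFAULT_IMAGE_BASE_DLL = 0x10000000
DEFAULT_RVA_COUNT = 16


def header_flags(image_base, is_dll, number_of_rva_and_sizes):
    """Return a list of notes about header values that differ from the defaults."""
    flags = []
    expected_base = DEFAULT_IMAGE_BASE_DLL if is_dll else DEFAULT_IMAGE_BASE_EXE
    if image_base != expected_base:
        flags.append("ImageBase 0x%X differs from default 0x%X"
                     % (image_base, expected_base))
    if number_of_rva_and_sizes != DEFAULT_RVA_COUNT:
        flags.append("NumberOfRvaAndSizes is %d, default is %d"
                     % (number_of_rva_and_sizes, DEFAULT_RVA_COUNT))
    return flags


# Values taken from the sample output earlier in the post
print(header_flags(0x400000, False, 16) or "nothing unusual")
```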

## Print section details and entropy
def sections_analysis():
    print "Number of Sections:", pe.FILE_HEADER.NumberOfSections
    print "Section VirtualAddress VirtualSize SizeofRawData Entropy"
    for section in pe.sections:
        print "%-8s" % section.Name, "%-14s" % hex(section.VirtualAddress), "%-11s" % hex(section.Misc_VirtualSize),\
            "%-13s" % section.SizeOfRawData, "%.2f" % E(section.get_data())  ## argument assumed: the section's raw bytes

The next function called is "sections_analysis". It outputs each section's name, virtual address, virtual size, size of raw data and entropy. This information is valuable because certain packers will modify these attributes. Packers will sometimes rename sections, create new sections, or modify other attributes. For a detailed analysis of how packers work please see Websense's "The History of Packing Technology".

Number of Sections: 3

Section  VirtualAddress VirtualSize SizeofRawData Entropy
UPX0     0x1000         0x42000     0             0.00
UPX1     0x43000        0x36000     217600        7.93
.rsrc    0x79000        0x2000      7680          4.00

The above output gives hints as to what type of packer was used just by reviewing the section names. In the case of the test executable, Putty.exe was compressed using UPX, and the string UPX can be observed in the section names. Two other attributes that should be mentioned are the zero raw data size of UPX0 and the high entropy of UPX1. During execution of a packed executable, UPX needs to write the uncompressed data to a section. The section UPX0, with a raw size of zero, is used for storing the uncompressed data. Not all packers use this technique, but it is another characteristic that should be noted.
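
The three observations above (packer-style names, a section with no raw data, and high entropy) can be turned into a rough per-section check. This is a sketch in Python 3 using only the standard library. The Section tuple, the 7.0 entropy cut-off and the one-entry name list are assumptions of mine, and the numbers are copied from the table above.

```python
from collections import namedtuple

Section = namedtuple("Section", "name virtual_size raw_size entropy")

HIGH_ENTROPY = 7.0            # assumed cut-off, tune it for your own samples
PACKER_NAME_HINTS = ("UPX",)  # assumed and deliberately tiny list


def section_flags(section):
    """Return a list of reasons a section looks packed (empty if none)."""
    flags = []
    if section.raw_size == 0 and section.virtual_size > 0:
        flags.append("no raw data but occupies memory")
    if section.entropy >= HIGH_ENTROPY:
        flags.append("high entropy")
    if any(hint in section.name.upper() for hint in PACKER_NAME_HINTS):
        flags.append("packer-style section name")
    return flags


sections = [
    Section("UPX0", 0x42000, 0, 0.00),
    Section("UPX1", 0x36000, 217600, 7.93),
    Section(".rsrc", 0x2000, 7680, 4.00),
]
for s in sections:
    print("%-8s %s" % (s.name, ", ".join(section_flags(s)) or "nothing unusual"))
```

Ordinary programs can also contain a section with zero raw size (uninitialized data), so a single flag is only a hint. Several flags on the same file are more convincing.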

## Entropy calculation from Ero Carrera's blog ###############
def E(data):
    entropy = 0
    if not data:
        return 0
    for x in range(256):
        p_x = float(data.count(chr(x)))/len(data)
        if p_x > 0:
            entropy += - p_x*math.log(p_x, 2)
    return entropy

Entropy is a measurement of how organized or disorganized data is. The more random the data is, the higher the entropy will be. Packers apply an algorithm to either compress the data or obfuscate it. The packed output has a higher entropy and is more random than the original compiled executable. The "E" function returns a value between 0.0 and 8.0. The closer the entropy is to 8.0, the higher the chances that the section is packed or obfuscated. Please see here for more details.
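
For readers on Python 3, where file data is a bytes object and chr(x) no longer matches it, the same Shannon entropy calculation can be sketched as follows. The function name shannon_entropy is my own; the formula is the same sum of -p(x) * log2(p(x)) over the byte values that occur in the data.

```python
import math
from collections import Counter


def shannon_entropy(data):
    """Shannon entropy of a bytes object in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    entropy = 0.0
    # Counter only holds byte values that actually occur, so p_x is never zero
    for count in Counter(data).values():
        p_x = count / total
        entropy -= p_x * math.log2(p_x)
    return entropy


print(shannon_entropy(b"A" * 64))          # one repeated byte: no randomness
print(shannon_entropy(bytes(range(256))))  # every byte value once: maximum
```

Counting all bytes in a single pass also avoids the 256 separate data.count() calls, which helps on large sections.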

## Load PEID userdb.txt database and scan file
def PEID():
    signatures = peutils.SignatureDatabase('userdb.txt')
    matches = signatures.match_all(pe, ep_only=True)
    print "PEID Signature Match(es): ", matches

PEID is a tool for detecting common packers, cryptors and compilers for PE files. Identifying the algorithm used to compress or obfuscate the PE file can speed up analysis: Googling the name of the packer together with the string "tutorial" will usually return a write-up on how to unpack the file. Pefile can use PEID's user database (userdb.txt) to scan a PE file. The main PEID user database was created by BobSoft and can be downloaded here. Many individuals use Panda Anti-Virus's userdb.txt. A warning to users of Panda's userdb.txt: Pefile will throw exceptions and will not work due to some non-standard characters. It's recommended to use BobSoft's userdb.txt.

PEID Signature Match(es): [['UPX 2.90 [LZMA] -> Markus Oberhumer, Laszlo Molnar & John Reiser']]
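
To give a feel for what a signature match involves, here is a simplified sketch (Python 3, standard library only). A userdb.txt entry pairs a name with a string of hex bytes in which ?? stands for any byte, and ep_only = true limits the comparison to the bytes at the entry point. The two signatures and the sample bytes below are made up for illustration, and this is not how Peutils implements matching internally.

```python
def parse_signature(text):
    """Turn '60 BE ?? 8D' into [0x60, 0xBE, None, 0x8D] (None is a wildcard)."""
    return [None if token == "??" else int(token, 16) for token in text.split()]


def matches_at(data, offset, pattern):
    """True if pattern matches the bytes of data starting at offset."""
    window = data[offset:offset + len(pattern)]
    if len(window) < len(pattern):
        return False
    return all(want is None or want == got for want, got in zip(pattern, window))


# Hypothetical signature database and entry point bytes, for illustration only
signatures = {
    "Example Packer 1.0": parse_signature("60 BE ?? ?? ?? ?? 8D BE"),
    "Other Packer 2.0": parse_signature("55 8B EC ?? 90"),
}
entry_point_bytes = bytes.fromhex("60 BE 00 50 47 00 8D BE 00 C0")

matches = [name for name, pattern in signatures.items()
           if matches_at(entry_point_bytes, 0, pattern)]
print("Signature match(es):", matches)
```

A real database holds thousands of entries, so a linear scan like this would be slow. It is only meant to show what the wildcards do.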

The next function is called IAT(). It displays every imported DLL and the API names imported from it.

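## Dump Imports
def IAT():
    print "Imported DLLS:"
    i = 1
    ## Loop header and imp.name are assumed: the standard Pefile import iteration
    for entry in pe.DIRECTORY_ENTRY_IMPORT:
        first = 1 ## For formatting
        print "%2s" % [i], "%-17s" % entry.dll
        print "\t",
        for imp in entry.imports:
            if first:
                print "%-1s" % imp.name,
                first = 0
            else:
                sys.stdout.write("%s%s" % (", ", imp.name)) # Python print adds a blank space
        print
        i += 1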

Reviewing all API calls can help give a high level view of the behavior of the executable. If the API list is sparse, this might be a sign that the executable has had its import table removed. The standard "Hello, World" compiled in C using LCC would contain 19 API names. If there are only three APIs listed, such as LoadLibrary, GetProcAddress and ExitProcess, odds are the file is packed or obfuscated in some manner.

Imported DLLS:
LoadLibraryA, GetProcAddress, ExitProcess
[2] ADVAPI32.dll
[3] COMCTL32.dll
[4] comdlg32.dll
[5] GDI32.dll
[6] IMM32.dll
[7] SHELL32.dll
[8] USER32.dll
[9] WINMM.dll
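
The rule of thumb about sparse imports can be written down as a small check. This is a sketch in Python 3 with the standard library only. The function name looks_import_stripped is hypothetical, and using the 19 names of the "Hello, World" build as the cut-off is my own choice of threshold.

```python
SPARSE_LIMIT = 19  # assumed threshold: the "Hello, World" build imports 19 names


def looks_import_stripped(imports):
    """imports maps a DLL name to the list of API names imported from it.

    Returns True when few names are imported and the ones needed to
    resolve APIs at run time (LoadLibrary*, GetProcAddress) are present.
    """
    names = [name for apis in imports.values() for name in apis if name]
    can_resolve_at_runtime = (
        "GetProcAddress" in names
        and any(name.startswith("LoadLibrary") for name in names)
    )
    return len(names) < SPARSE_LIMIT and can_resolve_at_runtime


packed = {"KERNEL32.DLL": ["LoadLibraryA", "GetProcAddress", "ExitProcess"]}
print(looks_import_stripped(packed))
```

With Pefile, the dictionary could be built from entry.dll and imp.name inside the IAT() loop.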

The final function calls a local command line anti-virus scanner.

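## Print Sophos
def sophos(filetmp):
    print "Sophos Scan in progress.."
    output = "None"
    path = os.path.abspath(filetmp)
    pwd = os.getcwd()
    ## Run the scanner and capture its standard output (Popen call assumed)
    output = subprocess.Popen([os.path.join(pwd, 'cmd_scan', 'Sophos', 'SAV32CLI.EXE'), path],
                              stdout=subprocess.PIPE).communicate()[0]
    print output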

Sophos Scan in progress..
Sophos Anti-Virus
Version 4.52.0 [Win32/Intel]
Virus data version 4.52E, April 2010
Includes detection for 1542982 viruses, trojans and worms
Copyright (c) 1989-2010 Sophos Plc. All rights reserved.

System time 21:36:32, System date 03 May 2010

Quick Scanning

1 file swept in 3 seconds.
No viruses were discovered.
Ending Sophos Anti-Virus.

This function will need to be copied and modified for each scanner. There are a few free anti-virus scanners. The function above uses Sophos's free command line scanner, which can be downloaded here. Other free scanners include ClamAV, Panda and a couple of others. ClamAV can be used as a portable application and does not need to be fully installed. The portable ClamAV can be downloaded here. Most purchased anti-virus applications do have a command line interface.
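
Rather than copying the function for every scanner, one option is a single helper that takes the scanner's command line as a parameter. The following is a sketch in Python 3 using only the standard library. The name run_scanner is hypothetical, and the "scanner" used in the demo is just a one-line Python script standing in for a real product so the example runs anywhere.

```python
import subprocess
import sys


def run_scanner(command, target, timeout=300):
    """Run a command line scanner against target.

    command is the scanner invocation as a list, for example the path to
    SAV32CLI.EXE or ['clamscan'] (assumptions about what is installed).
    Returns (exit_code, output_text).
    """
    completed = subprocess.run(
        list(command) + [target],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,   # fold error messages into the report
        universal_newlines=True,    # decode the output to text
        timeout=timeout,
    )
    return completed.returncode, completed.stdout


# Stand-in "scanner" for the demo: prints the file name it was given
fake_scanner = [sys.executable, "-c", "import sys; print('scanned ' + sys.argv[1])"]
code, report = run_scanner(fake_scanner, "sample.exe")
print(code, report.strip())
```

Exit code conventions differ from scanner to scanner, so check each product's documentation before treating a particular code as "infected" or "clean".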

if len(sys.argv) < 2:
    print "Python Script "
    sys.exit(1)
exename = sys.argv[1]
pe = pefile.PE(exename)
print "\nPortable Executable Information"


This script needs to have a Portable Executable file passed to it. Once a file is passed, Pefile will load the executable and the script will then call all the functions.

In closing, this script is a quick example of using Python to analyze an unknown executable file. The current version is 0.01. Over time I'll keep updating the script. As previously stated, if there are any questions or ideas please leave a comment.


  1. Nicely done.

    The big remaining question, however, is how good static analysis of PE files is compared to files actually running.

    I know some AV's have a small emulator inside them (kind of like Wine) that "emulates" the file to some extent before actually allowing it to launch

  2. "however, is how good static analysis of PE files is compared to files actually running".

    Good question @smash, I could probably write a whole page on this. I would say that it depends on the end result. I mainly analyze the portable executable to get a good idea if the file is packed/obfuscated. If the PE file isn't packed I'll open it up in IDA for static analysis. If the file is packed I'll open it up in Ollydbg for dynamic analysis. Running an executable is a great way to get high level details such as files written to disk, registry traces or network connections. What about in the situation where the network traffic is obfuscated with a logical ROR? Then your boss says I want to know what that obfuscated traffic is. In this situation static analysis is more useful.

    I'm a kinetic learner so I do a hybrid of static and dynamic analysis. I have IDA in one window and a VM with Ollydbg in another. I comment almost every line of code in IDA. My theory is if you aren't writing you aren't reversing. I'm currently trying to shift my brain from relying on dynamic analysis and mainly rely on static analysis. This is forcing me to focus on the data being passed and not just the APIs getting called. For anyone else trying to do this I would recommend focusing on shellcode. My next post might be about this.

  3. We are on the same page regarding our own efforts to analyze binaries either statically or dynamically. I believe in both too :)

    I was more thinking about what chances command line AV's have of actually getting the binary unpacked in their own "emulator" and THEN looking for signatures or behaviour. They have no chance for either before it's unpacked in memory.

    I wonder what Virustotal does. I can't imagine them having a cluster of VM's launching with a unique AV product inside each every time they receive a file :)

  4. Ahh, I see what you're asking :) From what I have heard (second hand) VirusTotal is only command line scanners.

    I pinged a friend of mine who works in the AV industry in regards to the command line scanner and he had a good point. Some cli scanners and full clients use the same engine. Most of these engines have the capabilities for unpacking files and scanning for signatures. But what are the limitations or differences between the cli scanner and the full clients? That's a very interesting question.

  5. how to run this code in linux mint?