extflow.py version 2

Note: The IP addresses have been removed
The original script had some logic flaws that I hadn't noticed. Thankfully, a helpful Redditor by the name of pi-rho pointed them out. The flaws involved gzipped streams, multiple files in a stream, and the use of an old version of tcpflow.

What is extflow.py? It's a script that uses tcpflow to separate a pcap into streams and then checks the headers of those streams for file signatures. There are a number of similar tools, but I have found that they get confused when scanning files that contain embedded files, such as a PDF that contains a SWF. In the image above we can see a pcap that was used to capture the payload from a drive-by exploit.

The first thing we need in order to run extflow.py is the newest version of tcpflow. The newest version can be downloaded from the following link. Most repositories will not have the new version if you try installing via yum or apt-get. I'd recommend downloading the source with wget, decompressing it, and then running ./configure, make, and make install.

Once tcpflow is installed, we just need to pass the script the pcap. The script will call tcpflow with the -AH -r flags.
  • -AH : extract HTTP objects and unzip GZIP-compressed HTTP messages
  • -r: read packets from tcpdump pcap file (may be repeated)
All the streams will be written to a directory named tcpflow_out in the working directory. Every file with a known header is then carved out and written to the working directory, named after its MD5 hash plus the matching extension. This can be very useful when you have to carve an executable file out of a stream.
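
The tcpflow call described above can be sketched as follows. This is a minimal illustration written for Python 3, not the script itself: the helper names build_tcpflow_cmd and run_tcpflow are hypothetical, and the flags are the ones the script passes (-o for the output directory, plus -AH and -r).

```python
import os
import subprocess


def build_tcpflow_cmd(pcap_path, out_dir="tcpflow_out"):
    """Return the argument list for the tcpflow call (hypothetical helper)."""
    # -o: output directory, -AH: extract HTTP objects and unzip gzipped
    # HTTP messages, -r: read packets from the given pcap file
    return ["tcpflow", "-o", out_dir, "-AH", "-r", pcap_path]


def run_tcpflow(pcap_path, out_dir="tcpflow_out"):
    """Run tcpflow and return the names of the extracted HTTP body files."""
    proc = subprocess.Popen(build_tcpflow_cmd(pcap_path, out_dir),
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    # communicate() drains both pipes while waiting for tcpflow to exit
    proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError("tcpflow exited with status %d" % proc.returncode)
    # keep only the files whose names contain HTTPBODY, as the script does
    return sorted(name for name in os.listdir(out_dir) if "HTTPBODY" in name)
```

Calling run_tcpflow('dump.pcap') would require tcpflow to be installed and on the PATH; build_tcpflow_cmd can be inspected on its own without running anything.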

To do:
 Integrate hachoir-subfile for identifying files.

extflow.py  - download

#!/usr/bin/env python

# extflow.py v.2 created by alexander.hanel@gmail.com
# This is a simple script that will carve out files
# from streams created by tcpflow.

# Example usage: find ../ -name 'dump.pcap' -exec ./extflow.py {} \;

import hashlib
import os.path
import sys
import re
from StringIO import StringIO
import subprocess as sub

def MD5(d):
    # d = buffer of the read file
    # This function hashes the buffer
    # source: http://stackoverflow.com/q/5853830
    if type(d) is str:
        d = StringIO(d)
    md5 = hashlib.md5()
    while True:
        data = d.read(128)
        if not data:
            break
        md5.update(data)
    return md5.hexdigest()

def check_tcpflow_ver():
    p = sub.Popen(['tcpflow', '-V'], stdout=sub.PIPE, stderr=sub.PIPE)
    out = p.communicate()[0]
    # for longevity reasons, this is crappy logic to check for the version 
    if 'tcpflow 1.' not in out and 'tcpflow 2.' not in out:
        print "\t[ERROR] Please download tcpflow 1.0 or higher"
        print "\tDownload: https://github.com/simsong/tcpflow/"
        sys.exit(1)

def ext(header):
    # To add a new signature, add your own elif statement:
    #    elif 'FILE SIGNATURE' in header:
    #        return 'FILE EXTENSION'
    if 'MZ' in header:
        return '.mz'
    elif 'FWS' in header:
        return '.swf'
    elif 'CWS' in header:
        return '.swf'
    elif 'html' in header:
        return '.html'
    elif '\x50\x4B\x03\x04\x14\x00\x08\x00\x08' in header:
        return '.jar'
    elif 'PK' in header:
        return '.zip'
    elif 'PDF' in header:
        return '.pdf'
    else:
        return '.bin'     
    
def parse_out_data(f_handle):
    parsed_data = []
    data = f_handle.read()
    if len(data):
        addr_http_200 = [tmp.start() for tmp in re.finditer('HTTP/1\.1 200 OK',data)]
        if len(addr_http_200) == 0:
            # return if single file in stream
            parsed_data.append(data)
            return parsed_data
        
        # multiple files in the stream 
        else:
            # get first file located at addr 0
            index = 0
            #for x in addr_http_200: print hex(x),
            #print      
            for c, addr in enumerate(addr_http_200):
                # index = start, addr is the next HTTP/1.1 200 OK
                parsed_data.append(data[index:addr])
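                # find the blank line that ends this response's headers
                newline_addr = data.find('\x0d\x0a\x0d\x0a', addr)
                if newline_addr == -1:
                    # no header terminator found; nothing left to carve
                    index = len(data)
                else:
                    index = newline_addr + 4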
                if c+1 == len(addr_http_200):
                    parsed_data.append(data[index:])
                    
            return parsed_data
    else:
        # length is zero no data to parse out 
        return None 
    
    
def main():
    check_tcpflow_ver()
    try:
        with open(sys.argv[1]) as f: pass
    except Exception:
        print "\t[ERROR] File could not be accessed"
        sys.exit(1)
    p = sub.Popen(['tcpflow', '-o', 'tcpflow_out','-AH', '-r', sys.argv[1]], stdout=sub.PIPE, stderr=sub.PIPE)
    # communicate() drains the pipes; wait() alone can block if they fill up
    p.communicate()
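    dire = os.path.join(os.getcwd(), 'tcpflow_out')
    for infile in os.listdir(dire):
        # skip everything except the HTTP bodies extracted by tcpflow
        if 'HTTPBODY' not in infile:
            continue
        # the with block closes the stream file once it has been read
        with open(os.path.join(dire, infile), 'rb') as f:
            parsed_results = parse_out_data(f)
        if parsed_results is None:
            continue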
        else:
            for emb_files in parsed_results:
                ex = ext(emb_files[:20])
                if 'bin' in ex:
                    continue
                o = open(MD5(emb_files)+ex,'wb')
                o.write(emb_files)
                o.close()
        
if __name__ == '__main__':
    main()
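
The ext() function above matches a signature anywhere in the first 20 bytes, so a short string such as 'MZ' or 'PK' can match by accident. One possible variation, sketched here for Python 3 and not part of the script, anchors each signature at offset zero. The names SIGNATURES and ext_anchored are hypothetical; the byte patterns are the ones from ext(), with the ZIP and PDF entries lengthened to their usual leading bytes (PK\x03\x04 and %PDF).

```python
# Hypothetical signature table: (magic bytes expected at offset 0, extension).
# Order matters: the longer JAR pattern from ext() is tried before plain ZIP.
SIGNATURES = [
    (b"MZ", ".mz"),
    (b"FWS", ".swf"),
    (b"CWS", ".swf"),
    (b"\x50\x4B\x03\x04\x14\x00\x08\x00\x08", ".jar"),
    (b"PK\x03\x04", ".zip"),
    (b"%PDF", ".pdf"),
]


def ext_anchored(data):
    """Return an extension for data, or '.bin' if no signature matches."""
    for magic, extension in SIGNATURES:
        # startswith() only accepts the signature at the very first byte
        if data.startswith(magic):
            return extension
    # HTML has no fixed magic bytes, so fall back to a loose check
    if b"<html" in data[:20].lower():
        return ".html"
    return ".bin"
```

The trade-off is that an anchored check misses files preceded by leading whitespace or other bytes, which the looser 'in header' test in ext() would still catch.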

2 comments:

  1. I tried your script. However, the md5sum of the carved-out file is different from that of the actual file downloaded via a browser.

    1. Due to the use of tcpflow and manual file extraction, it is not surprising that the hashes don't match. If you are trying to match MD5 hashes, I would not recommend using my code. Odds are that Bro, NetworkMiner, etc. would be a better choice. Cheers.
