extflow.py version 2

Note: The IP addresses have been removed
The original script had some logic flaws that I hadn't noticed. Thankfully, a helpful Redditor by the name of pi-rho pointed them out. The flaws had to do with gzipped streams, multiple files in a single stream, and the use of an old version of tcpflow.
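
To make the gzip flaw concrete: a gzip-compressed HTTP body starts with the bytes 1f 8b instead of the file's own signature, so a header check on the raw stream misses it. Below is a minimal Python 3 sketch of that idea. The helper name maybe_gunzip is made up for illustration and is not part of extflow.py; version 2 leaves this job to tcpflow's -AH flag.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip member


def maybe_gunzip(body):
    """Return the decompressed body if it is gzip data, else the body unchanged.

    Hypothetical helper for illustration only; extflow.py relies on tcpflow -AH.
    """
    if body.startswith(GZIP_MAGIC):
        try:
            return gzip.decompress(body)
        except (OSError, EOFError, zlib.error):
            # truncated or corrupt gzip data: fall back to the raw bytes
            return body
    return body


payload = b"MZ\x90\x00" + b"\x00" * 16   # fake start of a Windows executable
compressed = gzip.compress(payload)

# The MZ signature is hidden behind the gzip header...
print(compressed.startswith(b"MZ"))
# ...and visible again after decompression.
print(maybe_gunzip(compressed).startswith(b"MZ"))
```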

What is extflow.py? It's a script that uses tcpflow to separate a pcap into streams and then checks the headers of the streams for file signatures. There are a number of similar tools, but I have found that they get confused when scanning files that contain embedded files. An example would be a PDF that contains a SWF. In the image above we can see a pcap that was used to capture the payload from a drive-by exploit.

The first thing we need in order to run extflow.py is the newest version of tcpflow. It can be downloaded from the tcpflow GitHub page (https://github.com/simsong/tcpflow/, the same URL the script prints when its version check fails). Most repositories will not have the new version if you try installing via yum or apt-get. I'd recommend fetching the source with wget, decompressing it, and then running ./configure, make, and make install. After that we should be ready to go.

Once tcpflow is installed we just need to pass the script the pcap. The script will call tcpflow with the -AH and -r flags:
  • -AH : extract HTTP objects and unzip GZIP-compressed HTTP messages
  • -r: read packets from tcpdump pcap file (may be repeated)
All the streams will be written to a directory named tcpflow_out in the working directory. Streams that contain a known header are then written to the working directory itself, named after their MD5 hash plus the matching extension. This can be very useful when you need to carve an executable file out of a stream.
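
As a rough Python 3 sketch of that tcpflow call (the flags and the tcpflow_out directory are the ones the script uses; the function names are my own for illustration, and tcpflow is assumed to be on the PATH):

```python
import subprocess


def build_tcpflow_cmd(pcap_path, out_dir="tcpflow_out"):
    """Build the tcpflow argument list: -o output dir, -AH HTTP handling, -r pcap."""
    return ["tcpflow", "-o", out_dir, "-AH", "-r", pcap_path]


def run_tcpflow(pcap_path, out_dir="tcpflow_out"):
    """Run tcpflow and block until it has finished writing the streams."""
    # check=True raises CalledProcessError on a non-zero exit status
    subprocess.run(build_tcpflow_cmd(pcap_path, out_dir), check=True)


# Show the command line that would be executed for a capture named dump.pcap
print(" ".join(build_tcpflow_cmd("dump.pcap")))
```

Waiting for tcpflow to exit before reading tcpflow_out matters: otherwise the directory may still be empty or half written when the carving step starts.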

To do:
 Integrate hachoir-subfile for identifying files.
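
Until then, identification is a plain signature comparison. The script tests whether a signature appears anywhere in the first 20 bytes; a stricter variant would anchor each signature at offset 0. Here is a minimal Python 3 sketch of that variant, not the script's actual logic. The names SIGNATURES, guess_ext and carved_name are hypothetical, the byte values are the ones from the script, and the 'html' check is left out because it is not a fixed-offset signature.

```python
import hashlib

# (magic bytes, extension) pairs; order matters because a JAR is also a ZIP
SIGNATURES = [
    (b"MZ", ".mz"),
    (b"FWS", ".swf"),
    (b"CWS", ".swf"),
    (b"\x50\x4b\x03\x04\x14\x00\x08\x00\x08", ".jar"),
    (b"PK", ".zip"),
    (b"%PDF", ".pdf"),
]


def guess_ext(data):
    """Return the extension for data whose first bytes match a known signature."""
    for magic, extension in SIGNATURES:
        if data.startswith(magic):
            return extension
    return ".bin"  # unknown content


def carved_name(data):
    """Name a carved file after its MD5 digest plus the guessed extension."""
    return hashlib.md5(data).hexdigest() + guess_ext(data)


print(guess_ext(b"%PDF-1.4 ..."))   # matches the PDF signature at offset 0
print(guess_ext(b"<p>MZ</p>"))      # 'MZ' is present but not at offset 0
```

Anchoring at offset 0 avoids false positives such as text that merely contains "MZ" or "PK", at the cost of missing files that are preceded by leftover header bytes.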

extflow.py  - download

#!/usr/bin/env python

# extflow.py v.2 created by alexander.hanel@gmail.com
# This is a simple script that will carve out files
# from streams created by tcpflow.

# find ../ -name 'dump.pcap' -exec ./ext-flow.py {} \; 

import hashlib
import os.path
import sys
import re
from StringIO import StringIO
import subprocess as sub

def MD5(d):
    # d = buffer of the read file
    # This function hashes the buffer
    # source: http://stackoverflow.com/q/5853830
    if type(d) is str:
        d = StringIO(d)
    md5 = hashlib.md5()
    while True:
        # hash the buffer in 128-byte chunks until it is exhausted
        data = d.read(128)
        if not data:
            break
        md5.update(data)
    return md5.hexdigest()

def check_tcpflow_ver():
    p = sub.Popen(['tcpflow', '-V'], stdout=sub.PIPE, stderr=sub.PIPE)
    out = p.communicate()[0]
    # for longevity reasons, this is crappy logic to check for the version
    if 'tcpflow 1.' not in out and 'tcpflow 2.' not in out:
        print "\t[ERROR] Please download 1.0 or higher"
        print "\tDownload: https://github.com/simsong/tcpflow/"
        # no point in continuing with an unsupported tcpflow
        sys.exit(1)

def ext(header):
    # To add a new signature add your own elif statement
    #    elif 'FILE SIGNATURE' in header:
    #        return 'FILE EXTENSION'
    if 'MZ' in header:
        return '.mz'
    elif 'FWS' in header:
        return '.swf'
    elif 'CWS' in header:
        return '.swf'
    elif 'html' in header:
        return '.html'
    elif '\x50\x4B\x03\x04\x14\x00\x08\x00\x08' in header:
        return '.jar'
    elif 'PK' in header:
        return '.zip'
    elif 'PDF' in header:
        return '.pdf'
    else:
        # no known signature found
        return '.bin'

def parse_out_data(f_handle):
    parsed_data = []
    data = f_handle.read()
    if len(data):
        addr_http_200 = [tmp.start() for tmp in re.finditer(r'HTTP/1\.1 200 OK', data)]
        if len(addr_http_200) == 0:
            # return if single file in stream
            parsed_data.append(data)
            return parsed_data
        # multiple files in the stream
        else:
            # get first file located at addr 0
            index = 0
            #for x in addr_http_200: print hex(x),
            for c, addr in enumerate(addr_http_200):
                # index = start, addr is the next HTTP/1.1 200 OK
                parsed_data.append(data[index:addr])
                # skip past the blank line that ends the HTTP headers
                newline_addr = data[addr:-1].find('\x0d\x0a\x0d\x0a')
                index = addr + (newline_addr + 4)
                if c+1 == len(addr_http_200):
                    # last response: keep everything up to the end of the stream
                    parsed_data.append(data[index:])
            return parsed_data
    else:
        # length is zero no data to parse out
        return None
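
def main():
    check_tcpflow_ver()
    try:
        with open(sys.argv[1]) as f: pass
    except Exception:
        print "\t[ERROR] File could not be accessed"
        sys.exit(1)
    p = sub.Popen(['tcpflow', '-o', 'tcpflow_out','-AH', '-r', sys.argv[1]], stdout=sub.PIPE, stderr=sub.PIPE)
    # wait for tcpflow to finish writing the streams before reading them
    p.communicate()
    dire = os.path.join(os.getcwd(), 'tcpflow_out')
    for infile in os.listdir(dire):
        # only the extracted HTTP bodies are checked for file signatures
        if 'HTTPBODY' not in infile:
            continue
        with open(os.path.join(dire, infile), 'rb') as f:
            parsed_results = parse_out_data(f)
        if parsed_results is None:
            continue
        for emb_files in parsed_results:
            ex = ext(emb_files[:20])
            # skip data without a known signature
            if 'bin' in ex:
                continue
            # name the carved file after its MD5 hash plus the extension
            with open(MD5(emb_files) + ex, 'wb') as o:
                o.write(emb_files)

if __name__ == '__main__':
    main()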
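
The "multiple files in a stream" case is the least obvious part of the script, so here is a small standalone Python 3 sketch of the same idea: find every HTTP/1.1 200 OK status line, skip the headers that follow it, and keep the bytes up to the next status line. The function name split_bodies is made up for illustration. Unlike the script, this sketch drops any data that precedes the first response, and like the script it is naive: a body that happens to contain the status line text would be split in the wrong place.

```python
HEADER_END = b"\r\n\r\n"          # blank line that terminates the HTTP headers
STATUS_OK = b"HTTP/1.1 200 OK"


def split_bodies(stream):
    """Return the body of every 'HTTP/1.1 200 OK' response found in a byte stream."""
    # collect the offset of every status line
    starts = []
    pos = stream.find(STATUS_OK)
    while pos != -1:
        starts.append(pos)
        pos = stream.find(STATUS_OK, pos + 1)

    bodies = []
    for i, start in enumerate(starts):
        # a body ends where the next response begins, or at the end of the stream
        end = starts[i + 1] if i + 1 < len(starts) else len(stream)
        header_end = stream.find(HEADER_END, start)
        if header_end == -1 or header_end >= end:
            continue  # headers never terminated; nothing to carve
        bodies.append(stream[header_end + len(HEADER_END):end])
    return bodies


stream = (
    b"HTTP/1.1 200 OK\r\nContent-Type: a\r\n\r\nMZfirst"
    b"HTTP/1.1 200 OK\r\n\r\n%PDF-second"
)
for body in split_bodies(stream):
    print(body)
```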


  1. I tried your script. However, the md5sum of the carved-out file is different from that of the actual file downloaded via a browser.

    1. Due to the use of tcpflow and manual file extraction, it is not surprising that the hashes don't match. If you are trying to match MD5 hashes, I would not recommend using my code. Odds are Bro, NetworkMiner, etc. would be a better choice. Cheers.