Deobfuscating BlackHole V2 HTML Pages with Python


Obfuscated Code
This post is a quick walkthrough of deobfuscating the HTML pages of the BlackHole Exploit Kit version 2. In the image above, the obfuscation looks intimidating at first glance, but we can deobfuscate it in fewer than 25 lines of Python. The first thing to note is the location of the large amount of random-looking data: it is not between the start and end "div" tags. If we used Python's HTMLParser or another HTML parsing tool to pull out the contents of the tag, it would not return the data. This is likely because the data is stored inside the opening tag, in the area where attributes normally go. A large amount of random-looking data like this is a good sign that something is bad.

There are two components to deobfuscation: the first is the data and the second is the algorithm. We can see the data above. If we followed the data long enough, we would start to notice a pattern of 'a' + int + '=' + '"' + data + '"'. If we kept following the data, eventually we would hit the algorithm.
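As a rough illustration of the attribute pattern described above, here is a minimal sketch that pulls the values out of a tag and joins them in index order. The sample tag, its attribute values, and the name attr_pattern are made up for this example; real pages carry far more data and may not number their attributes this way.

```python
import re

# Hypothetical, heavily shortened stand-in for an obfuscated tag.
sample = '<div a0="3#5!" a1="3%6"></div>'

# Match attributes named 'a' + integer and capture the index and quoted value.
attr_pattern = re.compile(r'\ba(\d+)="([^"]*)"')

# Sort by the numeric index so the values are joined in order.
pairs = sorted((int(idx), val) for idx, val in attr_pattern.findall(sample))
joined = ''.join(val for _, val in pairs)
print(joined)  # 3#5!3%6
```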

Algorithm
The first part is simple: get all attributes named 'a' + index and append their values to r. This is the same pattern we saw in the data. The second part is a regular expression that removes every character in r that matches [^012a-z3-9], in other words anything that is not a digit or a lowercase letter. The final part of the algorithm reads two characters at a time, converts them from base 33 to base 10, takes the ASCII character for that value, and appends it to a buffer. After that we will have the following representation.
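To make the last two steps concrete, here is a small worked sketch. The function name decode, the radix parameter, and the filler characters in the sample string are assumptions for illustration only. The character class [^0-9a-z] used here matches the same characters as the [^012a-z3-9] class discussed above.

```python
import re

def decode(blob, radix=33):
    """Strip filler characters, then decode two-digit base-`radix` pairs."""
    cleaned = re.sub(r'[^0-9a-z]', '', blob)
    return ''.join(chr(int(cleaned[i:i + 2], radix))
                   for i in range(0, len(cleaned), 2))

# 'h' is 104 = 3*33 + 5, written "35"; 'i' is 105 = 3*33 + 6, written "36".
# The '#', '!' and '%' characters are made-up filler that gets stripped.
print(decode('3#5!3%6'))  # hi
```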


Here's the Python code. This code won't work on every sample. Hopefully it's a good enough example to show you how to deobfuscate the HTML from the command line.

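import re
from io import StringIO  # Python 3; on Python 2 use "from StringIO import StringIO"

# Read the obfuscated landing page.
with open('ded90f8567245c37f1b038e7f8dd3355.html', 'r') as f:
    d = f.read()

regex = re.compile(r'<div.*></div>', re.S)   # the tag that carries the data
attrs = re.compile(r'\".+?\"', re.S)         # quoted attribute values
bhPat = re.compile(r'[^012a-z3-9]', re.S)    # anything that is not 0-9 or a-z

k = re.search(regex, d)
parsed = k.group(0)

# Concatenate every attribute value, dropping the surrounding quotes.
o = ''
for x in re.finditer(attrs, parsed):
    o = o + x.group(0)[1:-1]

# Strip the filler characters, then decode two characters at a time.
o2nd = StringIO(bhPat.sub('', o))
t = ''
while True:
    a = o2nd.read(2)
    if not a:
        break
    t = t + chr(int(a, 33))  # base-33 pair -> character

with open('out.html', 'w') as outf:
    outf.write(t)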
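
The script hard-codes base 33, which may be one reason it does not work on every sample. A commenter below describes guessing the radix from the input instead, by assuming the highest possible digit appears somewhere in the cleaned data. The following is only a sketch of that heuristic, not the commenter's actual script. The name guess_radix is hypothetical, and the guess will be too low whenever the highest digit happens not to occur in the data.

```python
import re

def guess_radix(blob):
    """Guess the radix as (highest digit seen) + 1, after stripping filler."""
    cleaned = re.sub(r'[^0-9a-z]', '', blob)
    if not cleaned:
        raise ValueError('no digits found in input')
    # In ASCII, '0'-'9' sort below 'a'-'z', so max() returns the highest digit.
    return int(max(cleaned), 36) + 1

# 'w' is the highest digit in base 33 (value 32), so this input yields 33.
print(guess_radix('2w35'))  # 33
```

The returned value could then be passed to int(a, radix) in place of the fixed 33.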


For anyone wanting a quick visual of what the new BlackHole Exploit Kit URLs look like, please see below.
These URLs are created using a file called words.dat. The dat file contains 4,378 different words (Pastebin LINK). A URL usually contains three to four words, separated by either an underscore (_) or a dash (-).

Original HTML of the BlackHole Kit - Link
Deobfuscated version - Link


5 comments:

  1. Good one !!!!. Excellent

  2. Hello,

    very good and brief explanation.

    I tried the same with shell scripts and C programming and looked for a way to calculate the changing radix from the input.

    The digits can run from 0 to 9 and a to z. If you assume that the highest digit is used somewhere in the input, the radix can be guessed by scanning the input after deleting all non-allowed characters. This works for all landing pages I got.

    See my solution under : How to deobfuscate Blackhole Java-Script

  3. "These URLS are created using a file called words.dat." -- where/how did you get this info and list?

    Replies
    1. The words.dat came from a site hosting a blackhole exploit kit with an open directory.

  4. Hello, today I found a slightly changed method of obfuscation. They do not insert random characters but pairs of characters starting with an "=".

    You can expand the program by adding a delete expression before deleting the random characters and the program will work for both.

    They also used more than 100 attributes in the tag.


    I modified my own shell script in this way.

    See my modified script at: Blackhole with new Obfuscation
