Adventures in bruteforcing

Analyzing encodings and encryption algorithms in malware is fun. It's enjoyable to analyze the algorithm and see if it is possible to write a decoder. I have written 50+ encoders in Python for malware samples. Mainly around C2 or embedded executables. One of the most common encoding schemes seen is XOR. I love XOR for it's simplicity and malware authors do too. The only problem with XOR is I get tired of writing one-off scripts to decode some strings in a random piece of malware. Recently while reversing a piece of malware I noticed that each string was XORed with it's own key by a function. Since I already had IDA opened it was not a huge issue to create a quick one liner in IDA to XOR the strings. My next thought was what if I see this sample again? Do I really want to open it up in IDA, find the decoder function, then write a one-liner for each string passed to the function? Yeah, that sounds boring. I took a quick glance at the encoded data in a hex editor to see if each encoded string ended with the XOR key (strings end with \x00). That was a negative. The encoded data always started and ended with '\x00\x00'. Good to note. Below is an example of some encoded data that I made for this example.

Example Data:
0000090: f0ed e0b8 0000 0000 0000 0000 0000 0000  ................
00000a0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000b0: 576b 6a70 236a 7023 6e62 646a 6023 736f  Wkjp#jp#nbdj`#so
00000c0: 6260 6600 0000 0000 0000 0000 0000 0000  b`f.............
00000d0: 635f 5e44 175e 4417 5617 4352 4f43 1b17  c_^D.^D.V.CROC..
00000e0: 5859 5b4e 1756 1743 5244 4319 1906 0504  XY[N.V.CRDC.....
00000f0: 0300 0000 0000 0000 0000 0000 0000 0000  ................

As most people know XOR is substitution cypher. A major flaw of substitution cyphers is they can be broken either by frequency analysis or bruteforcing. I have my doubts that frequency analysis would be a good way to tackle this problem. Frequency analysis is useful for dictionary text but what about a URL such as http://google.com? It would be an interesting task to do frequency analysis of ascii strings extracted from compiled exeuctables files. If we were to dump all the ASCII strings in *.dlls and *.exe files in system32 we would get about 60 mb of strings. Maybe another post. This leaves us with the option of bruteforcing. The main problem with bruteforcing data for valid strings is the how do we know the range of the encoded data and if it's a valid string and not a false positives. With our data example above we know that the encoded data range starts and ends with '\x00'. We could create a regular expression to find all data that falls between those characters. Our regular expression pattern would be '\x00(?!\x00).+?\x00'. If the ranges were different we would need another regular expression pattern. First problem solved. Now lets assume that we have XORed each byte in the block of data with a static key between 0x1 and 0xff. Now we are at the second problem. How do we know if the XORed data is valid text? We know that a valid string must be printable ascii. If we XOR a block of data and the data contains a non-printable ASCII characters we can ignore that set. That will eliminate about 0x9b or 155 values between 0x0 - 0xff. This will make our set of false positives a little bit smaller.

Function for testing if valid Ascii:
def valid_ascii(char):
        if char in string.printable[:-3]:
                return True
        else:
                return None 

Output of string.printable

>>> string.printable
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'

Let's see what the output  would be from bruteforcing the example above would look like.

0xaf key 0x1  Vjkq"kq"oceka"rncag 
0xaf key 0x2  Uihr!hr!l`fhb!qm`bd 
0xaf key 0x3  This is magic place     !!! Valid String
0xaf key 0x4  Sont'nt'jf`nd'wkfdb 
0xaf key 0x5  Rnou&ou&kgaoe&vjgec 
0xaf key 0x6  Qmlv%lv%hdblf%uidf` 
0xaf key 0x7  Plmw$mw$iecmg$thega 
0xaf key 0x8  _cbx+bx+fjlbh+{gjhn 
0xaf key 0x9  ^bcy*cy*gkmci*zfkio 
0xaf key 0xa  ]a`z)`z)dhn`j)yehjl 
0xaf key 0xb  \`a{(a{(eioak(xdikm 
0xaf key 0xd  Zfg}.g}.coigm.~bomk 
0xaf key 0xe  Yed~-d~-`ljdn-}alnh 
0xaf key 0x12  Eyxb1xb1|pvxr1a}prt 
0xaf key 0x13  Dxyc0yc0}qwys0`|qsu 
0xaf key 0x16  A}|f5|f5xtr|v5eytvp 
0xaf key 0x17  @|}g4}g4yus}w4dxuwq 
0xaf key 0x18  Osrh;rh;vz|rx;kwzx~ 
0xaf key 0x1a  Mqpj9pj9tx~pz9iuxz| 
0xaf key 0x1c  Kwvl?vl?r~xv|?os~|z 
0xaf key 0x1e  Iutn=tn=p|zt~=mq|~x 
0xaf key 0x29  ~BCY
CY
GKMCI
ZFKIO 
0xaf key 0x2a  }A@Z    @Z    DHN@J    YEHJL 
0xaf key 0x5d  
67-~7-~3?97=~.2?=; 
0xaf key 0x5e      54.}4.}0<:4>}-1<>8 
0xcf key 0x22  A}|f5|f5t5apma95z{yl5t5apfa;;$'&! 
0xcf key 0x23  @|}g4}g4u4`ql`84{zxm4u4`qg`::%&'  
0xcf key 0x25  Fz{a2{a2s2fwjf>2}|~k2s2fwaf<<# !& 
0xcf key 0x28  Kwvl?vl?~?kzgk3?pqsf?~?kzlk11.-,+ 
0xcf key 0x2a  Iutn=tn=|=ixei1=rsqd=|=ixni33,/.) 
0xcf key 0x2b  Htuo<uo<}<hydh0<srpe<}<hyoh22-./( 
0xcf key 0x2c  Osrh;rh;z;o~co7;tuwb;z;o~ho55*)(/ 
0xcf key 0x2e  Mqpj9pj9x9m|am59vwu`9x9m|jm77(+*- 
0xcf key 0x2f  Lpqk8qk8y8l}`l48wvta8y8l}kl66)*+, 
0xcf key 0x32  Qmlv%lv%d%q`}q)%jki|%d%q`vq++4761 
0xcf key 0x33  Plmw$mw$e$pa|p($kjh}$e$pawp**5670 
0xcf key 0x34  Wkjp#jp#b#wf{w/#lmoz#b#wfpw--2107 
0xcf key 0x35  Vjkq"kq"c"vgzv."mln{"c"vgqv,,3016 
0xcf key 0x36  Uihr!hr!`!udyu-!nomx!`!udru//0325 
0xcf key 0x37  This is a text, only a test..1234      !!! Valid String
0xcf key 0x38  [gf|/f|/n/{jw{#/`acv/n/{j|{!!>=<; 
0xcf key 0x39  Zfg}.g}.o.zkvz".a`bw.o.zk}z  ?<=: 
0xcf key 0x3a  Yed~-d~-l-yhuy!-bcat-l-yh~y##<?>9 
0xcf key 0x3d  ^bcy*cy*k*~or~&*edfs*k*~oy~$$;89> 
0xcf key 0x3e  ]a`z)`z)h)}lq}%)fgep)h)}lz}''8;:= 
0xcf key 0x3f  \`a{(a{(i(|mp|$(gfdq(i(|m{|&&9:;< 
0xcf key 0x69  
67-~7-~?~*;&*r~102'~?~*;-*ppolmj 
0xcf key 0x6a      54.}4.}<})8%)q}231$}<})8.)ssloni
The above output was created using iheartxor (see below)

Yeah, that's a lot of extra data. We have a 456 bytes file with an output size of 2,090 bytes. That's a 400% increase in size. How can we slim this down? Let's say we did this on a large file such as Notepad.exe with the same data pasted at the end. The original Notepad.exe has a size of 69,280 bytes and the output is 117,400 bytes. That's a 69% increase. Who want's to look through that much data for a couple of XORed strings? This is exactly why people use string searches. To be continued...I think I have to go ask Reddit.

Introducing iheartxor
iheartxor is a tool that can be used to bruteforce xor encoded strings within a user defined data block via a regular expression pattern (-r). The default search pattern is a regular expression that searches for data between null bytes ('\x00'). The tool can also be used to do a straight xor on a file with -f file.name -k value. The value must between 0x0-0x255. The tool is still in development. Please leave comments or ideas. I'm still trying to figure out the best way to detect valid ASCII text without using a search strings such as xorsearch.


No comments:

Post a Comment