The first step is to apply tokenization to the stream of data to break it up into a smaller list of tokens. In the sentence above, each element separated by whitespace or punctuation becomes a token. The second step is to tag each token with its part of speech (POS). In summary, the sub-strings need to be broken up into tokens using a simple heuristic or pattern, and then each token needs to be tagged with its part of speech.
If we apply tokenization to the API GetProcAddress, using an uppercase letter to delimit the start of each token, we get the tokens ["Get", "Proc", "Address"]. A regular expression to search for this pattern would be re.findall('[A-Z][^A-Z]*', "GetProcAddress"). This works for camel-cased API names, but not all APIs follow this naming convention. Some API names are all lowercase but still contain sub-strings that need to be broken up into tokens; most Windows Sockets APIs are lowercase, for example getpeername, getsockopt and recv. Then some API names contain one all-uppercase sub-string, followed by an underscore, then lowercase, and then back to camel case. An example of this WTF style can be seen in FONTOBJ_cGetAllGlyphHandles. For each naming convention a different tokenizer would need to be written.
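To sketch what a combined tokenizer might look like, here is some rough Python that handles the three conventions above. The KNOWN_WORDS list is a made-up stand-in for illustration; a real list of sub-strings would be much larger (see the JSON files mentioned below).

```python
import re

# Made-up word list for illustration only; the real sub-strings
# would come from a much larger list
KNOWN_WORDS = ["get", "peer", "name", "sock", "opt", "recv"]

def tokenize_api(name):
    """Split an API name into tokens based on its naming convention."""
    # Mixed style, e.g. FONTOBJ_cGetAllGlyphHandles: split on the
    # underscore and tokenize each piece separately
    if "_" in name:
        tokens = []
        for part in name.split("_"):
            tokens.extend(tokenize_api(part))
        return tokens
    # An all-uppercase sub-string such as FONTOBJ stays one token
    if name.isupper():
        return [name]
    # Camel case, e.g. GetProcAddress -> ["Get", "Proc", "Address"]
    # (this also drops Hungarian prefixes like the "c" in cGetAll...)
    if not name.islower():
        return re.findall('[A-Z][^A-Z]*', name)
    # All lowercase, e.g. getpeername: greedily match known words
    tokens, i = [], 0
    while i < len(name):
        for word in sorted(KNOWN_WORDS, key=len, reverse=True):
            if name.startswith(word, i):
                tokens.append(word)
                i += len(word)
                break
        else:
            tokens.append(name[i:])  # unknown remainder, keep as-is
            break
    return tokens

print(tokenize_api("GetProcAddress"))               # ['Get', 'Proc', 'Address']
print(tokenize_api("getpeername"))                  # ['get', 'peer', 'name']
print(tokenize_api("FONTOBJ_cGetAllGlyphHandles"))  # ['FONTOBJ', 'Get', 'All', 'Glyph', 'Handles']
```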
For the second step, the POS of "Get" would be a verb, "Proc" would be unknown and "Address" would be a noun. Without the previous step of tokenizing the API we would not be able to get the POS, and without the POS we would not know where each token is supposed to reside in a sentence. Our GetProcAddress example deviates from the normal NLTK process of tagging because we are not looking up the POS in the context of a sentence; rather, we are getting the POS of individual strings. A way around this is to access Princeton's WordNet dictionary, where we can look up a word's definitions and POS. An issue with this approach is that it will give us multiple definitions.
If we choose the most common POS, we can remove the manual aspect of tagging the token.
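Here is a rough sketch of that lookup using NLTK's WordNet interface (it assumes the wordnet corpus has been downloaded with nltk.download('wordnet')). It counts the POS across all of a word's synsets and picks the winner, mapped to the POS names used in this post:

```python
from collections import Counter
from nltk.corpus import wordnet

# Map WordNet's single-letter tags to the names used in this post
POS_NAMES = {"n": "noun", "v": "verb", "a": "adjective",
             "s": "adjective", "r": "adverb"}

def most_common_pos(token):
    """Return the most frequent WordNet POS for a token, or None."""
    synsets = wordnet.synsets(token.lower())
    if not synsets:
        return None  # non-dictionary word, e.g. "proc"
    counts = Counter(s.pos() for s in synsets)
    return POS_NAMES[counts.most_common(1)[0][0]]

for token in ["get", "proc", "address"]:
    print(token, most_common_pos(token))
# e.g. get -> verb, proc -> None; exact results depend on WordNet
```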
Since "proc" is not a valid dictionary we wouldn't have a means to label the POS and assign a meaning. In this situation we would have to manually tag the tokens. During my research I found around two hundred non-dictionary words that were manually labeled. Here is example of twenty five of them.
Now that we have the POS for the tokens, we can start applying those forgotten rules of grammar to form a sentence. Here are some rules I found during my research that I used to structure my sentences (a sketch of the ordering logic follows the list).
Sentence rules and order:
- Adjectives come before the noun or after the verb
- Adverbs of frequency usually come before the main verb
- Prepositions follow the pattern verb, preposition, noun
- Pronouns come before the noun
- Nouns and objects go last.
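Here is the promised sketch of the ordering logic. The bucket weights are one crude reading of the rules above; a stable sort keeps the original token order within each bucket:

```python
# One crude reading of the rules above: adverbs before the verb,
# verb-preposition-noun ordering, pronouns and adjectives before
# the noun (and adjectives after the verb), nouns last
POS_ORDER = {"adverb": 0, "verb": 1, "preposition": 2,
             "pronoun": 3, "adjective": 4, "noun": 5}

def build_sentence(tagged):
    """tagged is a list of (word, pos) pairs; returns a crude sentence."""
    # sorted() is stable, so tokens with the same POS keep their order
    ordered = sorted(tagged, key=lambda t: POS_ORDER.get(t[1], 5))
    words = [word for word, _ in ordered]
    return " ".join(words).capitalize() + "."

tagged = [("get", "verb"), ("procedure", "noun"), ("address", "noun")]
print(build_sentence(tagged))  # Get procedure address.
```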
The code can be downloaded from my Bitbucket repo. I would highly recommend checking for updates often; parsing and tokenizing the different API naming schemes has made for some interesting one-off bugs. Plus I frequently update the JSON files, which contain all the sub-strings, their POS and a replacement string. The format for the individual elements in the JSON is string: [POS, replacement], for example "ghost": ["noun", "None"]. All strings need to be lowercase, except the replacement string, which needs to be "None" if you do not want the token to be replaced.
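As a hedged sketch of how those JSON entries could be consumed (the filename here is a placeholder; check the repo for the real files):

```python
import json

# Placeholder filename for illustration; the real JSON files
# are in the repo
with open("words.json") as fh:
    WORDS = json.load(fh)

def lookup(token):
    """Return the (POS, display word) for a token from the JSON table."""
    pos, replacement = WORDS.get(token.lower(), [None, "None"])
    # "None" is a literal string in the file format, meaning the
    # token should be printed as-is rather than replaced
    word = token.lower() if replacement == "None" else replacement
    return pos, word
```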
Thanks to Philip Porter for sending over his list of the most common APIs used by malware.
There is also a known issue where API names that do not contain a noun will not be printed; it's hard to make a sentence without an object. A good example is "VirtualAlloc": "Virtual" is an adjective and "Alloc", which is short for allocate, is a verb. I'll be looking into the best way to fix this soon, but for now enjoy. Also, don't expect perfect grammar or even intelligent text ;)
Please feel free to leave comments, send me an email (the address is in the source code) or ping me on Twitter.