Automating AV signature generation

Hey all,
I finally got around to writing about our automated byte signature generator. It’s going to be a bird’s-eye view, so if you’re interested you’ll have to read Christian’s thesis (in German) or wait for our academic paper (in English) to be accepted somewhere.

First, some background: One of the things we’re always working on at zynamics is VxClass, our automated malware classification system. The underlying core that drives VxClass is the BinDiff 3 engine (about which I have written elsewhere). An important insight about BinDiff’s algorithms is the following:

BinDiff works like an intersection operator between executables.

This is easily visualized as a Venn diagram: Running BinDiff on two executables identifies functions that are common to both executables and provides a mapping that makes it easy to find the corresponding function in executable B given any function in A.

Two executables and the common code

This intersection operator also forms the basis of the similarity score that VxClass calculates when classifying executables. This means that the malware families that are identified using VxClass share code. (Footnote: It might seem obvious that malware families should share code, but there is a lot of confusion around the term “malware family”, and before any confusion arises, it’s better to be explicit)

So when we identify a number of executables as part of a cluster, what we mean is that code is shared pairwise — i.e. for each executable in the cluster, there is another executable in the cluster with which it shares a sizeable proportion of its code. Furthermore, the BinDiff algorithms provide us with a way of calculating the intersection of two executables. This means that we can also calculate the intersection of all executables in the cluster, and thus identify the stable core that is present in all samples of a malware family.
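As a rough illustration (my own sketch, not VxClass code): if each pairwise BinDiff result is represented as a dict mapping function addresses in one executable to the addresses of their matched counterparts in the next, the stable core falls out of chaining the matches — a function belongs to the core exactly when it survives every hop.

```python
def stable_core(first_functions, match_maps):
    """Functions of the first executable that survive a chain of pairwise
    matches A->B, B->C, ...  Each element of match_maps is assumed to be a
    dict mapping function addresses in one executable to the addresses of
    their matched counterparts in the next (a hypothetical representation
    of a .BinDiff result, not the real file format)."""
    surviving = {f: f for f in first_functions}   # address in A -> current address
    for matches in match_maps:
        # drop functions without a match, follow the rest into the next executable
        surviving = {orig: matches[cur]
                     for orig, cur in surviving.items() if cur in matches}
    return set(surviving)   # addresses (in A) of the stable core
```

With three executables where functions 1 and 2 of the first one are matched all the way through, `stable_core({1, 2, 3}, [m1, m2])` returns `{1, 2}`.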

What do we want from a “traditional” antivirus signature? We would like it to match on all known samples of a family, and we would like it to not match on anything else. The signature should be easy to scan for — ideally just a bunch of bytes with wildcards.

The bytes in the stable core form great candidates for the byte signature. The strategy to extract byte sequences then goes like this:

  1. Extract all functions in the stable core that occur in the same order in all executables in question
  2. From this, extract all basic blocks in the stable core that occur in the same order in all executables in question
  3. From this, extract (per basic block) the sequences of bytes that occur in the same order in all executables in question
  4. If any gaps occur, fill them with wildcards

Sounds easy, eh? 🙂 Let’s understand the first step in the process better by looking at a diagram:

Four executables and the results of BinDiff between them

The columns show four different executables – each green block represents one particular function. The black lines are “matches” that the BinDiff algorithms have identified. The first step is now to identify functions that are present in each of the executables. This is a fairly easy thing to do, and if we remove the functions that do not occur everywhere from our diagram, we get something like this:

Only functions that appear everywhere left

Now, we of course still have to remove functions that do not appear in the same order in all executables. The best way to do this is using a k-LCS algorithm.

What is a k-LCS algorithm? LCS stands for longest common subsequence: given two sequences over the same alphabet, an LCS algorithm finds the longest sequence that is a subsequence of both. LCS calculations form the backbone of the UNIX diff command line tool. A natural extension of this problem is finding the longest common subsequence of many sequences (not just two) – and this extension is called k-LCS.
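The two-sequence case can be sketched with the textbook dynamic program (purely illustrative — this is not the algorithm our tools use):

```python
def lcs(a, b):
    """One longest common subsequence of a and b, in O(len(a) * len(b))."""
    n, m = len(a), len(b)
    # dp[i][j]: length of an LCS of the prefixes a[:i] and b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # walk back through the table to recover one optimal subsequence
    out, i, j = [], n, m
    while i and j:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]
```

For example, `lcs("ABCBDAB", "BDCABA")` yields a common subsequence of length 4.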

This suffers from the slight drawback that k-LCS on arbitrary sequences is NP-hard — but in our particular case, the situation is much easier: We can simply put an arbitrary ordering on the functions, and our “k-LCS on sequences” gets reduced to “k-LCS on sequences that are permutations of each other” — in which case the entire thing can be efficiently solved (check Christian’s diploma thesis for details). The final result looks like this:

Functions that occur in the same order everywhere
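To see why the permutation case is tractable, here is a small reconstruction of the idea (my own sketch; for the real treatment, see the thesis): map each element to its vector of positions in the k sequences. A common subsequence is then exactly a chain of elements whose position vectors increase in every coordinate, and the longest such chain can be found with a simple O(k·n²) dynamic program.

```python
def klcs_permutations(seqs):
    """k-LCS for sequences that are permutations of one another.

    Every element gets a vector of its positions in the k sequences; a
    common subsequence corresponds to a chain of elements whose position
    vectors increase in every coordinate.  Longest chain via O(k * n^2) DP.
    """
    if not seqs or not seqs[0]:
        return []
    pos = [{x: i for i, x in enumerate(s)} for s in seqs]
    elems = list(seqs[0])          # iterate in first-sequence order
    n = len(elems)
    best = [1] * n                 # best[i]: longest chain ending at elems[i]
    prev = [-1] * n                # predecessor index, for reconstruction
    for i in range(n):
        for j in range(i):
            if (all(p[elems[j]] < p[elems[i]] for p in pos)
                    and best[j] + 1 > best[i]):
                best[i], prev[i] = best[j] + 1, j
    # reconstruct the longest chain
    i = max(range(n), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(elems[i])
        i = prev[i]
    return chain[::-1]
```

For instance, for the three orderings ABCD, ACBD and ABDC, the functions A, B, D appear in the same order everywhere, and the sketch returns exactly that chain.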

Given the remaining functions, the entire process can be repeated on the basic block level. The final result of this is a list of basic blocks that are present in all executables in our cluster in the same order. We switch to a fast approximate k-LCS algorithm on the byte sequences obtained from these basic blocks. Any gaps are filled with “*”-wildcards.
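Steps 3 and 4 can be sketched for the two-sample case as follows — a toy reconstruction in which Python’s difflib stands in for the fast approximate k-LCS mentioned above:

```python
from difflib import SequenceMatcher

def byte_signature(s1: bytes, s2: bytes, min_len: int = 4) -> str:
    """Toy version of steps 3-4 for two samples: find the byte runs the
    samples share, in order, and join them with '*' wildcards.  difflib's
    SequenceMatcher stands in for the fast approximate k-LCS; runs shorter
    than min_len are dropped as too unspecific to match on."""
    m = SequenceMatcher(None, s1, s2, autojunk=False)
    # get_matching_blocks() yields (i, j, size) triples in increasing order
    runs = [s1[i:i + n] for i, _, n in m.get_matching_blocks() if n >= min_len]
    return "*".join(run.hex() for run in runs)
```

Given two basic blocks that differ only in an embedded constant, e.g. `55 8b ec 83 ec 20 41 41 41 41 5d c3` and `55 8b ec 83 ec 20 42 42 42 42 5d c3`, this produces the wildcarded pattern `558bec83ec20*5dc3` (with `min_len=2`).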

The result is quite cool: VxClass can automatically cluster new malicious software into clusters of similarity – and subsequently generate a traditional AV signature from these clusters. This AV signature will, by construction, match on all members of the cluster. Furthermore it will have some predictive effect: The variable parts of the malware get whittled away as you add more executables to generate the signature from.

We have, of course, glossed over a number of subtleties here: It is possible that the signature obtained in this manner is empty. One also needs to be careful when dealing with statically linked libraries (otherwise the signature will have a large number of false positives).

So how well does this work in practice?

We will now go over a small case study: We throw a few thousand files into our VxClass test system and run it for a bit. We then take the resulting clusters and automatically generate signatures from them. Some of the signatures can be seen here — they are in ClamAV format, and of course they work on the unpacked binaries — but any good AV engine has a halfway decent unpacker anyhow.
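For readers who haven’t seen the format: a ClamAV extended signature (.ndb) is a single line of the form MalwareName:TargetType:Offset:HexSignature, where `*` matches any number of bytes and `??` matches a single arbitrary byte. A hypothetical example (not one of our generated signatures):

```
Worm.Example-1:1:*:558bec83ec20*5dc3
```

Here TargetType 1 means PE executables, and the `*` offset means the pattern may occur anywhere in the scanned file.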

I will go through the process step by step for one particular cluster. The cluster itself can be viewed here. A low-resolution shot of it would be the following:

The cluster we're going to generate a signature for

So how do the detection rates for this cluster in traditional AVs look? Well, I threw the files into VirusTotal and created the following graphic indicating the detection rates for these files: Each row in this matrix represents an executable, and each column represents a different antivirus product (I have omitted the names). A yellow field at row 5, column 10 means “the fifth executable in the cluster was detected by the tenth AV”; a white field at row 2, column 1 means “the second executable in the cluster was not detected by the first AV”.

Detection matrix for 32 samples from the cluster (rows are executables, columns are AV engines)

What we can see here is the following:

  1. No Antivirus product detects all elements of this cluster
  2. Detection rates vary widely for this cluster: Some AVs detect 25 out of 32 files (78%), some … 0/32 (0%)

If we inspect the names along with the detection results (which you can do in this table), we can also see which different names are assigned by the different AVs to this malware.

So, let’s run our signature generator on this cluster of executables.

time /opt/vxclass/bin/vxsig "7304 vs 9789.BinDiff" "9789 vs 10041.BinDiff" "10041 vs 10202.BinDiff" "10202 vs 10428.BinDiff" "10428 vs 10654.BinDiff" "10654 vs 10794.BinDiff" "10794 vs 11558.BinDiff" "11558 vs 11658.BinDiff" "11658 vs 12137.BinDiff" "12137 vs 12434.BinDiff" "12434 vs 12723.BinDiff" "12723 vs 13426.BinDiff" "13426 vs 13985.BinDiff" "13985 vs 13995.BinDiff" "13995 vs 14007.BinDiff" "14007 vs 14023.BinDiff" "14023 vs 14050.BinDiff" "14050 vs 14100.BinDiff" "14100 vs 14107.BinDiff" "14107 vs 14110.BinDiff" "14110 vs 14145.BinDiff" "14145 vs 14235.BinDiff" "14235 vs 14240.BinDiff" "14240 vs 14323.BinDiff" "14323 vs 14350.BinDiff" "14350 vs 14375.BinDiff" "14375 vs 14378.BinDiff" "14378 vs 14415.BinDiff" "14415 vs 14424.BinDiff" "14424 vs 14486.BinDiff" "14486 vs 14520.BinDiff" "14520 vs 14549.BinDiff" "14549 vs 14615.BinDiff" "14615 vs 14700.BinDiff"

The entire thing takes roughly 40 seconds to run. The resulting signature can be viewed here.

So, to summarize:

  1. Using VxClass, we can quickly sort new malicious executables into clusters based on the amount of code they share
  2. Using the results from our BinDiff and some clever algorithmic trickery, we can generate “traditional” byte signatures automatically
  3. These signatures are guaranteed to match on all executables that were used in the construction of the signature
  4. The signatures have some predictive power, too: In a drastic example we generated a signature from 15 Swizzor variants that then went on to detect 929 new versions of the malware
  5. These are post-unpacking signatures — i.e. your scanning engine needs to do a halfway decent job at unpacking in order for these signatures to work

If you happen to work for an AV company and think this technology might be useful for you, please contact us 🙂

23 Responses to “Automating AV signature generation”

  1. This is awesome. The key is that this works on unpacked binaries. How about encoded shellcode in unpacked binaries…would this method be able to detect such stub programs?

  2. Chopstick says:

    For encoded shellcode, I guess at the very least it may detect the encoding algorithm. Not much to go on, but might be enough. I’m guessing malware authors will often reuse bits of tough to write code.

    • Thomas Dullien says:


      yeah, if we do not unpack, the above algorithms generate signatures for the unpacker stub…

  3. Jamie Butler says:


    Great work as always! Have you automated steps 1 and 2 so that if a binary is submitted a byte sequence/”signature” is returned? This technology would definitely make AV more efficient and help with accuracy/completeness.

    Hope to see you at CanSecWest.

    • Thomas Dullien says:

      Hey Jamie,

      glad you like it 🙂 — yes, the first two steps are automated as well. We don’t do “submit – return sig” yet, but instead you can do “submit – get cluster – get signature for cluster” 😉


      • Jamie Butler says:

        Cool. Are the byte sequences returned register neutral? I know that you have focused on that before when doing your difference analysis, but obviously that would mean the byte signature would have to be translated into its literal permutations for AV to search for it.


    • Thomas Dullien says:

      Since we whittle away all the variation between the samples, the sequences will end up register neutral if two executables in the cluster happened to get very different register assignments.

  4. DumDum says:

    Hm. Metamorphic == game over, right?

    Also: what kind of checks does VxClass perform during the building of said clusters, to maybe say “hm. Maybe I should put sample B aside for a second, come back to it in a 2nd round of analysis . . . ”

    Looking at your ABCD diagram: If, by step 2, VxClass had separated B from the sample set, you’ll end up with a higher-accuracy sig for ACD — you can always then come back, pick B again, and generate a “lesser quality/more general” signature.

    • Thomas Dullien says:

      Hey DumDum,

      metamorphic does not necessarily mean “game over”. You can have a metamorphic engine that still generates stable patterns.

      Regarding the logic for splitting/reordering clusters: If you come up with an empty signature, it is relatively easy to quantify which samples created large “losses”, and remove them first.

  5. iVictor says:


    Not sure if you are aware, but Cisco CSA has been performing this functionality for a considerable time. Is there something distinct about this? I’d like to know.

    Good information and analysis though.

    Best Regards,

    • Thomas Dullien says:

      Hey iVictor,

      I have been around the ASA code base a fair bit, and as far as I know it doesn’t do anything comparable. Which exact feature are you referring to? 🙂


  6. Thomas, he said CSA (cisco security agent), which is HIPS, and you wrote ASA, a firewall. So you appear to be talking about two different things.

    Thanks for the blog, it’s interesting to see automated approaches, and as I get more interested in malware analysis, such processes are going to become more relevant.

    Curt Wilson

    • Thomas Dullien says:

      Ah, right, my bad. What feature in CSA does this though?

      Isn’t CSA the artist formerly known as Okena ?

  7. iVictor says:

    Hey Thomas,

    CSA has had this functionality for a couple of years now — automatic signature generation for MSRPC, LPC and exception-based exploits. CSA detected Conficker in 15 seconds by automatically generating a signature for it. It’s just that there’s no article or blog post about it.

    Best Regards,

    • Thomas Dullien says:

      Hey iVictor,

      are those behavioral signatures or byte signatures? I thought that Okena/CSA were behavior-monitoring tools…


    • Thomas Dullien says:

      Ok, I read up on the topic. It is correct that CSA has some functionality to generate byte signatures for shellcodes that were caught executing in MSRPC and LPC attacks.

      The way I read the documentation on it, things seem to work like this:

      1) The CSA tries to automatically detect intrusions that target specific RPC calls through the usual behavior-based heuristics.
      2) The CSA inspects arguments to RPC/LPC calls. If it decides an attack is running, it uses the arguments to the RPC/LPC call as a signature. If it detects multiple attacks, the k-LCS of the arguments is used as the new signature
      3) The signatures are then created locally (and shared locally) and finally pushed “up” in the hierarchy

      The commonalities between this and what we’re doing are limited though. Admittedly, both approaches use a k-LCS algorithm somewhere in the construction of a signature, but aside from that, things are quite different: The CSA approach can only generate signatures for arguments to RPC/LPC calls, and doesn’t really know how to tell data from code etc. — it is (technically) not much more than the things discussed in Kephart and Arnold’s 1994 VB paper. But point taken: For RPC/LPC-based attacks, the CSA approach should yield excellent results.

    • Thomas Dullien says:

      So the question was: What is distinct between the two approaches 🙂

      1) The CSA has access to all arguments that are received upon invocation of an RPC or LPC call, and attacks for the same underlying vulnerability are very likely to have common subfeatures.
      2) The CSA doesn’t operate on executable data at all — it can inspect all arguments (including the string that is overflowing a buffer etc.) and then determine a “common pattern” in a number of potentially different attacks on the same vulnerability.
      3) Since there is no code, and the arguments to RPC or LPC calls are going to be comparatively short, CSA can just compute a kLCS over the strings themselves.
      4) The signatures that are thus generated will prevent the exploitation of a particular bug.

      The risk of false positives is relatively low in CSA, too, as there is usually less RPC traffic than files in a system, and malicious arguments to RPC calls look -really- different from proper arguments (much longer, differently shaped etc.)

      Our stuff is in a quite different situation to start out with: We have a bunch of -executables- (potentially of large size) on which the calculation of a string-based k-LCS is not feasible computationally. These executables can be megabytes in size, but aside from that, they are not much different from “legitimate” executables.

      The demand on low false-positive rates is much higher for AV signatures: You scan many more files than RPC requests, and the “bad” pieces of code are much more similar to the “good” pieces of code than in the RPC case.

      So, in short: We face a problem that is a fair bit harder, and thus have to roll out more algorithmic “big guns”.

  8. caf says:

    It seems like you could potentially get much better results if you didn’t have to reduce signatures down to the simplistic “bytes and wildcards” format.

    Have you put any thought into writing an AV engine that used the BinDiff algorithm internally, so that the signatures could remain in BinDiff-internal-representation format?

    • Thomas Dullien says:

      Hey caf,

      it is true that a BinDiff-style algorithm would be even more resilient — but the advantage of regular byte signatures is that all the AVs -can- already scan for them, whereas it is unclear how BinDiff-style algorithms can be made to run at speeds that are comparable to byte matching…


  9. Hi,

    Sounds like great work! Some months ago I started to code a similar tool, which works with fuzzy hashing and supports packer detection. The main scope of this tool was precisely to first identify common static patterns, then successively filter and refine the results to avoid FPs deriving from binaries that have particular common structures, like Delphi/VisualBasic ones.

    again Great Work Thomas 🙂

    Giuseppe ‘Evilcry’ Bonfa’

  10. […] investigate this, we have to remember another blog post where I described some of our results on generating “smart” signatures (this appears to […]

  11. […] blogged about VxClass and our algorithms for automated generation of byte signatures here and here and here. I have also blogged about private signatures beforehand, a concept that I think has […]
