Archive for the ‘VxClass’ Category

VxClass 1.5

2011-02-15

Today, we release a new version of our malware clustering solution: zynamics VxClass 1.5[1].

Compared to the previous release (VxClass 1.3)  and aside from fixing tons of bugs, we improved version 1.5 with new and upgraded system software, updated the BinDiff-based differ component and finally switched to IDA Pro 6.0.

Here are some more things we changed:

  • Updated the base system to the new Debian 6.0 (“Squeeze”)
  • Upgraded to Python 2.7.1 for VxClass base code
  • Upgraded to IDA Pro 6.0 and IDAPython 1.4.3, pefile 1.2.10-96
  • Updated the diffing engine to the newest BinDiff version
  • Performance improvements when using VxClass in a cluster
  • Huge performance improvements: A single machine can now easily process 8000 samples a day.
  • New top-10 most similar visualization (more below)

And of course this release also includes the signature generation component Thomas blogged about here, here and here.

With improved performance, mining the data of a VxClass system that contains tens of thousands of malware samples provides more and more insight. Unfortunately, up until now, this has been more difficult than necessary. For example, answering questions like “What are the 10 most-similar malware samples for this particular sample?” usually meant writing custom Python scripts that access the built-in XMLRPC interface.
Starting with this release, we will continue to add more visualization options for the data in the system. We start off simple with the top 10 most-similar and Venn diagram visualizations.

Showing the top 10 most-similar samples for a malware binary

When using the “family tree” applet that has been present in VxClass since the first release, another “caveat” of being able to process tens of thousands of samples becomes apparent: humans cannot visually process several thousand clusters organized in a graph. Thus, the next release will contain an alternate visualization for the “family tree”. I’ll leave the details to another blog post.

All existing customers within their support periods will receive an e-mail regarding the upgrade in the next few days.

[1] Following the naming tradition of Ubuntu, VxClass releases are code-named with an alliteration. This release is code-named “exploited emu” which follows after version 1.3’s “malicious monkey”.

Memoryze + VxClass vs Zeus

2011-01-27

I have previously blogged about VxClass and our algorithms for automated generation of byte signatures here and here and here. I have also blogged about private signatures beforehand, a concept that I think has great relevance for defense against (and response to) targeted attacks. One point I left open in the previous blog post was the following question:

How do I actually use the byte signatures in a real-world scenario?

In this post, I will answer that question: We will use the latest version of Mandiant’s Memoryze and VxClass and walk through the entire process of memory acquisition, malware classification, signature generation and signature deployment. In detail, our steps are going to be the following:

  1. We will pre-populate a VxClass system with a good quantity of Zeus samples
  2. We will examine the clusters generated from this
  3. We infect a previously clean XP with a new Zeus sample
  4. We identify the suspicious memory sections that were created by Zeus using Memoryze and AuditViewer, two great (and free!) tools that Mandiant has released
  5. We acquire the injected memory sections from the infected system and submit them to VxClass
  6. VxClass recognizes the similarity to previously submitted Zeus samples and generates a byte signature to detect the entire cluster of both old and new Zeus variants. This signature is unique to us, and can be used to detect infections by this malware on any Windows machine.
  7. The signature is then fed into Memoryze to identify the infection on other machines

Wow, that’s quite a list. So where do we start? We start out by examining a small set of about 180 malware samples that we pre-populated a fresh VxClass with. These samples were labeled “Zeus” or “Zbot” by some anti-virus software, so we assume they are Zeus variants.VxClass has generated family trees, and assigned most of the files to these trees.

An overview of the family trees in the system

In the next two pictures, it will become evident what the “similarity score” between two samples means: It indicates how much overlap there is between the code of the two executables under consideration. To further illustrate the issue, we drew some Venn diagrams illustrating the overlap between the highlighted samples on the right hand side.

Two pieces of malware and how they overlap

The next screenshot shows where in the family tree these two items are located (you might have to load the non-thumbnail version to actually see this):

But enough of this. As a next step, I infect a vanilla XP machine with a new random sample that was labelled Zeus — one that VxClass has never seen before. After having done this, I use Memoryze (and AuditViewer, a nice GUI for it) to identify the memory sections that have been injected into various processes. To do this, you have to click through a few dialogs that ask you to configure the following things:

  1. The location of Memoryze
  2. The output path for the data
  3. Whether you wish to work from live memory (yes)

In the first step, we just want to identify those processes that Zeus injected itself into – we will acquire the injected memory separately. We hence check “Process Enumeration” and leave the “Acquisition” fields unchecked. Finally, we enable “Memory Sections” and “Detect Injected DLLs” to be enumerated for each process. AuditViewer now launches Memoryze, and after a brief waiting period we can inspect the results. AuditViewer is nice enough to highlight the processes that contain suspicious injected DLLs in red. Furthermore, we can immediately spot the problematic section of memory, because AuditViewer has highlighted it, too.

AuditViewer highlights the suspicious memory regions

In the next step, we simply upload the .VAD file of this memory section to our VxClass box and look whether this is in any way similar to the code already in the box. And no surprise: It is quite similar to a number of other executables in the database:

The .VAD file was clustered close to other executables

In the next step, we will want to generate a traditional byte signature for all executables in the relevant cluster. This is pretty easy — highlight them, assign a tag to them, and then right-click “create signature” (screenshot below).

Making VxClass generate a signature automatically

A few seconds later, the signature is ready and can be downloaded (in ClamAV format) from the “Signatures” tab in the VxClass web UI. The generated signature is the following:

cluster.C.worm:0:*:81ec*000000*8b75*ff75*
56ff15*0033*8945*0f84*558bec83ec10565733f668*
0033ff8975fcff15*0033c08945f4393d*0076*463b35*
0072c7*8b7d*8b7d08e8*837d*007413*ff7608e8*
feffff8bc889*85c9*3c20740b*013845fe7444*
c645fe00e9e1000000*558bec81ec*020000*535633*
0000*0033*8d85*fdffff*5089*ff15*8d85*b301*
83f907750d*0085c075*6a07585068*8b7d188d043b8d440014*
8bf08975*020000*8365*00566a0033c0c645ff00e8*
06000085c00f85*010000*0fb645fd0fb70cc5*008b3d

This is clearly much longer than what would be strictly required – we usually generate much longer signatures than the bare minimum in order to minimize the risk of false positives. The cool thing about the signature is that it is private — e.g. no other user of VxClass would get the same signature (to be precise: the probability of another user getting the same signature is astronomically small). This means that unless you share this signature (like I did above), it remains as a “secret weapon” in your arsenal: The malware authors do not know what signature you are going to detect them with, so they can’t intentionally “break” this signature.

Now, an important question remains unanswered:

How does one deploy such a signature?

We have two options: We can use AuditViewer to configure Memoryze again, or we can simply run Memoryze from the command line with the appropriate configuration file to scan through the physical memory of a machine. In order to use AuditViewer, simply configure it as usual: Provide it with the location of Memoryze and tell it to analyze live memory. After you have done so, you mark the “enumerate processes” checkbox, and finally, you tell AuditViewer about the byte pattern you wish to search for:

Configuring AuditViewer to scan for a signature

AuditViewer will then launch Memoryze in the background which will scan through all processes and identify those that contain the pattern in question.

AuditViewer saves the current configuration for Memoryze in a file called out.xml in your Memoryze directory. This means you can simply make a copy of out.xml in your Memoryze directory and re-use it on other machines without having to re-run AuditViewer. Simply install Memoryze on a machine and then launch “Memoryze -script out.xml -o <outputdir>“.

You now have a new way of detecting variants of the malware that was used to attack you. But best of all: This method is secret – the signature isn’t shared with the wider world – and the attacker therefore has a much harder time immunizing his attack tools prior to the next attack.

To summarize: Combining VxClass with Memoryze and AuditViewer makes the acquisition and correlation of malicious code easy – but best of all, these tools also provide a quick and convenient way to automatically generate high-quality detection mechanisms that are kept secret from the attackers.

ReCon slides – “Packer Genetics: The Selfish Code” & Bochs+Python

2010-07-16

A few days ago Jose and Ero presented in ReCon some of the latest ideas they have been working on regarding unpacking. We have put our slides up for your viewing pleasure here:

Our slides are also available for download here. Beware that they are merely a visual aid to our live presentation. We will try to remember to announce when the ReCon video comes out so you can follow them there.

In addition, Jose will be presenting on the topic in SysCan Taipei on August 20th. That will be another good chance to catch the info fresh and live.

Bochs and Python

Bochs and our custom Python extensions were one of the fundamental tools onto which we built our research.

Ero has been keeping the Python extensions up to date for a few years and they are something we use a lot at zynamics. We have attempted to make them public in a few occasions (an old patch is available in the Bochs mailing list) but those attempts failed to make them known to more users. We are frequently reminded at conferences that people would love to play with them, so this time we are making them available through a zynamics GitHub project. The plan is to keep them in sync with all major releases of Bochs. In the GitHub page you can find basic instructions on how to get them working. The patch to apply to the current public version of Bochs (2.4.5 at this time) can be found here

We will add usage examples to the GitHub wiki as time allows. Also if there are special requests we will try to provide exemples on how to use the extensions for those cases. Download them, play with them and let us know your thoughts.

Las Vegas & the zynamics team

2010-07-14

Along with RECon, the single most important date in the reverse engineering / security research community is the annual Blackhat/DefCon event in Las Vegas. Most of our industry is there in one form or the other, and aside from the conference talks, parties and award ceremonies, there’s also a good amount of technical discussions (in bars or elsewhere) that takes place.

This year, a good number of researchers/developers from the zynamics Team will be present in Las Vegas — alphabetically, the list is:

  1. Ero Carrera
  2. Thomas Dullien/Halvar Flake
  3. Vincenzo Iozzo
  4. Tim Kornau

So, if you wish meet any of the team to discuss reverse engineering, our technologies, our research, or the performance of the Spanish or German football team at the last world cup, do not hesitate to drop an email to info@zynamics.com — Vegas is always chaotic, and scheduling a meeting will minimize stress for everyone that is involved.

Specifically, the following topics are specifically worth meeting over:

  1. Chat with Ero over our unpacking engine (just presented at RECon) — and how it fits into the larger scheme of things (e.g. VxClass)
  2. Meet with Tim or Vincenzo to discuss automated gadget-finding for ROP, or anything involving the ARM/REIL translations
  3. Meet with Thomas/Halvar to discuss VxClass, automated malware clustering, automated generation of “smart” malware signatures etc.

Aside from this, if you are interested in …

  • … boosting your reverse engineering performance by porting symbols from FOSS software into your closed-source disassemblies (BinDiff)
  • … becoming faster at finding bugs by leveraging differential debugging, the REIL intermediate language and static analysis frameworks (BinNavi)
  • … enhancing team-based reverse engineering by pooling accumulated knowledge and sharing information (BinCrowd)
  • … automatically correlating and clustering malware and forensically obtained memory dumps, and automatically deriving detection mechanisms (VxClass)
  • … analyzing malicious PDF files including the embedded JavaScript code (PDF Dissector)

then do not hesitate to drop us mail — we’ll gladly show/explain what our tools/technologies can do.

See you there !

Exploring malware relations

2010-04-13

Peter Silberman and I will be presenting, this Wednesday at BlackHat Europe 2010, our latest peek into the malware world.

In this talk we follow up from our previous “State of Malware: Explosion of the Axis of Evil” , where we explored some of the idiosyncrasies of mass-malware vs. the ones in targeted malware. This time around we decided to try to spot actual code-sharing among a reasonably diverse corpus of recent malware obtained from VirusTotal.

Using some of the technology we’ve developed in zynamics to automatically classify malware we attempted to find relations between malware samples that would indicate that code had been shared, possibly indicating collaborative efforts of malware developers or even giving clues about common authorship.

If you want to hear what it is that we found out, don’t hesitate in dropping by our talk Wednesday 16:45h

See you there!

Challenging conventional wisdom on AV signatures (Part 1 of 2)

2010-04-02

We have all been taught (and intuitively felt) that traditional antivirus signatures are, for the most part, a waste of time. I think I have myself argued something similar repeatedly. One could say that “byte signatures don’t work” is accepted conventional wisdom in the security industry. Especially in the light of the recent (and much-publicized) Aurora-attacks, this conventional wisdom appears to ring truer than ever.

One thing though that I have personally always liked about the security industry was the positive attitude towards challenging conventional wisdom — re-examining the assumptions underlying this wisdom. In this post (and the upcoming sequel), I will do just that: I will examine the reasons why everyone is convinced that traditional byte signatures do not work and ask questions about the assumptions that lead us to this conclusion.

So. Why do we think that traditional antivirus signatures are a waste of time ?

Let’s first recapitulate what the usual cycle in a targeted attack consists of:

  1. The attacker writes or obtains a backdoor component
  2. The attacker writes or obtains an exploit
  3. The attacker tests both exploit and backdoor against available AV tools, making sure that both are not detected
  4. The attacker compromises the victim and starts exfiltrating data
  5. The defender notices the attack, passes the backdoor to the AV company, and cleans up his network
  6. The AV company generates a signature and provides it to both the attacker and defender
  7. Goto (2)

This entire cycle can be clarified with a few pictures:

Let’s look at this diagram again. What properties of “byte signatures” does the attacker exploit in immunizing his software ? Well … none, really, except the fact that he fully knows about them before launching his attack. There is no information asymmetry: The attacker has almost the exact same information that the defender has. Through this, he is provided with a virtually limitless number of trial runs of his attack, and he can adapt his attack arbitrarily, over long time periods, until he is reasonably certain that it will be both successful and undetected.

The implication of this is that the underlying problem is not a feature of byte signatures, but rather a generic problem inherent in all security systems that provide identical data to both attackers and defenders and that have no information asymmetry at all.

In the next post, I will examine two approaches how this problem could be addressed to give the defender an advantage.

VxClass, automated signature generation, RSA 2010

2010-02-24

Everybody is convening in San Francisco next week for RSA2010 it appears — the big annual cocktail & business card exchange event. If you are interested in any of our technology (automated malware classification, automated signature generation, BinDiff, BinNavi) and would like to meet up with me, please contact info@zynamics.com 🙂

Automating AV signature generation

2010-02-17

Hey all,
I finally get around to writing about our automated byte signature generator. It’s going to be a bird’s eye view, so if you’re interested you’ll have to read Christian’s thesis (in German) or wait for our academic paper (in English) to be accepted somewhere.

First, some background: One of the things we’re always working on at zynamics is VxClass, our automated malware classification system. The underlying core that drives VxClass is the BinDiff 3 engine (about which I have written elsewhere). An important insight about BinDiff’s algorithms is the following:

BinDiff works like an intersection operator between executables.

This is easily visualized as a Venn diagram: Running BinDiff on two executables identifies functions that are common to both executables and provides a mapping that makes it easy to find the corresponding function in executable B given any function in A.

Two executables and the common code

This intersection operator also forms the basis of the similarity score that VxClass calculates when classifying executables. This means that the malware families that are identified using VxClass share code. (Footnote: It might seem obvious that malware families should share code, but there is a lot of confusion around the term “malware family”, and before any confusion arises, it’s better to be explicit)

So when we identify a number of executables to be part of a cluster, what we mean is that pairwise, code is shared — e.g. for each executable in the cluster, there is another executable in the cluster with which it shares a sizeable proportion of the code. Furthermore, the BinDiff algorithms provide us with a way of calculating the intersection of two executables. This means that we can also calculate the intersection of all executables in the cluster, and thus identify the stable core that is present in all samples of a malware family.

What do we want from a “traditional” antivirus signature ? We would like it to match on all known samples of a family, and we would like it to not match on anything else. The signature should be easy to scan for — ideally just a bunch of bytes with wildcards.

The bytes in the stable core form great candidates for the byte signature. The strategy to extract byte sequences then goes like this:

  1. Extract all functions in the stable core that occur in the same order in all executables in question
  2. From this, extract all basic blocks in the stable core that occur in the same order in all executables in question
  3. From this, extract (per basic block) the sequences of bytes that occur in the same order in all executables in question
  4. If any gaps occur, fill them with wildcards

Sounds easy, eh ? 🙂 Let’s understand the first step in the process better by looking at a diagram:

Four executables and the results of BinDiff between them

The columns show four different executables – each green block represents one particular function. The black lines are “matches” that the BinDiff algorithms have identified. The first step is now to identify functions that are present in each of the executables. This is a fairly easy thing to do, and if we remove the functions that do not occur everywhere from our diagram, we get something like this:

Only functions that appear everywhere left

Now, we of course still have to remove functions that do not appear in the same order in all executables. The best way to do this is using a k-LCS algorithm.

What is a k-LCS algorithm ? LCS stands for longest common subsequence – given two sequences over the same alphabet, an LCS algorithm attempts to find the longest subsequence of both sequences. LCS calculations form the backbone of the UNIX diff command line tool. A natural extension of this problem is finding the longest common subsequence of many sequences (not just two) – and this extension is called k-LCS.

This suffers from the slight drawback that k-LCS on arbitrary sequences is NP-hard — but in our particular case, the situation is much easier: We can simply put an arbitrary ordering on the functions, and our “k-LCS on sequences” gets reduced to “k-LCS on sequences that are permutations of each other” — in which case the entire thing can be efficiently solved (check Christian’s diploma thesis for details). The final result looks like this:

Functions that occur in the same order

Functions that occur in the same order everywhere

Given the remaining functions, the entire process can be repeated on the basic block level. The final result of this is a list of basic blocks that are present in all executables in our cluster in the same order. We switch to a fast approximate k-LCS algorithm on the byte sequences obtained from these basic blocks. Any gaps are filled with “*”-wildcards.

The result is quite cool: VxClass can automatically cluster new malicious software into clusters of similarity – and subsequently generate a traditional AV signature from these clusters. This AV signature will, by construction, match on all members of the cluster. Furthermore it will have some predictive effect: The variable parts of the malware get whittled away as you add more executables to generate the signature from.

We have, of course, glossed over a number of subtleties here: It is possible that the signature obtained in this manner is empty. One also needs to be careful when dealing with statically linked libraries (otherwise the signature will have a large number of false positives).

So how well does this work in practice ?

We will go over a small case study now: We throw a few thousand files into our VxClass test system and run it for a bit. We then take the resulting clusters and automatically generate signatures from them. Some of the signatures can be seen here — they are in ClamAV format, and of course they work on the unpacked binaries — but any good AV engine has  a halfway decent unpacker anyhow.

I will go through the process step-by step for one particular cluster. The cluster itself can be viewed here. A low-resolution shot of it would be the following:

The cluster we're going to generate a signature for

So how do the detection rates for this cluster in traditional AVs look ? Well, I threw the files into VirusTotal, and created the following graphic indicating the detection rates for these files: Each row in this matrix represents an executable, and each column represents a different antivirus product (I have omitted the names). A yellow field at row 5 and column 10 means “the fifth executable in the cluster was detected by the tenth AV”, a white field at row 2 and column 1 means “the second executable in the cluster was not detected by the first AV”.

Detection Matrix for 32 samples from the cluster
Rows are executables
Columns are AV engines
. .
. . . . . . . . . . . . . .
. . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . .
. . . .
. .
. .
. .
. . .
. . . . . . . .
. . . . . . . .
. . . . . . .
. . . . .
. . . . . . . .
. . . . . . . . . . . . . .
. . . . . . .
. . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . . . .
. . . . . . . .
. . . . .
. . . . . . .
. . . . . . . . . . . . . .
. . . . . .
. . . . . . .
. . . . . . . . . . . . .
. . . . . . . .
. . . . . . .
. . . . . . . . . .
. . . . . .

What we can see here is the following:

  1. No Antivirus product detects all elements of this cluster
  2. Detection rates vary widely for this cluster: Some AVs detect 25 out of 32 files (78%), some … 0/32 (0%)

If we inspect the names along with the detection results  (which you can do in this table), we can also see which different names are assigned by the different AVs to this malware.

So, let’s run our signature generator on this cluster of executables.

time /opt/vxclass/bin/vxsig “7304 vs 9789.BinDiff” “9789 vs 10041.BinDiff” “10041 vs 10202.BinDiff” “10202 vs 10428.BinDiff” “10428 vs 10654.BinDiff” “10654 vs 10794.BinDiff” “10794 vs 11558.BinDiff” “11558 vs 11658.BinDiff” “11658 vs 12137.BinDiff” “12137 vs 12434.BinDiff” “12434 vs 12723.BinDiff” “12723 vs 13426.BinDiff” “13426 vs 13985.BinDiff” “13985 vs 13995.BinDiff” “13995 vs 14007.BinDiff” “14007 vs 14023.BinDiff” “14023 vs 14050.BinDiff” “14050 vs 14100.BinDiff” “14100 vs 14107.BinDiff” “14107 vs 14110.BinDiff” “14110 vs 14145.BinDiff” “14145 vs 14235.BinDiff” “14235 vs 14240.BinDiff” “14240 vs 14323.BinDiff” “14323 vs 14350.BinDiff” “14350 vs 14375.BinDiff” “14375 vs 14378.BinDiff” “14378 vs 14415.BinDiff” “14415 vs 14424.BinDiff” “14424 vs 14486.BinDiff” “14486 vs 14520.BinDiff” “14520 vs 14549.BinDiff” “14549 vs 14615.BinDiff” “14615 vs 14700.BinDiff”

The entire thing takes roughly 40 seconds to run. The resulting signature can be viewed here.

So, to summarize:

  1. Using VxClass, we can quickly sort new malicious executables into clusters based on the amount of code they share
  2. Using the results from our BinDiff and some clever algorithmic trickery, we can generate “traditional” byte signatures automatically
  3. These signatures are guaranteed to match on all executables that were used in the construction of the signature
  4. The signatures have some predictive power, too: In a drastic example we generated a signature from 15 Swizzor variants that then went on to detect 929 new versions of the malware
  5. These are post-unpacking signatures — e.g. your scanning engine needs to do a halfways decent job at unpacking in order for these signatures to work

If you happen to work for an AV company and think this technology might be useful for you, please contact info@zynamics.com 🙂

Automated signature generation for malware (teaser & help needed)

2010-02-02

Hey all,

I promised a while ago on my personal blog that I would write about the work that has been done here at zynamics regarding the automated extraction of malware signatures. Full details are coming up in the next two to three weeks, but before that, I’d like to ask you, dear reader, for a favour:

We have a number of automatically generated ClamAV signatures here, and while we can test them for false positives locally, our “goodware”-zoo is clearly limited. We would much appreciate if you could take these autogenerated signatures and try to see whether they match on any program that is “goodware”, e.g. known to not be malware.

You can use the above file by simply running “clamscan -d ./auto.generated.sigs.ndb”

Personally, I am really curious to see if any of the signatures end up creating false positives…

Cheers,

Halvar/Thomas

Introducing: The official zynamics blog! :-)

2010-01-18

Dear Readers (and fellow reverse engineers),

welcome to the shiny new zynamics blog!

Over the last several years, most of the zynamics crew has kept their own (personal) blogs, and frequently, topics that were of interest to the reverse engineer were scattered over several different blogs. It was not unusual to have to search through my blog, Ero’s blog, SP’s blog, or Vincenzo’s blog on the quest to finding a particular piece of information.

Also, at least one of those blogs was updated only sporadically (primarily … mine), and intermingled heavily with non-technical rants on the state of the world or the quality of the food in some random pub.

This situation was clearly untenable — and we therefore decided to pool all our reverse-engineering (and zynamics)-related stuff in one place.

On this blog, you will find posts regarding the following topics:

  • General reverse engineering
  • Bug hunting
  • Interesting uses of BinNavi / BinDiff
  • Automated malware classification / signature generation
  • Other things that I can’t think of yet, but that will certainly crop up in due time

So, enjoy the posts, and tell a friend!

Cheers,

Halvar