Archive for August, 2010

PDF Dissector 1.6.0 released

Monday, August 30th, 2010

Today we are releasing a new version of our PDF malware analysis tool PDF Dissector. This release fixes two PDF parsing bugs reported by our customers. The first bug led to problems with PDF files that contain unexpected null bytes. The second bug led to problems with unexpected PDF comments.

The second parsing bug was especially interesting. A customer sent us a PDF malware file that strategically placed PDF comment strings everywhere to confuse PDF parsers. To make this file manageable for manual analysis, we also added a new feature to PDF Dissector: it is now possible to hide PDF comment strings in the PDF browsing tree. Just take a look at the two screenshots below to see why this is really useful.

Obfuscated PDF file without comment string hiding

Obfuscated PDF file with comment string hiding
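
To give a rough idea of what this kind of obfuscation looks like, here is a toy illustration (made up for this post, not taken from the customer's sample). In PDF syntax, everything from a % character to the end of the line is a comment, so an attacker can riddle the object structure with distracting comment strings:

    1 0 obj       % JUNK JUNK JUNK
    <<            % JUNK JUNK JUNK
    /Type         % JUNK JUNK JUNK
    /Catalog      % JUNK JUNK JUNK
    >>            % JUNK JUNK JUNK
    endobj        % JUNK JUNK JUNK

A parser that does not expect comments in all of these positions gets confused, and a human analyst scrolling through hundreds of such lines quickly loses track of the actual content.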

To learn more about PDF Dissector, please visit the product site or the PDF Dissector manual.

The REIL language – Part IV

Tuesday, August 24th, 2010

In the first three parts of this series I covered different aspects of our Reverse Engineering Intermediate Language REIL. First, I gave a short overview of REIL and what we are trying to achieve with it. Then, I talked about the REIL language itself. Finally, I talked about the REIL translators that turn native assembly code into REIL code. In this fourth part of the series I want to talk about MonoREIL.

MonoREIL is an abstract interpretation framework based on REIL. We have added this framework to BinNavi to help our customers build their own static binary code analysis algorithms. Since MonoREIL provides a standard implementation of an abstract interpretation algorithm, users of MonoREIL only have to implement a few interfaces that define implementation-specific aspects of their custom analysis algorithm.

MonoREIL algorithms operate on so-called instruction graphs. These graphs represent the REIL code you want to analyze: each node of the graph represents a single REIL instruction, and the edges between nodes represent potential control flow transfers between REIL instructions. This graph structure is provided by the MonoREIL framework itself, so you do not have to create it yourself.

What you do have to specify is the order in which the instruction graph is traversed. Some algorithms are easier to implement starting at the entry node of a function and working towards the exit node; others benefit from starting at the exit node and walking backwards.
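
As a rough sketch (with hypothetical names, not the actual MonoREIL interfaces), the instruction graph and the traversal direction could look like this in Java:

    import java.util.List;

    // One node per REIL instruction; edges model potential control flow
    // transfers. ReilInstruction is assumed to be the framework's class
    // for a single REIL instruction.
    interface InstructionGraphNode {
      ReilInstruction getInstruction();
      List<InstructionGraphNode> getChildren(); // potential successors
      List<InstructionGraphNode> getParents();  // potential predecessors
    }

    // The traversal order an analysis chooses up front.
    enum AnalysisDirection {
      DOWN, // from the entry node towards the exit node
      UP    // from the exit node backwards towards the entry node
    }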

If you want to implement an abstract interpretation algorithm yourself, you have to think about the abstract program state you want to model. Imagine you want to track whether a register influences other registers further down the control flow. In this case, your abstract program state would be a set of influenced registers.

In the first step of your algorithm, you assign concrete instances of this abstract program state to each node of the instruction graph. These initial program states reflect prior knowledge you already have about the program before the analysis algorithm is run. For the register tracking algorithm you probably do not have any prior knowledge beyond which register you want to track. The instruction graph node where register tracking begins is initialized with the set that contains just the register you want to track. All other nodes are assigned the empty set.

Tracking EAX: Start (green), use (blue), overwritten (red)
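
In code, the abstract program state for this register tracking example boils down to a wrapper around a set of register names. The following is a minimal Java sketch with invented names, not the actual MonoREIL API:

    import java.util.HashSet;
    import java.util.Set;

    // Abstract program state: the set of registers that currently
    // depend on the tracked register.
    class RegisterSetState {
      final Set<String> tainted = new HashSet<String>();

      // Initial state for most nodes: the empty set (no prior knowledge).
      RegisterSetState() {
      }

      // Initial state for the node where tracking begins: just the
      // register we want to track, e.g. "eax".
      RegisterSetState(final String trackedRegister) {
        tainted.add(trackedRegister);
      }
    }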

Now MonoREIL begins to traverse the graph in the direction you specified earlier. For each node it encounters on the way, the input abstract program state and the REIL instruction in the node are used to determine the node’s output abstract program state. This transformation depends entirely on your concrete analysis algorithm, and you have to implement the so-called transformation function yourself. For example, if the register tracking algorithm is tracking the register set {eax} and comes across the instruction “add eax, 5, ebx”, the transformation function generates the output set {eax, ebx} because, from this instruction on, both eax and ebx depend on eax in some way. If the next instruction is “xor ecx, ecx, eax”, the input set {eax, ebx} is changed to {ebx} because the xor instruction sets eax to a value that does not depend on any of the input registers. After the xor instruction, eax no longer depends on the tracked registers in any way and can be removed from the register set.
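
A transformation function for the register tracking example could look roughly like the sketch below. The instruction accessors are invented for illustration, and a real implementation would also have to handle operand sizes, subregisters, literals, and the memory instructions:

    // Computes a node's output state from its input state and the node's
    // REIL instruction (three-operand form: in1, in2, out).
    RegisterSetState transform(final ReilInstruction insn, final RegisterSetState in) {
      final RegisterSetState out = new RegisterSetState();
      out.tainted.addAll(in.tainted);

      final String target = insn.getThirdOperand(); // the written register
      if (in.tainted.contains(insn.getFirstOperand())
          || in.tainted.contains(insn.getSecondOperand())) {
        out.tainted.add(target);    // "add eax, 5, ebx": {eax} -> {eax, ebx}
      } else {
        out.tainted.remove(target); // "xor ecx, ecx, eax": {eax, ebx} -> {ebx}
      }
      return out;
    }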

This is where the reduced instruction set of REIL really begins to shine. Since there are only 17 different instructions, you have to implement at most 17 different transformation functions that turn the input abstract program state of a node into an output abstract program state. In practice, you will have to implement far fewer transformation functions because most algorithms can handle transformations for multiple instructions uniformly (just think about register tracking and the add, sub, mul, div instructions).

The transformation functions are sufficient to do abstract interpretation along a single control flow path. However, they cannot be used when multiple control flow paths converge into a single path. To handle this case, all MonoREIL algorithms have to implement a function that takes an arbitrary number of abstract program states and merges them into a single program state. If the register tracking algorithm receives the set {eax} from one path and the set {ebx} from another path, the input set of the node where these control flow paths merge is the union of the two input sets, {eax, ebx}. In that node, both registers are tainted.
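
For the register tracking example, the merge function is simply the set union. As a sketch, reusing the invented RegisterSetState class from above:

    import java.util.List;

    // Merges the abstract states arriving over several incoming control
    // flow paths into a single state.
    RegisterSetState combine(final List<RegisterSetState> incoming) {
      final RegisterSetState merged = new RegisterSetState();
      for (final RegisterSetState state : incoming) {
        merged.tainted.addAll(state.tainted); // {eax} merged with {ebx} gives {eax, ebx}
      }
      return merged;
    }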

That’s pretty much all you have to do to implement your own code analysis algorithm. There are a few minor implementation details you have to consider to make sure that your MonoREIL algorithm terminates and returns the right result. For example, you have to make sure that your abstract program states never lose previously gained information. If they could, information might be lost in one step and regained in the next, leading to an infinite loop in which information is constantly lost and regained. This monotonicity requirement on abstract program states is what gave MonoREIL its name.

Once you have implemented all of the mentioned interfaces, you can tell MonoREIL to run your analysis algorithm. MonoREIL traverses the instruction graph and applies the abstract program state transformations along the control flow paths. This process continues until a fixed point is reached, that is, until no new information about the program state can be discovered. MonoREIL then returns the final abstract program state for each node of the instruction graph. This information is the result of the analysis algorithm and can be interpreted further and displayed in the GUI by your custom analysis code.
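
Put together, a driver for the register tracking analysis could look like the following sketch. All of the names here (InstructionGraph, MonoReilSolver, AnalysisResult) are invented for illustration; the actual MonoREIL entry points differ:

    // Build the instruction graph, define the initial state, and let the
    // framework iterate until a fixed point is reached.
    InstructionGraph graph = InstructionGraph.create(reilCode);
    RegisterSetState initialState = new RegisterSetState("eax");
    AnalysisResult<RegisterSetState> result =
        MonoReilSolver.solve(graph, AnalysisDirection.DOWN, startNode, initialState);
    // 'result' maps each instruction graph node to its final abstract state.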

If you want to know more about MonoREIL you can check out the slides of our CanSecWest 2009 presentation Platform-independent static binary code analysis using a meta-assembly language where we implemented a MonoREIL algorithm to detect array access operations with negative array indices. MonoREIL is also mentioned in the slide sets Applications of the reverse engineering language REIL, Automated static deobfuscation in the context of Reverse Engineering, and The Reverse Engineering Language REIL and its Applications.

BinDiff 3.2 public beta phase starts today

Tuesday, August 17th, 2010

Because this is my first post here, I would like to introduce myself briefly. I have been working for zynamics since 2006, and my primary task is Java development. My main product responsibilities are the VxClass similarity graph visualization applet, parts of BinNavi, and the complete BinDiff GUI.

Today we are officially starting the BinDiff 3.2 public beta phase. Besides many bug fixes, the quality of the diff engine has been improved. This version also ships with a new C++-based exporter plugin for IDA that unifies the export process between BinNavi and BinDiff.

Here is an excerpt of the change list:

  • Added new matching algorithms (e.g. loop head matching)
  • Improved performance (better utilization of multi-core machines)
  • Support for multiple simultaneous differ instances
  • Memory optimizations
  • Function matches can be deleted and added manually
  • Colored matched function table rows (according to their similarity)
  • Various usability changes like wait dialogs for long-running operations
  • ~80 more fixes and improvements in the exporter
  • ~50 more fixes and improvements in the BinDiff core engine
  • ~15 more fixes and improvements in the BinDiff GUI

We also increased our test database of IDB samples for stress testing the exporter and the BinDiff core engine to ~10GB. This includes large non-malicious samples (Cisco router images, Firefox, Acrobat, …), patch diff samples (Microsoft, MMO games, …), a huge selection of small but heavily obfuscated malware samples, and some really obscure platforms. Every IDB we have ever received with a bug report or feature request is included in this database – please do keep ’em coming! In addition, our distributed BinDiff engine, integrated with zynamics VxClass, our malware clustering product, churns through gigabytes of fresh malware samples each month.

If you are a zynamics customer and interested in participating in the beta phase, please write an e-mail to support@zynamics.com. Any kind of feedback, feature request, or bug report is very much appreciated.

Challenging conventional wisdom on AV signatures (Part 2 of 2)

Friday, August 13th, 2010

A while ago I posted a blog entry called “challenging conventional wisdom on AV signatures (Part 1 of 2)”. There, I argued that the fundamental problem with traditional AV byte signatures is the lack of information asymmetry: the defender and the attacker both have access to the same information, and the attacker can run a potentially infinite number of test runs to make sure he can reliably bypass all of the defensive measures the defender has taken.

The important thing to take away from that blog post is that the problem with AV signatures is not inherent to “signatures” – it is a matter of information symmetry.

Now, how can one change this situation? Is there a clever way to make traditional byte signatures useful again? Can we somehow introduce information asymmetry in a productive manner?

To investigate this, we have to recall another blog post in which I described some of our results on generating “smart” signatures (this appears to be AV lingo for signatures that are not checksum-based, but consist of bytes and wildcards). The summary of that post was more or less: “With the algorithms underpinning VxClass, we can not only automatically cluster malicious software into groups, we can also generate signatures for each group automatically. And one signature will match the entire group.”

There was one small bit of information missing in that post that will make this post interesting: We can usually generate dozens, if not hundreds, of different signatures for the same cluster of malware. These signatures match, by construction, on all samples of a particular cluster, but they have nothing in common – they match on different bits of the code.
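
Purely as an illustration (these byte patterns are made up, not real generated signatures), two signatures for the same cluster might look like this, where ?? is a wildcard byte:

    6A 00 68 ?? ?? ?? ?? E8 ?? ?? ?? ?? 85 C0 74 ??
    8B 45 08 50 FF 15 ?? ?? ?? ?? 3B C6 75 ?? 5E

Both match every sample in the cluster by construction, but they cover entirely different code locations, so knowing one of them reveals nothing about the others.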

Where does this leave us? Well, it leaves us with a pretty cool system that we call VxClass for Financials (although it is possible to substitute ‘Financials’ with other large verticals that are often victims of targeted attacks). The system works as follows:

  1. Different financial institutions each get a user account on a centralized VxClass server
  2. Users upload malicious software that they have recovered (using tools such as Memoryze) from their own systems
  3. Users are anonymous by default
  4. Users can see how malware they upload clusters; they can also see how similar their malware is to malware other users uploaded
  5. Users can only download their own malware, not the malware of other users
  6. For each cluster, users can generate a personalized detection signature that no other user will ever see

Why is this cool? Well, for one thing, every user profits from uploading to the system: the more samples are present in one cluster, the better the predictive power of the signatures. At the same time, users do not have to share any confidential information with each other; they are encouraged to, but they do not have to. Finally, even if some users of the system are sloppy and leak their signatures to the attacker, they only endanger themselves. Everybody else has their own signatures and will thus not be affected by the leak. This is important: normally, when I share methods of detection with others, I risk losing them. Not here.

We are starting an evaluation/beta program for the system in the next 1-2 weeks, at the moment targeted at the financial sector. If you find this interesting, work for a financial institution, and want to participate in our test drive, please contact us at info@zynamics.com!

++staff (as opposed to staff++)

Tuesday, August 10th, 2010

Hi everyone,

my name is Jan Newger and I’ve just recently joined the team at zynamics. Previously, I finished my diploma thesis at RWTH Aachen. It was about code virtualization techniques and the mechanisms that can be applied to extract readable code from such protection systems. In the past, I mostly fiddled with reverse engineering, and especially anti-reverse-engineering techniques, and wrote some libraries as well as a few IDA Pro plugins.

From now on, I’ll be working together with Tim and Sebastian on BinNavi, and I’m looking forward to developing my skillz in such an excellent team of researchers here at zynamics.

PDF Dissector 1.5.0 released

Thursday, August 5th, 2010

Apart from a few bug fixes, version 1.5.0 of our PDF malware analysis tool PDF Dissector brings two very cool new features.

The first cool new feature is that PDF Dissector now supports the decryption of RC4-encrypted strings and streams. This is very useful because there are a few PDF malware samples in the wild that encrypt their strings and streams using RC4 (a standard PDF format feature). In the past, PDF Dissector was not able to analyze these PDF files. From now on, PDF Dissector can be used on those samples too.
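
For reference, RC4 itself is a tiny algorithm. Here is a minimal textbook implementation in Java (this is just the standard key scheduling and keystream generation, not PDF Dissector’s code; real PDF decryption additionally derives a per-object key from the document’s encryption dictionary):

    // Textbook RC4: the key-scheduling algorithm (KSA) followed by the
    // pseudo-random generation algorithm (PRGA). Because RC4 is a stream
    // cipher XORed onto the data, the same function encrypts and decrypts.
    static byte[] rc4(final byte[] key, final byte[] data) {
      final int[] s = new int[256];
      for (int i = 0; i < 256; i++) {
        s[i] = i;
      }
      for (int i = 0, j = 0; i < 256; i++) { // KSA
        j = (j + s[i] + (key[i % key.length] & 0xff)) & 0xff;
        final int t = s[i]; s[i] = s[j]; s[j] = t;
      }
      final byte[] out = new byte[data.length];
      for (int n = 0, i = 0, j = 0; n < data.length; n++) { // PRGA
        i = (i + 1) & 0xff;
        j = (j + s[i]) & 0xff;
        final int t = s[i]; s[i] = s[j]; s[j] = t;
        out[n] = (byte) (data[n] ^ s[(s[i] + s[j]) & 0xff]);
      }
      return out;
    }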

The second cool new feature is an improvement to the plugin API that allows plugins to extend the context menu of PDF file nodes in the PDF browsing tree. This was inspired by a customer who asked for a way to generate reports with PDF Dissector. I implemented a small report generator as a Python plugin to make sure that all customers who want to generate reports can easily modify the content and the layout of the generated report.

Extended context menu for generating reports

To learn more about PDF Dissector, please visit the product site or the PDF Dissector manual.

BinNavi 3.0 released

Monday, August 2nd, 2010

We are proud to announce that BinNavi 3.0, the latest version of our graph-based reverse engineering tool, was released today. Previous versions of BinNavi have already helped reverse engineers in the IT security industry, in government agencies, and in academia around the world do their jobs faster and better. The new version will definitely be very welcome to existing and new customers because it is a huge improvement over BinNavi 2.2.

I have already talked about the most important new features in a previous blog post when the first beta of BinNavi 3.0 was announced. For example, I mentioned the improved exporter that is used to get information from IDA Pro into BinNavi. Back then I was excited about processing roughly 80,000 functions per hour. Between the first beta and the final release we managed to improve this number again. For the screenshots below I exported more than 110,000 functions in just 15 minutes, a rate of well over 400,000 functions per hour. This improvement alone makes me never want to use earlier versions of BinNavi again.

Let’s take a look at a few new screenshots of BinNavi 3.0.

The BinNavi main window

The general layout of the main window stayed the same. On the left-hand side you can configure databases to work with. In the screenshot you can see that I configured a database called Remote Database that contains 37 modules (disassembled files). These modules can then be loaded to browse through the disassembled code. The right-hand side of the main window changes depending on what node of the project tree you have selected. In the screenshot you can see detailed information for the module kernel32.dll.

Analyzing disassembled code

The second screenshot shows the so-called graph window where you analyze disassembled code. BinNavi always shows disassembled code as flow graphs. This makes it easier for reverse engineers to follow the control flow. In the screenshot I have changed the default color of some basic blocks and told BinNavi to highlight all function call instructions.

Debugging a Cisco 2600 router

The third screenshot shows a BinNavi debugging session. The debug target is a Cisco 2600 router. The next instruction to be executed is highlighted in the graph. On the left-hand side you can see the current register values. On the bottom you can see a dump of the RAM and the current stack.

Collecting program state in trace mode

In cases when manual debugging is not enough, you can switch to trace mode. In trace mode, breakpoints are set on all nodes of the open graph and whenever a breakpoint is hit, the current register values and a small memory snapshot are recorded. This feature is so useful for quickly isolating interesting code that you should really check out the video that demonstrates this feature.

That’s it for now. If you want to learn more about BinNavi, please check out the product website and the manual. If you have any questions, please leave a comment or contact zynamics support.