Author Archive

PDF Dissector 1.7.0 released

2010-09-03

Today I analyzed a malicious PDF file that contained more than 1100 lines of heavily obfuscated malicious JavaScript code. To make it easier for me to deobfuscate the code, I added two new features to our PDF malware analysis tool PDF Dissector: Variable references and snapshot histories.

The variable references feature shows you where variables are used in JavaScript code. Just place the caret over a variable identifier and all lines that use that variable are shown to you. You can see what this feature looks like in the screenshot below.

Showing all uses of the variable tonsSap

The snapshot history feature allows you to take JavaScript source snapshots of known states. Later on, you can then revert to the source code to the recorded snapshots. This is very useful when you accidentally remove JavaScript code that later turns out to be needed after all. The screenshot below shows you a snapshot tree of four named snapshots I made during different states of the deobfuscation process.

Snapshot history with four snapshots

To learn more about PDF Dissector, please visit the product site or the PDF Dissector manual.

PDF Dissector 1.6.0 released

2010-08-30

Today we are releasing a new version of our PDF malware analysis tool PDF Dissector. This release fixes two PDF parsing bugs reported by our customers. The first bug led to problems when PDF files were using unexpected null-bytes in the PDF file. The second parsing bug led to problems with unexpected PDF comments.

Especially that second parsing bug was very interesting. A customer sent us a PDF malware file that strategically placed PDF comment strings everywhere to confuse PDF parsers. To be able to analyze this file manually, it was also necessary to add a new feature to PDF Dissector. It is now possible to hide PDF comment strings from the PDF browsing tree. Just take a look at the two screenshots below to see why this is really useful.

Obfuscated PDF file without comment string hiding

Obfuscated PDF file with comment string hiding

To learn more about PDF Dissector, please visit the product site or the PDF Dissector manual.

The REIL language – Part IV

2010-08-24

In the first three parts of this series I covered different aspects of our Reverse Engineering Intermediate Language REIL. First, I gave a short overview of REIL and what we are trying to achieve with it. Then, I talked about the REIL language itself. Finally, I talked about the REIL translators that turn native assembly code into REIL code. In this fourth part of the series I want to talk about MonoREIL.

MonoREIL is an abstract interpretation framework based on REIL. We have added this framework to BinNavi to help our customers build their own static binary code analysis algorithms. Since MonoREIL provides a standard implementation of an abstract interpretation algorithm, users of MonoREIL only have to implement a few interfaces that define implementation-specific aspects of their custom analysis algorithm.

MonoREIL algorithms operate on so-called instruction graphs. These graphs represent the REIL code you want to analyze. Each node of the graph represents a single REIL instruction. The edges between nodes represent potential control flow transfers between REIL instructions. This graph structure is provided by the MonoREIL framework itself so you do not have to create this graph yourself.

What you have to do is to specify the order in which the instruction graph will be traversed. Some algorithms are easier to implement starting at the entry node of a function and working towards the exit node. Others benefit from starting at the exit node and walking backwards.

If you want to implement an abstract interpretation algorithm yourself, you have to think about the abstract program state you want to model. Imagine you want to track whether a register influences other registers further down the control flow. In this case, your abstract program state would be a set of influenced registers.

In the first step of your algorithm you assign concrete instances of this abstract program state to each node of the instruction graph. These initial program states reflect prior knowledge you already have about the program before the analysis algorithm is run. For the register tracking algorithm you probably do not have any prior knowledge beyond what register you want to track. The instruction graph node where register tracking begins is initialized with the set the contains just the register you want to track. All other nodes are assigned the empty set.

Tracking EAX: Start (green), use (blue), overwritten (red)

Now MonoREIL begins to traverse the graph in the direction you have specified earlier. For each node it encounters on the way, the input abstract program state and the REIL instruction in the node are used to determine the output abstract program state for the node. This transformation is completely dependent on your concrete analysis algorithm and you have to implement the so-called transformation function yourself. For example, if  the register tracking algorithm is tracking the register set {eax} and comes across the instruction “add eax, 5, ebx”, the transformation function generates the output set {eax, ebx} because from this instruction on, both the registers eax and ebx depend on eax in some way. If the next instruction is “xor ecx, ecx, eax”, the input set {eax, ebx} is changed to {ebx} because the xor instruction sets eax to a value that does not depend on any of the input registers. After the xor instruction, eax does not depend in any way on the tracked registers and can be removed from the register set.

This is where the reduced instruction set of REIL really begins to shine. Since there are only 17 different instructions, you have to implement at most 17 different transformation functions that turn the input abstract program state of a node into an output abstract program state. In practice, you will have to implement far fewer transformation functions because most algorithms can handle transformations for multiple instructions uniformly (just think about register tracking and the add, sub, mul, div instructions).

The transformation functions are sufficient to do abstract interpretation along a single control flow path. However, they can not be used when multiple control flow paths converge into a single control flow path. To handle this special case, all MonoREIL algorithms have to implement a function that can take an arbitrary number of abstract program states and merge them into a single program state. If the register tracking algorithm receives the set {eax} from one path and the set {ebx} from another path, the input set of the node where these control flow paths merge is the union of the two input sets {eax, ebx}. In the node where the two control flow paths merge, both registers are tainted.

That’s pretty much what you have to do to implement your own code analysis algorithm. There are a few minor implementation details you have to consider to make sure that your MonoREIL algorithm terminates and returns the right result. For example, you have to make sure that your abstract program states never lose previously gained information. If they do, it would be possible to lose information in one step and regain it in the next step, leading to infinite loops where information is constantly lost and regained. This monotonous property of abstract program states is what gave MonoREIL its name.

Once you have defined all of the mentioned interfaces, you can tell MonoREIL to run your analysis algorithm. MonoREIL traverses the instruction graph and applies abstract program state transformations along the control flow path. This process continues until no new information about the program state can be discovered. For each node of the instruction graph, MonoREIL then returns the final abstract program state for the node. This information is the result of the analysis algorithm and can further be interpreted and displayed in the GUI by your custom analysis code.

If you want to know more about MonoREIL you can check out the slides of our CanSecWest 2009 presentation Platform-independent static binary code analysis using a meta-assembly language where we implemented a MonoREIL algorithm to detect array access operations with negative array indices. MonoREIL is also mentioned in the slide sets Applications of the reverse engineering language REIL, Automated static deobfuscation in the context of Reverse Engineering, and The Reverse Engineering Language REIL and its Applications.

PDF Dissector 1.5.0 released

2010-08-05

Apart from a few bug fixes, version 1.5.0 of our PDF malware analysis tool PDF Dissector brings two very cool new features.

The first cool new feature is that PDF Dissector now supports the decryption of RC4-encoded strings and streams. This is very useful because there are a few PDF malware samples in the wild that encrypt their strings and streams using RC4 (a standard PDF format feature). In the past, PDF Dissector was not able to analyze these PDF files. From now on, PDF Dissector can be used on those samples too.

The second cool new feature is an improvement to the plugin API that allows plugins to extend the context menu of PDF file nodes in the PDF browsing tree. This was inspired by a customer who asked for a way to generate reports with PDF Dissector. I implemented a small report generator as a Python plugin to make sure that all customers who want to generate reports can easily modify the content and the layout of the generated report.

Extended context menu for generating reports

To learn more about PDF Dissector, please visit the product site or the PDF Dissector manual.

BinNavi 3.0 released

2010-08-02

We are proud to announce that BinNavi 3.0, the latest version of our graph-based reverse engineering tool, was released today. Previous versions of BinNavi have already helped reverse engineers in the IT security industry, in governmental agencies, and academia around the world do their jobs faster and better. The new version of BinNavi will definitely by very welcomed by existing and new customers because it is a huge improvement compared to BinNavi 2.2.

I have already talked about the most important new features in a previous blog post when the first beta of BinNavi 3.0 was announced. For example, I mentioned the improved exporter that is used to get information from IDA Pro into BinNavi. Back then I was excited about processing about 80.000 functions per hour. Between the first beta and the final release we managed to improve this number again. For the screenshots below I exported more than 110.000 functions in just 15 minutes. This improvement alone makes me never want to use earlier versions of BinNavi again.

Let’s take a look at a few new screenshots of BinNavi 3.0.

The BinNavi main window

The general layout of the main window stayed the same. On the left-hand side you can configure databases to work with. In the screenshot you can see that I configured a database called Remote Database that contains 37 modules (disassembled files). These modules can then be loaded to browse through the disassembled code. The right-hand side of the main window changes depending on what node of the project tree you have selected. In the screenshot you can see detailed information for the module kernel32.dll.

Analyzing disassembled code

The second screenshot shows the so-called graph window where you analyze disassembled code. BinNavi always shows disassembled code as flow graphs. This makes it easier for reverse engineers to follow the control flow. In the screenshot I have changed the default color of some basic blocks and told BinNavi to highlight all function call instructions.

Debugging a Cisco 2600 router

The third screenshot shows a BinNavi debugging session. The debug target is a Cisco 2600 router. The next instruction to be executed is highlighted in the graph. On the left-hand side you can see the current register values. On the bottom you can see a dump of the RAM and the current stack.

Collecting program state in trace mode

In cases when manual debugging is not enough, you can switch to trace mode. In trace mode, breakpoints are set on all nodes of the open graph and whenever a breakpoint is hit, the current register values and a small memory snapshot are recorded. This feature is so useful for quickly isolating interesting code that you should really check out the video that demonstrates this feature.

That’s it for now. If you want to learn more about BinNavi please check out the product website and the manual. If you are having any questions please leave a comment or contact the zynamics support.

Dumping shellcode with Pin

2010-07-28

About six weeks ago, when I blogged about the Adobe Reader/Flash 0-day that was making the rounds back then, I talked about generating automated shellcode dumps with Pin. In this post I want to talk a bit about Pin, dynamic binary instrumentation, and the shellcode dumper Pintool we developed at zynamics.

Dynamic binary instrumentation is a technique for analyzing binary files by executing the files and injecting analysis code into the binary file at runtime. This method is not exactly new. It has been in use for many years already, for example in program verification, profiling, and compiler optimization. However, despite its amazing power and ease of use, dynamic binary instrumentation is still not widely used by binary code reverse engineers.

The two most important dynamic binary instrumentation tools for binary code reverse engineers are Pin and DynamoRIO. Pin is developed by Intel and provided by the University of Virginia while DynamoRIO is a collaboration between Hewlett-Packard and MIT. Both are free to use but only DynamoRIO is open source.

In general, both Pin and DynamoRIO are very similar to use. If you want to use either tool you have to write a C or C++ plugin that contains your analysis code. This code is then injected into the target process by Pin or DynamoRIO. For most reverse engineering purposes, Pin and DynamoRIO are both equally useful and when you talk to reverse engineers who make use of dynamic binary instrumentation it often seems to be a matter of personal taste which tool they prefer. At zynamics we use Pin because the API for analysis code seems cleaner to us.

Let’s get back to shellcode dumping now. The idea behind shellcode detection is rather simple: Whenever an instruction is executed, check if that instruction belongs to a section of a loaded module in the address space of the target process. If that’s the case, then the instruction is considered legit (not shellcode). If, however, the instruction is outside of any module section, and therefore most likely on the stack or allocated heap memory, the instruction is considered shellcode. This heuristic is not perfect, of course. However, it works surprisingly well in practice.

In fact, the Pintool we developed does exactly this. For every executed instruction in the target process it performs the check described in the above paragraph. Until shellcode is found, the Pintool keeps track of up to 100 legit instructions executed before the shellcode. Then, when shellcode is found, it dumps the legit instructions before the shellcode and the shellcode itself. The big value of our Pintool is not that it dumps the shellcode. The big value is that it tells you exactly where control flow is transferred from the legit code to the shellcode. With this information you can quickly find the vulnerability in the exploited program. If you are really interested in the shellcode, you can also just set a breakpoint on the last legit instruction before the shellcode and do a manual analysis from there.

You can find the complete documented source of our shellcode dumper Pintool on the zynamics GitHub. The source code is surprisingly short and I did my best to document the code. If you are having any questions about the source code, please leave a comment to this blog entry or contact me in some other way.

Let’s take a look at the output of a sample run now. You can find the full trace log on GitHub but here are the important parts.

What you always want to look for is the string “Executed before”. These are the legit instructions before control flow is transferred to the shellcode. Ignore the first occurence of this string. I have not had a deeper look what code is recognized as shellcode there, but it might be JIT-compiled Adobe Reader JavaScript code (which also does not belong to any section of any module and is therefore detected as shellcode). The second occurence of the string is what you want to look at.

Executed before
0x238C038E::EScript.api  E8 D2 72 F6 FF  call 0x23827665
0x238BBF4F::EScript.api  8B 44 24 04     mov eax, dword ptr [esp+0x4]
0x238BBF53::EScript.api  C6 40 FF 01     mov byte ptr [eax-0x1], 0x1

[ ... more EScript.api instructions ... ]

0x2D841E82::Multimedia.api  56           push esi
0x2D841E83::Multimedia.api  8B 74 24 08  mov esi, dword ptr [esp+0x8]
0x2D841E87::Multimedia.api  85 F6        test esi, esi
0x2D841E89::Multimedia.api  74 22        jz 0x2d841ead
0x2D841E8B::Multimedia.api  56           push esi

[ ... more Multimedia.api instructions ... ]

0x2D841E96::Multimedia.api  8B 10        mov edx, dword ptr [eax]
0x2D841E98::Multimedia.api  8B C8        mov ecx, eax
0x2D841E9A::Multimedia.api  FF 52 04     call dword ptr [edx+0x4]

Shellcode:
0x0A0A0A0A::  0A 0A                      or cl, byte ptr [edx]
0x0A0A0A0C::  0A 0A                      or cl, byte ptr [edx]
0x0A0A0A0E::  0A 0A                      or cl, byte ptr [edx]

[ ... more shellcode ... ]

The log clearly shows that control flow is transferred from EScript.api (the Adobe Reader JavaScript engine) to Multimedia.api (the Adobe Reader multimedia library) and then to shellcode which does not lie in any module section. You can even see that the exploit code controls the memory address of [edx + 0x4] in the last executed instruction of Multimedia.api. With this knowledge you can work back from this point to see what exactly the vulnerability is that was exploited. In the example, the vulnerability was the use-after-free media.newPlayer vulnerability of older Adobe Reader versions. This vulnerability uses JavaScript code to trigger a bug in the Multimedia API.

In case you are wondering about gaps in the instruction trace (for example at calls), please note that each instruction is only dumped once. So, if a function is called twice, the second call is not dumped to the output file anymore. This behavior was added to keep log files small.

I think the shellcode dumper is a good example for a first Pintool. The idea behind it is really simple and the actual Pintool can be improved by interested readers many ways (dump register values or improve the shellcode detection heuristic, for example). If you are making improvements to the tool, please let us know.

PDF Dissector 1.4.0 released

2010-07-22

PDF Dissector 1.4.0 (Product site / Manual) fixes a few PDF parser bugs, improves the Adobe Reader emulation, and adds a cool new feature you can use for searching through all open PDF files.

Here is the detailed list of changes:

  • Feature: Added a way to search through the content of all open files.
  • Feature: Annotation names are now correctly emulated.
  • Bugfix: Fixed a parser bug that led to missed data streams if there was a comment between an object and its stream.
  • Bugfix: Fixed a parser bug that led to crashes when parsing invalid octal numbers.
  • Bugfix: All tabs belonging to a file are now closed when closing the file.

In PDF Dissector 1.4.0 you will now find a text field above the PDF browser where you can enter text strings to search for. The search function searches through dictionary keys, dictionary values, strings, data streams and other elements of PDF files and displays only those elements that match the search string. This is very useful if you want to answer questions like ‘which PDF files of my collection use JavaScript’?

The screenshot below shows how I used the filter to search through about 50 PDF files for exactly those that use the OpenAction command to execute some code when the PDF file is opened.

The new filter function in PDF Dissector 1.4.0

The REIL language – Part III

2010-07-19

In the first and second part of this series I have given an overview of our Reverse Engineering Intermediate Language (REIL) and talked about the purpose and structure of the individual REIL instructions. This third part is about REIL code generation.

REIL is included in BinNavi to help users write their own code analysis algorithms, often based on abstract interpretation. However, obviously our users are not really interested in analyzing REIL code. What they really want is to analyze the native assembly code of their target binary file. The REIL language can nevertheless be used by all users of BinNavi. The translation from native assembly code to REIL code is just as simple and transparent as porting the results of REIL analysis back to the original code.

BinNavi 3.0 ships with REIL translators for ARM, MIPS, PowerPC, and x86 code. Code from any of these native assembly languages can be translated to REIL code with the same effect on the program state as the original code. Users can then analyze the effects of the much simpler REIL code and port the results of the analysis back to the original native assembly code they are really interested in.

The translation process from native assembly code to REIL code is very simple. Given a list of native assembly instructions, each individual instruction is translated to REIL code. In a second pass, the generated linear listing of REIL code is taken and converted into a control flow graph. This control flow graph is then passed to the user and he can use it in his code analysis algorithm.

When writing the translators, we made sure that all native assembly instructions could be translated to REIL code without needing any information about the instructions before or after the current instruction. This allowed us to keep the individual instruction translators stateless.

Generally multiple REIL instructions are emitted by the translators for any native input instruction. Consequently it is not possible to keep a 1:1 mapping between the addresses of the original assembly instructions and the addresses of their corresponding REIL instructions. The solution we found was to multiply every native address by 0x100 to calculate the base address of all REIL instructions that belong to a native instruction. The lowest byte of such REIL addresses can then be filled by an index that specifies the relative position of a single REIL instruction inside the sequence of REIL instructions generated from the same native input instruction. What I mean here is that if you have a native instruction at address 0x08 the REIL translators will generate REIL instructions for it at addresses 0x800, 0x801, 0x802, …

This way of translating native instruction addresses to REIL instruction addresses limits us to at most 256 REIL instructions for each native instruction. In practice, even getting to 30 REIL instructions for one native instruction is rare although we have a few outliers that are translated to up to 70 REIL instructions.

There is another big advantage to this address translation method. For any given REIL instruction you can immediately determine its native source instruction. There is no need to consider any additional context around the REIL instruction. This is really important if you want to port the results of a REIL analysis algorithm back to the original code. If you have determined a result for a REIL instruction you can just divide its REIL address to 0x100 and you know the original instruction for which the result holds too.

Translating x86 code to REIL code

That’s it for now. If you have any questions about the translation process or REIL in general please leave a comment.

We are hiring a new BinNavi developer

2010-07-15

After we have already hired Tim and Jose this year to join zynamics as full-time employees, we are now looking to extend our team once again. This time we are looking for a software developer who wants to join the BinNavi team.

BinNavi is a binary code reverse engineering tool that enables reverse engineers to analyze binary code. Customers of BinNavi are primarily vulnerability researchers from companies and governmental organizations of many different nations that try to find new 0-days in closed-source software.

Working on BinNavi means you will be working on a large application with more than 500.000 lines of code. The majority of that code is written in Java (the whole main program), a few ten-thousand lines of code are written in C++ (the debuggers and the IDA Pro plugin to get disassembly data from IDA Pro into a MySQL database). You are a good candidate if you know how to write clean code and you have a reverse engineering background that gives you an idea about what features are useful to our customers.

It is also crucial that you are self-motivated and have a clear idea of where development should be going. At zynamics, there is very little management from above. Rather, the individual teams (like the BinNavi team) decide themselves what features to prioritize next and when to schedule the next release of a project.

There are a few perks that make working for zynamics really pay off. There is the obvious one: you will work on the cutting edge of reverse engineering tool research and development. However, there are others. For example, you can attend as many IT security conferences as you want to provided you give a talk there and the organizers pay for your flight and hotel (which nearly all IT security conferences do, so just submit somewhere and be accepted to speak). There is a nearly unlimited budget for computer science books (all employees know the password to the corporate Amazon account and can buy at will). You will meet many of our customers who come from all walks of life and have amazing stories to tell.

Of course there are downsides to the job, too. The primary issue we face again and again when filling job positions is that we do not want any remote workers. You would have to move to Bochum, Germany for the job and work from our office (working from home two times a week or so is OK). Since we want to fill this position quickly (preferably you would start August 1st but no later than August 15th) we can not consider candidates that require a work permit that takes longer to process. Except for this, we welcome applications from software developers of all backgrounds.

Please note that this is not a reverse engineering job. On the job you will most likely not be doing a lot of reverse engineering beyond what is required to test BinNavi. What you can do, however, is to implement new code analysis algorithms that improve the usefulness of BinNavi to our customers.

If you are interested in this job, please send an email to info@zynamics.com to request more information. Or just send your resume and some piece of code you wrote that makes us want to hire you.

ReCon slides – How to really obfuscate your PDF malware

2010-07-13

Last Friday I was at ReCon in Montreal to give a talk about obfuscated PDF malware. I got the idea for the talk during my work on PDF Dissector where I saw a lot of obfuscated PDF malware. The obfuscation I saw in the wild was mostly very limited and the malware authors did not seem to think things through to the very end. I took the opportunity to think a bit further about the whole topic of PDF malware obfuscation and a few of the result of these thoughts can be seen in the slides below. If you do not have Flash enabled, click here to download the slides.