In the first part of this series I gave a brief overview of the REIL language (Reverse Engineering Intermediate Language), the intermediate language we use in our internal binary code analysis algorithms. I talked about the language in general and what motivated us to create it. In this second part I am going to talk about the REIL instruction set.
As mentioned in the first part of the series, there are only 17 different REIL instructions. We deliberately decided to reduce the instruction set as much as possible without losing too much of the semantics of the original instructions. This was important for us because we primarily use REIL code in abstract interpretation algorithms and the fewer instructions an instruction set has, the less code you have to implement in your abstract interpretation algorithms. It is much easier and faster to write code that processes the effects of 17 different instructions on the program state than it is for 100 instructions.
The REIL instruction set can loosely be divided into five different groups of instructions: Arithmetic instructions, bitwise instructions, data transfer instructions, logical instructions, and other instructions.
Arithmetic instructions
With six different instructions, the group of arithmetic instructions is the biggest group. It contains the instructions ADD (Addition), SUB (Subtraction), MUL (Unsigned multiplication), DIV (Unsigned division), MOD (Unsigned modulo), and BSH (Bitwise shift). Each of these instructions takes two input operands and one output operand where the result of the operation is stored.
[sourcecode]
add eax, 5, ebx // ebx = eax + 5
sub t0, 10, t1 // t1 = t0 – 10
mul t1, 10, t2 // t2 = t1 * 10
div t1, 10, t2 // t2 = t1 / 10
mod t1, 10, t2 // t2 = t1 % 10
bsh t1, -5, t2 // t2 = t1 << 5
[/sourcecode]
ADD and SUB work just like you would expect addition and subtraction to work. The two input operands are added or subtracted and the result of the operation is stored in the output operand.
It is a bit unusual that REIL only supports unsigned multiplication and division operations but it turned out that it is simple to simulate signed multiplication and division using their unsigned counterparts.
The BSH instruction is one of the design mistakes we made in REIL 1.0. We encoded the direction of the shift (bitwise left shift or bitwise right shift) in the sign of the second operand. If the operand is negative, a left shift is executed. If it is positive, a right shift is executed. This makes the interpretation of BSH instructions very difficult especially if the shift-amount is not a constant value (think shl eax, cl on x86) We are planning to replace the BSH instruction with two instructions, LSH and RSH, in future versions of REIL.
Bitwise instructions
The second group of instructions, the bitwise instructions, contains the three instructions AND, OR, and XOR. These instructions behave just the way you expect bitwise AND, OR, and XOR to behave. Each instruction takes two input operands and connects the bits of the input operands using the truth table of the bitwise operation specified in the mnemonic. The result of the bitwise operation is stored in the output operand.
[sourcecode]
and eax, 5, ebx // ebx = eax & 5
or t0, 10, t1 // t1 = t0 | 10
xor t1, 10, t2 // t2 = t1 ^ 10
[/sourcecode]
Data transfer instructions
The third group of instructions, the data transfer instructions, is more interesting again. It contains the instructions STR, LDM, and STM.
The STR instruction (Store Register) copies an integer literal or the content of a register to another register. The source operand is the first operand of the instruction, the target operand is the third operand.
LDM (Load Memory) and STM (Store Memory) are used to access the memory of the simulated process. In the LDM instruction, the first operand specifies the memory address from where a value is loaded. The third operand is the register operand where the loaded value is stored. In the STM instruction, the order of operands is reversed. The first operand specifies the value to be written to memory, the third operand specifies the memory address.
Both LDM and STM can access memory regions of any size in one go. In case of LDM, the size of the accessed memory equals the size of the REIL operand where the loaded value is stored. In case of STM, the size of the accessed memory equals the size of the REIL operand that contains the value to store.
[sourcecode]
str t0, , t1 // t1 = t0
ldm t0, , t1 // t1 = [t0]
stm 33, , t1 // [t1] = 33
[/sourcecode]
Logical instructions
The fourth group of instructions is the group of logical instructions. With two instructions, BISZ and JCC, this group is rather small.
The BISZ instruction (Boolean Is Zero) takes a value in the first operand and checks whether the operand is zero or not. If it is zero, the output operand of the instruction is set to one. Otherwise it is set to zero.
JCC (Jump Conditional) is the only way to execute a branch in the REIL language. The instruction takes a condition in the first operand and if the condition operand evaluates to anything but zero, control is transferred to the instruction at the address specified in the third operand.
[sourcecode]
bisz t0, , t1 // t1 = t0 == 0 ? 1 : 0
jcc t1, , t2 // jump if t1 != 0
[/sourcecode]
Other instructions
The group of other instructions contains the remaining REIL instructions that do not really fit into any other group.
The first instruction is the NOP instruction (no operation). This instruction does not have an effect on the program state. At first it seems useless to have this instruction in the REIL instruction set but it turned out that having this instruction simplifies the translation process from native assembly instructions to REIL instructions in certain edge cases. Of course, we also could have simulated NOP using the other REIL instructions.
The second instruction of this group is the UNDEF instructions. This instruction is used to indicate that a REIL register has an undefined state. The UNDEF instruction became necessary because there are x86 instructions that leave flags in an undefined state.
The third and last instruction of this group is the UNKN instruction. This instruction is a placeholder instruction that is emitted by the REIL translator every time it encounters a native assembly instruction it can not translate.
[sourcecode]
nop , , // Does nothing
undef , , eax // Marks eax as undefined
unkn , , // Translator found an unknown instruction
[/sourcecode]
A word about operands
In the example code above you have already seen that REIL operands are very simple. In fact, REIL operands can only be of three different types:
- Integer literals: Decimal, positive integer numbers
- Registers: String literals like t0 or eax
- REIL addresses: Two integer literals separated by a period character (400.20)
The purpose of integer literals and register operands is the same for REIL as it is for native assembly instructions. REIL addresses are necessary because there are certain native assembly instructions that contain branches within itself (think of the x86 REP instructions). These internal branches are simulated by REIL JCC instructions with REIL addresses as the third operand. I will talk more about REIL addresses in the third part of this series.
All REIL instructions have three of those operands. However, not all instructions require all three operands to be present. Unnecessary operands have a special type ’empty’ that is not printed when you write down a REIL instruction. That’s why in the example code above you see instructions that have operands separated by commas but without any operand names between the commas. What operands are omitted depends on the functionality of the operands. We have always tried to have the first two operands act as input operands while the third operand is the output operand. The BISZ instruction, for example, has the first and third operand but not the unnecessary second input operand.
That’s it for the second part of this series. In the next part I will talk about translating native code to REIL code.
This may be a dumb question or one you are going to handle in the next post, but how do you translate floating-point registers and instructions into REIL?
Hi Mark,
not a dumb question at all. I originally wanted to mention this but I forgot. We are not handling FPU instructions at all (they are all translated to UNKN). REIL is primarily made to find security-relevant bugs in code. FPU code pretty much never plays a role in such code.
That does not mean REIL will never be able to handle FPU code, we have just decided to postpone FPU support until FPU-based security vulnerabilities are common and we know what to look for.
The problem with FPU is that it’s based on stack, and not registers, so any RTL language is not very good for it.
I’m curious how you handle branch instructions and signedness.
Simulating a stack is not actually that difficult in REIL. Since there are both an unlimited amount of registers and an unlimited amount of memory you have enough space left to simulate pretty much everything.
Branches that depend on signedness are actually very straightforward. In case of x86 there is no branch instruction that directly checks the sign. Rather, there is some kind of signed operation before the branch that sets flags according to the result. These signed operations can be simulated using our unsigned instructions. First, REIL code is emitted that converts the original signed input values into a unsigned values. Then the operation is done on the unsigned values. Afterwards, the sign of the result is adjusted by combining the signs of the original input values and the sign of the output value. Flags are then set depending on the sign-adjusted result.
[…] REIL language – Part III By Sebastian Porst In the first and second part of this series I have given an overview of our Reverse Engineering Intermediate Language […]
[…] I gave a short overview of REIL and what we are trying to achieve with it. Then, I talked about the REIL language itself. Finally, I talked about the REIL translators that turn native assembly code into REIL code. In […]
Why are you reinventing the wheel? Things like BitBlaze already convert asm to an intermediate language, and then analyze it to automatically find tons of things, from ways to auto-unpack (renovo) to where a buffer overflow occurs (taintcheck). Are you reinventing the wheel because BitBlaze is so complicated? Well then I’ve got some bad news for you. By the time your stuff gets developed enough to be useful, it’s going to be pretty complicated too!
Your definition:
“Integer literals: Decimal, positive integer numbers”
is contradicted by your “Arithmetic instructions” example line:
“6 bsh t1, -5, t2 // t2 = t1 << 5"
I can't see how REIL would be workable if integer literals cannot be negative.
That also bring into question the definition:
"REIL addresses: Two integer literals separated by a period character (400.20)"
Are they really two integer literals (which I believe can be negative), or are they only positive for REIL addresses?
[…] REIL language» (part 1, part 2, part 3, part […]