In the first part of this series I gave a brief overview of the REIL language (Reverse Engineering Intermediate Language), the intermediate language we use in our internal binary code analysis algorithms. I talked about the language in general and what motivated us to create it. In this second part I am going to talk about the REIL instruction set.
As mentioned in the first part of the series, there are only 17 different REIL instructions. We deliberately decided to reduce the instruction set as much as possible without losing too much of the semantics of the original instructions. This was important for us because we primarily use REIL code in abstract interpretation algorithms and the fewer instructions an instruction set has, the less code you have to implement in your abstract interpretation algorithms. It is much easier and faster to write code that processes the effects of 17 different instructions on the program state than it is for 100 instructions.
The REIL instruction set can loosely be divided into five different groups of instructions: Arithmetic instructions, bitwise instructions, data transfer instructions, logical instructions, and other instructions.
Arithmetic instructions
With six different instructions, the group of arithmetic instructions is the biggest group. It contains the instructions ADD (Addition), SUB (Subtraction), MUL (Unsigned multiplication), DIV (Unsigned division), MOD (Unsigned modulo), and BSH (Bitwise shift). Each of these instructions takes two input operands and one output operand where the result of the operation is stored.
[sourcecode]
add eax, 5, ebx // ebx = eax + 5
sub t0, 10, t1 // t1 = t0 – 10
mul t1, 10, t2 // t2 = t1 * 10
div t1, 10, t2 // t2 = t1 / 10
mod t1, 10, t2 // t2 = t1 % 10
bsh t1, -5, t2 // t2 = t1 << 5
[/sourcecode]
ADD and SUB work just like you would expect addition and subtraction to work. The two input operands are added or subtracted and the result of the operation is stored in the output operand.
It is a bit unusual that REIL only supports unsigned multiplication and division operations but it turned out that it is simple to simulate signed multiplication and division using their unsigned counterparts.
The BSH instruction is one of the design mistakes we made in REIL 1.0. We encoded the direction of the shift (bitwise left shift or bitwise right shift) in the sign of the second operand. If the operand is negative, a left shift is executed. If it is positive, a right shift is executed. This makes the interpretation of BSH instructions very difficult especially if the shift-amount is not a constant value (think shl eax, cl on x86) We are planning to replace the BSH instruction with two instructions, LSH and RSH, in future versions of REIL.
Bitwise instructions
The second group of instructions, the bitwise instructions, contains the three instructions AND, OR, and XOR. These instructions behave just the way you expect bitwise AND, OR, and XOR to behave. Each instruction takes two input operands and connects the bits of the input operands using the truth table of the bitwise operation specified in the mnemonic. The result of the bitwise operation is stored in the output operand.
[sourcecode]
and eax, 5, ebx // ebx = eax & 5
or t0, 10, t1 // t1 = t0 | 10
xor t1, 10, t2 // t2 = t1 ^ 10
[/sourcecode]
Data transfer instructions
The third group of instructions, the data transfer instructions, is more interesting again. It contains the instructions STR, LDM, and STM.
The STR instruction (Store Register) copies an integer literal or the content of a register to another register. The source operand is the first operand of the instruction, the target operand is the third operand.
LDM (Load Memory) and STM (Store Memory) are used to access the memory of the simulated process. In the LDM instruction, the first operand specifies the memory address from where a value is loaded. The third operand is the register operand where the loaded value is stored. In the STM instruction, the order of operands is reversed. The first operand specifies the value to be written to memory, the third operand specifies the memory address.
Both LDM and STM can access memory regions of any size in one go. In case of LDM, the size of the accessed memory equals the size of the REIL operand where the loaded value is stored. In case of STM, the size of the accessed memory equals the size of the REIL operand that contains the value to store.
[sourcecode]
str t0, , t1 // t1 = t0
ldm t0, , t1 // t1 = [t0]
stm 33, , t1 // [t1] = 33
[/sourcecode]
Logical instructions
The fourth group of instructions is the group of logical instructions. With two instructions, BISZ and JCC, this group is rather small.
The BISZ instruction (Boolean Is Zero) takes a value in the first operand and checks whether the operand is zero or not. If it is zero, the output operand of the instruction is set to one. Otherwise it is set to zero.
JCC (Jump Conditional) is the only way to execute a branch in the REIL language. The instruction takes a condition in the first operand and if the condition operand evaluates to anything but zero, control is transferred to the instruction at the address specified in the third operand.
[sourcecode]
bisz t0, , t1 // t1 = t0 == 0 ? 1 : 0
jcc t1, , t2 // jump if t1 != 0
[/sourcecode]
Other instructions
The group of other instructions contains the remaining REIL instructions that do not really fit into any other group.
The first instruction is the NOP instruction (no operation). This instruction does not have an effect on the program state. At first it seems useless to have this instruction in the REIL instruction set but it turned out that having this instruction simplifies the translation process from native assembly instructions to REIL instructions in certain edge cases. Of course, we also could have simulated NOP using the other REIL instructions.
The second instruction of this group is the UNDEF instructions. This instruction is used to indicate that a REIL register has an undefined state. The UNDEF instruction became necessary because there are x86 instructions that leave flags in an undefined state.
The third and last instruction of this group is the UNKN instruction. This instruction is a placeholder instruction that is emitted by the REIL translator every time it encounters a native assembly instruction it can not translate.
[sourcecode]
nop , , // Does nothing
undef , , eax // Marks eax as undefined
unkn , , // Translator found an unknown instruction
[/sourcecode]
A word about operands
In the example code above you have already seen that REIL operands are very simple. In fact, REIL operands can only be of three different types:
- Integer literals: Decimal, positive integer numbers
- Registers: String literals like t0 or eax
- REIL addresses: Two integer literals separated by a period character (400.20)
The purpose of integer literals and register operands is the same for REIL as it is for native assembly instructions. REIL addresses are necessary because there are certain native assembly instructions that contain branches within itself (think of the x86 REP instructions). These internal branches are simulated by REIL JCC instructions with REIL addresses as the third operand. I will talk more about REIL addresses in the third part of this series.
All REIL instructions have three of those operands. However, not all instructions require all three operands to be present. Unnecessary operands have a special type ’empty’ that is not printed when you write down a REIL instruction. That’s why in the example code above you see instructions that have operands separated by commas but without any operand names between the commas. What operands are omitted depends on the functionality of the operands. We have always tried to have the first two operands act as input operands while the third operand is the output operand. The BISZ instruction, for example, has the first and third operand but not the unnecessary second input operand.
That’s it for the second part of this series. In the next part I will talk about translating native code to REIL code.