Assembly language is a low-level programming language and is not generally used by developers to write full-blown applications. Applications are usually written in a high-level programming language, such as C, and compiled into machine code. Machine code is the binary encoding of the instructions that the CPU understands. Assembly language is a human-readable representation of machine code.
We write shellcode in assembly language because we want to inject small pieces of code into memory without all of the overhead of a fully compiled application.
The following diagram shows a high-level view of x86 architecture.
The three buses are used to carry data, control instructions, and addressing. These aren’t that important in the context of writing shellcode, so they will not be discussed.
The next sections will give enough information to begin writing basic shellcode in x86 assembly language.
The Arithmetic Logic Unit (ALU) is the brains of the CPU. The ALU carries out calculations, compares values, increments values and so on. Once values have been processed, the result is generally stored in a general purpose register. For example, the following instruction is processed by the ALU, and the CPU control unit (which we will not discuss) saves the result in the eax register:
add eax, ecx ; add the value in the ecx register
; to the value in the eax register and store in the eax register
Registers are a type of memory that other parts of the CPU can access very quickly, but that is expensive (so the capacity is small). RAM is cheaper (so the capacity is larger) but slower to access (in CPU terms). For this reason the CPU contains registers that can be used for rapid storage of values up to and including 32 bits in length.
The registers that we use the most when writing shellcode are the general purpose registers:
If we want to access only the lower 16 bits of the eax register we refer to it as ax, the lower 8 bits as al, and the higher 8 bits of ax we refer to as ah. This can be useful in shellcode, which will be discussed in a later article.
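The relationship between eax, ax, al and ah can be modelled with bit masks. The sketch below is Python rather than assembly, purely so it can be run anywhere. The function name sub_registers is hypothetical and the register is treated as a plain integer.

```python
def sub_registers(eax: int) -> dict:
    """Return the sub-register views of a 32-bit value held in eax (model only)."""
    eax &= 0xFFFFFFFF        # a register holds exactly 32 bits
    ax = eax & 0xFFFF        # ax: the lower 16 bits of eax
    al = ax & 0xFF           # al: the lower 8 bits of ax
    ah = (ax >> 8) & 0xFF    # ah: the higher 8 bits of ax
    return {"eax": eax, "ax": ax, "al": al, "ah": ah}


views = sub_registers(0x11223344)
print({name: hex(value) for name, value in views.items()})
# {'eax': '0x11223344', 'ax': '0x3344', 'al': '0x44', 'ah': '0x33'}
```

Note that the upper 16 bits of eax (0x1122 here) have no name of their own.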
We can move values into registers and carry out arithmetic operations on them. For example:
mov eax, 0x10 ; move the value 0x10 into the eax register
add eax, 0x20 ; the eax register now stores the value 0x30
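Because a register is only 32 bits wide, the result of an add wraps around if it does not fit. The following Python sketch models that behaviour under the assumption that a register is just an integer masked to 32 bits. The helper name add32 is made up for illustration, and the carry flag that a real CPU would set is ignored.

```python
MASK32 = 0xFFFFFFFF  # all 32 bits set


def add32(dst: int, src: int) -> int:
    """Model 'add dst, src' on 32-bit registers: the sum is truncated to 32 bits."""
    return (dst + src) & MASK32


eax = 0x10                 # mov eax, 0x10
eax = add32(eax, 0x20)     # add eax, 0x20
print(hex(eax))            # 0x30

# A sum that does not fit in 32 bits wraps around to zero.
print(hex(add32(0xFFFFFFFF, 0x1)))  # 0x0
```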
Two registers of note are esp, the extended stack pointer, and ebp, the extended base pointer. These will be discussed in the stack section.
Most people who have done any sort of buffer overflow study will know what eip is. It is the extended instruction pointer, and it holds the address of the next instruction to be executed.
Flags are used to control the flow of code. Conditional statements and loops in C, such as the following, are compiled into instructions that set and test these flags:
if (x > 10) { /* do something */ }
When the CPU carries out arithmetic or comparison operations it updates flags in the EFLAGS register depending upon the result of the operation. Consider the following:
mov ecx, 0x10 ; assign the value 0x10 to the ecx register
mov eax, 0x10 ; assign the value 0x10 to the eax register
cmp eax, ecx ; compare the value in eax to the value in ecx
; the values are the same so the zero-flag (ZF) is set to 1
We can then control the flow of the instructions using a conditional jump such as jnz (jump if not zero), which jumps only when ZF is 0:
jnz do_something ; jump to do_something if the ZF is 0
; (here the ZF is 1, so the jump is not taken)
These flags will be discussed when we use them in our shellcode. Note that do_something, like a comment, is not machine code. It is a label, used in assembly language to name a group of instructions (much like a function), and the assembler replaces it with an address or offset. This makes it easier to organise your shellcode when writing it.
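To make the relationship between cmp, the zero flag and the conditional jumps concrete, here is a small Python model. It assumes cmp behaves like a subtraction whose result is discarded and only the flags are kept. The function names cmp_zf, jnz_taken and jz_taken are invented for this sketch, and only ZF is modelled.

```python
MASK32 = 0xFFFFFFFF


def cmp_zf(a: int, b: int) -> int:
    """Return the ZF that 'cmp a, b' would produce: 1 if a - b is zero, else 0."""
    return 1 if ((a - b) & MASK32) == 0 else 0


def jnz_taken(zf: int) -> bool:
    """jnz (jump if not zero) is taken only when ZF is 0."""
    return zf == 0


def jz_taken(zf: int) -> bool:
    """jz (jump if zero) is taken only when ZF is 1."""
    return zf == 1


zf = cmp_zf(0x10, 0x10)   # cmp eax, ecx with both registers holding 0x10
print(zf)                 # 1
print(jnz_taken(zf))      # False
print(jz_taken(zf))       # True
```

If the intention is to jump when the two values are equal, jz (or its synonym je) is the instruction to use.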
Note: not all flags are used by conditional instructions. For example, the interrupt flag determines whether maskable hardware interrupts are processed or not.
RAM in 32-bit x86 is, as the name would suggest, 32-bit addressable (0x00000000 through 0xffffffff). When writing shellcode it is not common to hard-code addresses into the instructions; they are generally referenced as offsets:
mov [ebp-0x12], eax ; move the value in the eax register into the memory
; location at a -0x12 offset from the value referenced by ebp
We have also seen how we use labels to reference different sections of assembly code instead of hard-coded addresses. The reason we don’t hard-code addresses is that the memory addresses can change on each run of the application, for example when memory is assigned by the kernel or when we are overflowing the stack.
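The benefit of offsets can be shown with a short Python model. The two ebp values below are hypothetical and simply stand in for two different runs of the same program. The helper name effective_address is also made up for this sketch.

```python
def effective_address(base: int, offset: int) -> int:
    """Compute the address referenced by [base+offset], wrapped to 32 bits."""
    return (base + offset) & 0xFFFFFFFF


# Hypothetical values of ebp on two separate runs of the same application.
for ebp in (0x0019FF40, 0x0061FE98):
    print(hex(ebp), "->", hex(effective_address(ebp, -0x12)))
# 0x19ff40 -> 0x19ff2e
# 0x61fe98 -> 0x61fe86
```

The absolute address differs between runs, but the instruction mov [ebp-0x12], eax stays the same because it only encodes the offset.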
The stack is a memory structure used to store arguments, variables and pointers to variables. It is a temporary area in memory and every thread has one. Values are pushed onto the stack and popped off the stack:
push eax ; push the value stored in eax onto the top of the stack
pop ebx ; pop the last value added to the stack into the ebx register
The stack is a last in, first out (LIFO) structure. You can also save and reference values on the stack without pushing or popping, by addressing them relative to esp:
mov eax, [esp+0x4] ; move the value at an offset of 0x04 from esp into eax
There are two important registers associated with the stack:
Just to complicate things the stack looks upside down to us mere humans!
When a value is pushed onto the stack, esp is decremented by 4 bytes (32 bits). When a value is popped from the stack, esp is incremented by 4 bytes.
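The push, pop and esp-relative access described in this section can be sketched as a tiny Python model. It is a simplification: the class name Stack, the 0x100-byte memory size and the starting value of esp are arbitrary choices for illustration, and values are stored in little-endian byte order as x86 does.

```python
import struct


class Stack:
    """Toy model of a 32-bit x86 stack that grows towards lower addresses."""

    def __init__(self, size: int = 0x100):
        self.memory = bytearray(size)
        self.esp = size  # an empty stack starts at the highest address

    def push(self, value: int) -> None:
        self.esp -= 4  # esp is decremented by 4 bytes first
        self.memory[self.esp:self.esp + 4] = struct.pack("<I", value & 0xFFFFFFFF)

    def pop(self) -> int:
        (value,) = struct.unpack_from("<I", self.memory, self.esp)
        self.esp += 4  # esp is incremented by 4 bytes afterwards
        return value

    def peek(self, offset: int) -> int:
        """Read [esp+offset] without changing esp."""
        (value,) = struct.unpack_from("<I", self.memory, self.esp + offset)
        return value


stack = Stack()
stack.push(0x11111111)
stack.push(0x22222222)
print(hex(stack.esp))        # 0xf8
print(hex(stack.peek(0x4)))  # 0x11111111
print(hex(stack.pop()))      # 0x22222222
print(hex(stack.esp))        # 0xfc
```

The last value pushed is the first one popped, and peek shows how [esp+0x4] reaches the previous value without moving esp.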
I have left some details out of the diagram above; for now we don’t need to understand the details of saved return addresses and eip. I assume that if you are reading this, you already know what a buffer overflow is!
The stack is also very important in the x86 Win32 calling convention, which will be discussed in a later article.
Endianness is the order of bytes of data in computer memory. It can be big-endian (BE) or little-endian (LE). The x86 architecture is little-endian. In a little-endian architecture, the least-significant byte is stored at the smallest address. This is important when it comes to assigning data to memory and will be discussed further when we start writing shellcode.
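As a quick illustration, Python's standard struct module can show how the same 32-bit value is laid out in each byte order. The value 0x12345678 and the helper names are arbitrary examples.

```python
import struct


def to_little_endian(value: int) -> bytes:
    """Pack a 32-bit value with the least-significant byte first (as x86 stores it)."""
    return struct.pack("<I", value)


def to_big_endian(value: int) -> bytes:
    """Pack a 32-bit value with the most-significant byte first."""
    return struct.pack(">I", value)


value = 0x12345678
print(to_little_endian(value).hex())  # 78563412
print(to_big_endian(value).hex())     # 12345678
```

In the little-endian result the byte 0x78 sits at the smallest address, which is why multi-byte values appear reversed when viewed in a memory dump.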