JIT compiler for Brainfuck language - optimization of an interpreter PART 1

Hello guys!

About two weeks ago I started a project called "JIT compiler". JIT stands for Just-In-Time: instead of translating the whole program ahead of time, a JIT compiler generates machine code while the program is running and executes it right away. For Brainfuck this means the processor runs native instructions directly, instead of an interpreter decoding every character again and again. This may sound hard, but it will become clearer when I show you how I wrote it in C. I found this project really interesting and I learned a lot while writing it, so I recommend trying a project like this yourself. :) Assuming that you already know what a compiler is and what it does, I will show you the code of my JIT compiler for the Brainfuck language.

The template of the main function looks like this:

	#include <stdio.h>
	#include <stdlib.h>
	#include <sysexits.h>
	#include <stdint.h>
	#include <assert.h>
	#include <string.h>
	#include <sys/mman.h>
	#include "vector.h"
	#include "stack.h"

	int main(int argc, char **argv)
	{
	if (argc != 2)
	{
	fprintf(stderr, "File not specified.\n");
	puts("usage: ./bf_interpreter <filename.bf>");
	return EX_USAGE;
	}

	const char *filename = argv[1];
	FILE *file_with_bf_code = fopen(filename, "r");

	if (file_with_bf_code == NULL)
	{
	perror(filename);
	return EX_NOINPUT;
	}

	const int memory_size = 30000;

	size_t file_size = get_file_size(file_with_bf_code);
	fseek(file_with_bf_code, 0, SEEK_SET);

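	char *source = (char *)malloc(file_size + 1); // one extra byte for the terminating '\0'
	assert(source != NULL);

	size_t read_elements = fread(source, 1, file_size, file_with_bf_code);
	source[read_elements] = '\0'; // strlen() and the jit loop both rely on this terminator
	if (!read_elements)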
	{
	fprintf(stderr, "Brainfuck code not found.\n");
	fclose(file_with_bf_code);
	return EX_DATAERR;
	}

	int number_of_bf_instructions = get_number_of_instructions(source);
	if (number_of_bf_instructions > memory_size)
	{
	fprintf(stderr, "You cannot read more than %d instructions.", memory_size);
	fclose(file_with_bf_code);
	return EX_DATAERR;
	}

	fclose(file_with_bf_code);

	struct bf_state values = bf_init(memory_size);
	jit(&values, source);
	free(source);

	return 0;
	}

This code handles errors (for example, a missing file name or a file with no Brainfuck code), so it is the standard main function from a Brainfuck interpreter. But there are some new functions with new optimizations.
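The main function calls get_file_size, which is not shown in this article. A minimal sketch of such a helper could look like the one below. The name file_size_sketch and the long return type are my assumptions, not the code from the repository. It leaves the position at the end of the file, which would explain why main rewinds with fseek right after the call.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: returns the size of an open file in bytes,
 * or -1 on error. The file position is left at the end of the file,
 * so the caller has to rewind before reading. */
long file_size_sketch(FILE *file)
{
    assert(file != NULL);

    if (fseek(file, 0, SEEK_END) != 0)
        return -1;

    return ftell(file);
}
```

For files opened in text mode the value returned by ftell is not guaranteed to be a byte count on every platform, so opening the file with "rb" is the safer choice for this approach.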

Let's take a look at get_number_of_instructions method first:

	int get_number_of_instructions(char *source)
	{
	const char bf_alphabet[] = "><+-,.[]";
	int number_of_bf_instr = 0;

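	size_t source_length = strlen(source); // measure once instead of on every iteration
	for (size_t i = 0; i < source_length; i++)
	{
	// sizeof - 1 keeps the terminating '\0' out of the searched alphabet
	const char *is_bf_instruction = memchr(bf_alphabet, source[i], sizeof(bf_alphabet) - 1);
	if (is_bf_instruction != NULL)
	{
	number_of_bf_instr++;
	}
	}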

	return number_of_bf_instr;
	}

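This function gives us some protection from index errors. It counts only the eight valid Brainfuck instructions in the source and ignores every other character (comments, whitespace). Each instruction is one byte, so the result is also the size of the Brainfuck instructions in bytes. The memory allocated for Brainfuck is exactly 30000 bytes, and the program refuses sources with more instructions than that.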

	if (number_of_bf_instructions > memory_size)
	{
	fprintf(stderr, "You cannot read more than %d instructions.", memory_size);
	fclose(file_with_bf_code);
	return EX_DATAERR;
	}

The if statement in the main function checks whether the returned number of instructions is less than or equal to the limit of the allocated memory. Let's see the code responsible for that allocation. The bf_init function is what we are looking for.

	struct bf_state bf_init(size_t memory_size)
	{
	struct bf_state result =
	{
	0, (uint8_t *)calloc(memory_size, 1), memory_size // calloc zeroes the cells; malloc would leave them uninitialized
	};
	return result;
	}

When I was writing the interpreter I used a local array of integers as the memory, and that was an engineering mistake. Why? Because local arrays are allocated on the stack and the stack has a limited size. This limit differs between operating systems. If we care about the portability of the code, we have to remember that one OS may give us a 1MB stack while another may give us only 30000 bytes. The solution is very simple: we can allocate the memory for Brainfuck on the heap! And that is what the above function does. But don't kid yourself - the best part of the whole code is the jit function, which is responsible for all the optimization of the interpreter code. Let's dive into it.

	void jit(struct bf_state *state, char *source)
	{
	assert(state != NULL && source != NULL);
	struct vec op_instructions;
	struct stack relocation_table = { .size_of_elements = 0, .elements = {0} }; // relocation_table from asm language
	int relocation_site = 0, relative_offset = 0;

	vector_alloc(&op_instructions);

	char jit_prologue[] =
	{
	0x55, // push rbp
	0x48, 0x89, 0xe5, // mov rbp, rsp
	0x50, // push rax
	0x51, // push rcx
	0x52, // push rdx
	0x48, 0x89, 0xf8, // mov rax, rdi
	0x48, 0x8b, 0x48, 0x08, // mov rcx, [rax+8]
	};
	vector_push_back(&op_instructions, jit_prologue, sizeof(jit_prologue));

	while (source[state->source_ptr] != '\0')
	{
	switch (source[state->source_ptr])
	{
	case '>':
	{
	char jit_inc_ptr[] =
	{
	0x48, 0xff, 0xc1 // inc rcx
	};
	vector_push_back(&op_instructions, jit_inc_ptr, sizeof(jit_inc_ptr));
	}
	break;
	case '<':
	{
	char jit_dec_ptr[] =
	{
	0x48, 0xff, 0xc9 // dec rcx
	};
	vector_push_back(&op_instructions, jit_dec_ptr, sizeof(jit_dec_ptr));
	}
	break;
	case '+':
	{
	char jit_inc_value[] =
	{
	0xfe, 0x01 // inc byte [rcx]
	};
	vector_push_back(&op_instructions, jit_inc_value, sizeof(jit_inc_value));
	}
	break;
	case '-':
	{
	char jit_dec_value[] =
	{
	0xfe, 0x09 // dec byte [rcx]
	};
	vector_push_back(&op_instructions, jit_dec_value, sizeof(jit_dec_value));
	}
	break;
	case '.':
	{
	char jit_putchar[] =
	{
	// using linux syscall write
	0xba, 0x01, 0x00, 0x00, 0x00, // mov edx, 1 (number of bytes)
	0xbb, 0x01, 0x00, 0x00, 0x00, // mov ebx, 1 (stdout)
	0xb8, 0x04, 0x00, 0x00, 0x00, // mov eax, 4 (sys_write)
	0xcd, 0x80, // int 0x80
	};
	vector_push_back(&op_instructions, jit_putchar, sizeof(jit_putchar));
	}
	break;
	case ',':
	{
	char jit_getchar[] =
	{
	// using linux syscall read
	0xba, 0x01, 0x00, 0x00, 0x00, // mov edx, 1 (number of bytes)
	0xbb, 0x00, 0x00, 0x00, 0x00, // mov ebx, 0 (stdin)
	0xb8, 0x03, 0x00, 0x00, 0x00, // mov eax, 3 (sys_read)
	0xcd, 0x80, // int 0x80
	};
	vector_push_back(&op_instructions, jit_getchar, sizeof(jit_getchar));
	}
	break;
	case '[':
	{
	char jit_start_loop[] =
	{
	0x80, 0x39, 0x00, // cmp byte [rcx], 0
	0x0F, 0x84, 0x00, 0x00, 0x00, 0x00 // je <relative offset>
	};
	vector_push_back(&op_instructions, jit_start_loop, sizeof(jit_start_loop));
	}
	stack_push(&relocation_table, op_instructions.size);
	break;
	case ']':
	{
	char jit_end_loop[] =
	{
	0x80, 0x39, 0x00, // cmp byte [rcx], 0
	0x0F, 0x85, 0x00, 0x00, 0x00, 0x00 // jne <relative offset>
	};
	vector_push_back(&op_instructions, jit_end_loop, sizeof(jit_end_loop));
	}
	stack_pop(&relocation_table, &relocation_site);
	relative_offset = op_instructions.size - relocation_site;
	vector_jmp_update(&op_instructions, op_instructions.size - 4, -relative_offset);
	vector_jmp_update(&op_instructions, relocation_site - 4, relative_offset);
	break;
	}
	state->source_ptr++;
	}

	char jit_epilogue[] =
	{
	0x5a, // pop rdx
	0x59, // pop rcx
	0x58, // pop rax
	0x5d, // pop rbp
	0xc3 // ret
	};
	vector_push_back(&op_instructions, jit_epilogue, sizeof(jit_epilogue));

	void *jit_mem = mmap(NULL, op_instructions.size, PROT_READ | PROT_WRITE | PROT_EXEC, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	assert(jit_mem != MAP_FAILED); // mmap reports failure with MAP_FAILED, not NULL
	memcpy(jit_mem, op_instructions.data, op_instructions.size);
	void (*execute_func)(struct bf_state *) = jit_mem;

	execute_func(state);
	munmap(jit_mem, op_instructions.size);
	vector_delete(&op_instructions);
	}

This function takes two parameters: the first is a pointer to the object with the Brainfuck memory and the second is a pointer to the Brainfuck code read from the text file.

	struct vec op_instructions;
	struct stack relocation_table = { .size_of_elements = 0, .elements = {0} };

It's logical that when we write any compiler we are going to produce some machine code, so we need some memory to store those instructions. Our memory will be a vector - the same data structure as std::vector in C++, but implemented by me in C. All the code of the vector is in the vector.h and vector.c files on my GitHub. So the first of the above lines is a declaration of the memory that will store the future machine code. The second line is the initialization of a simple stack data structure, which stores the position of each [ Brainfuck instruction in the generated code.

If you want to see the code of the vector and stack data structures, you can take a look at my GitHub.
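To give an idea of what such a vector has to do, here is a minimal sketch of a growable byte buffer. It is not the code from vector.h: the names code_buf, code_buf_init, code_buf_append and code_buf_free, the starting capacity and the doubling strategy are all my assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable byte buffer for machine code. */
struct code_buf {
    size_t size;     /* number of bytes in use */
    size_t capacity; /* number of bytes allocated */
    uint8_t *data;
};

/* Returns 0 on success, -1 if the allocation fails. */
int code_buf_init(struct code_buf *buf)
{
    assert(buf != NULL);
    buf->size = 0;
    buf->capacity = 64;
    buf->data = malloc(buf->capacity);
    return buf->data != NULL ? 0 : -1;
}

/* Appends len bytes, doubling the capacity until they fit.
 * Returns 0 on success, -1 if the reallocation fails. */
int code_buf_append(struct code_buf *buf, const void *bytes, size_t len)
{
    assert(buf != NULL && bytes != NULL);

    if (buf->size + len > buf->capacity) {
        size_t new_capacity = buf->capacity;
        while (new_capacity < buf->size + len)
            new_capacity *= 2;

        uint8_t *grown = realloc(buf->data, new_capacity);
        if (grown == NULL)
            return -1;

        buf->data = grown;
        buf->capacity = new_capacity;
    }

    memcpy(buf->data + buf->size, bytes, len);
    buf->size += len;
    return 0;
}

void code_buf_free(struct code_buf *buf)
{
    assert(buf != NULL);
    free(buf->data);
    buf->data = NULL;
    buf->size = 0;
    buf->capacity = 0;
}
```

The important property for a JIT is that size always equals the offset at which the next instruction will land, because the loop handling later uses it as an address inside the generated code.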

A JIT compiler is all about machine code and executable memory. The simplest recipe for writing a JIT compiler is:

1) Take the assembled code from objdump and put it into a dynamic array.

2) Copy the content of this array to executable memory.

3) Execute the machine code from this memory.
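The three steps can be tried in isolation with a tiny piece of machine code. The sketch below assumes x86-64 Linux, and the names RETURN_42 and run_machine_code are made up for this example. Unlike the jit function in this article, it maps the memory as writable first and switches it to executable with mprotect afterwards, which is a variation I chose because some systems refuse memory that is writable and executable at the same time.

```c
#define _DEFAULT_SOURCE /* exposes MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

/* Step 1: x86-64 machine code for "mov eax, 42 ; ret". */
static const uint8_t RETURN_42[] = { 0xb8, 0x2a, 0x00, 0x00, 0x00, 0xc3 };

/* Copies len bytes of machine code into fresh memory, runs them as
 * int f(void) and returns the result, or -1 if the memory setup fails. */
int run_machine_code(const uint8_t *code, size_t len)
{
    assert(code != NULL && len > 0);

    /* Step 2: copy the bytes into memory that will become executable. */
    void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return -1;
    memcpy(mem, code, len);

    if (mprotect(mem, len, PROT_READ | PROT_EXEC) != 0) {
        munmap(mem, len);
        return -1;
    }

    /* Step 3: treat the memory as a function and call it. */
    int (*fn)(void) = (int (*)(void))mem;
    int result = fn();

    munmap(mem, len);
    return result;
}
```

Calling run_machine_code(RETURN_42, sizeof(RETURN_42)) should give back the constant that the generated code puts into eax.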

And we have a great optimization of the code! So as you can see this is not rocket science (but you have to admit that "JIT compiler" sounds pretty scary ;) ).

"Take the assembled code from objdump"... ok, but we need to write the instructions first. My last article was about the stack frame - the basic element of each function in our programs - so I assume that you know what the stack frame is and why it is useful. I used this "structure" in the prologue because our memory with the machine code will basically be a function which the processor will execute. So I constructed the machine code like a function with a stack frame, and in addition we have to save the state of the general purpose registers that we use. The memory_segment variable in the state structure is the pointer to all Brainfuck values, so we will use it in the machine code to execute the instructions correctly. We simply pass the address of the values structure to our function (it arrives in rdi, which the prologue copies to rax) and load memory_segment with mov rcx, [rax+8].

So this is the code of the prologue:

	char jit_prologue[] =
	{
	0x55, // push rbp
	0x48, 0x89, 0xe5, // mov rbp, rsp
	0x50, // push rax
	0x51, // push rcx
	0x52, // push rdx
	0x48, 0x89, 0xf8, // mov rax, rdi
	0x48, 0x8b, 0x48, 0x08 // mov rcx, [rax+8]
	};
	vector_push_back(&op_instructions, jit_prologue, sizeof(jit_prologue));
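The load from [rax+8] only works if the pointer to the cells really sits 8 bytes into the structure. The definition of struct bf_state is not shown in this article, so the layout below is an assumption based on the initializer in bf_init (source position, cell pointer, size). The names bf_state_sketch and load_cells_like_prologue are hypothetical. The helper reads the pointer the same way the prologue does, by byte offset instead of by field name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout; the real struct bf_state may differ. */
struct bf_state_sketch {
    int source_ptr;          /* index of the current source character */
    uint8_t *memory_segment; /* the Brainfuck cells */
    size_t memory_size;
};

/* Mimics "mov rcx, [rax+8]": reads a pointer stored 8 bytes
 * after the start of the structure. */
uint8_t *load_cells_like_prologue(const struct bf_state_sketch *state)
{
    assert(state != NULL);

    uint8_t *cells;
    memcpy(&cells, (const unsigned char *)state + 8, sizeof(cells));
    return cells;
}
```

On a typical 64-bit Linux target the int is followed by 4 bytes of padding so that the pointer is aligned to 8 bytes, which is what makes the hard-coded offset match. Checking it with offsetof is cheap insurance if the structure ever changes.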

Since we have the pointer to the memory with the Brainfuck values, we can express all Brainfuck instructions in machine code. Let's see what we can do:

	case '>':
	{
	char jit_inc_ptr[] =
	{
	0x48, 0xff, 0xc1 // inc rcx
	};
	vector_push_back(&op_instructions, jit_inc_ptr, sizeof(jit_inc_ptr));
	}
	break;

The above instruction is as simple as possible. inc rcx increments the pointer so that it points to the next value in the memory. That's all.

	case '<':
	{
	char jit_dec_ptr[] =
	{
	0x48, 0xff, 0xc9 // dec rcx
	};
	vector_push_back(&op_instructions, jit_dec_ptr, sizeof(jit_dec_ptr));
	}
	break;

< means decrement pointer and this is exactly what this instruction does by dec rcx.

	case '+':
	{
	char jit_inc_value[] =
	{
	0xfe, 0x01 // inc byte [rcx]
	};
	vector_push_back(&op_instructions, jit_inc_value, sizeof(jit_inc_value));
	}
	break;

Square brackets in assembly are the same as the * dereference in C. So the above code increments the value (its size is 1 byte, of course) that the pointer stored in the rcx register points to. Simply put, it is equal to (*memory_segment)++;

	case '-':
	{
	char jit_dec_value[] =
	{
	0xfe, 0x09 // dec byte [rcx]
	};
	vector_push_back(&op_instructions, jit_dec_value, sizeof(jit_dec_value));
	}
	break;

This is the same as incrementing the value, but now the instruction decrements it. Same as (*memory_segment)--;

	case '.':
	{
	char jit_putchar[] =
	{
	// using linux syscall write
	0xba, 0x01, 0x00, 0x00, 0x00, // mov edx, 1 (number of bytes)
	0xbb, 0x01, 0x00, 0x00, 0x00, // mov ebx, 1 (stdout)
	0xb8, 0x04, 0x00, 0x00, 0x00, // mov eax, 4 (sys_write)
	0xcd, 0x80, // int 0x80
	};
	vector_push_back(&op_instructions, jit_putchar, sizeof(jit_putchar));
	}
	break;

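As you can see I decided to use Linux syscalls to print the char to stdout. It could be done with the printf function, but then you have to pass the address of this function to the generated code and "tell" the processor to call it. Note that int 0x80 is the 32-bit syscall interface: the arguments go in ebx, ecx and edx, so the cell address is taken from the low 32 bits of rcx and this only works as long as that address fits in 32 bits.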

If you are not familiar with the Linux syscalls you can read more about it here

	case ',':
	{
	char jit_getchar[] =
	{
	// using linux syscall read
	0xba, 0x01, 0x00, 0x00, 0x00, // mov edx, 1 (number of bytes)
	0xbb, 0x00, 0x00, 0x00, 0x00, // mov ebx, 0 (stdin)
	0xb8, 0x03, 0x00, 0x00, 0x00, // mov eax, 3 (sys_read)
	0xcd, 0x80, // int 0x80
	};
	vector_push_back(&op_instructions, jit_getchar, sizeof(jit_getchar));
	}
	break;

The same situation as earlier, but in this case we want to get a char from the user. The ideal solution is the read syscall on stdin. Pay attention to the size of the general purpose registers. When we move these small constants into registers we can use the 32-bit names (edx, ebx, eax), because each constant fits in 32 bits and writing a 32-bit register clears the upper half of the 64-bit one.
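In C the same two operations are plain calls to write and read with a length of 1. The sketch below is only a model of what the emitted syscalls ask the kernel to do. The function names are hypothetical, and the file descriptor is a parameter here (the generated code hard-codes 1 for stdout and 0 for stdin) so that the functions are easy to try out with a pipe.

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

/* '.' : writes the current cell to fd. Returns 1 on success, -1 on error. */
ssize_t bf_write_cell(int fd, const uint8_t *cell)
{
    assert(cell != NULL);
    return write(fd, cell, 1);
}

/* ',' : reads one byte from fd into the current cell.
 * Returns 1 on success, 0 on end of input, -1 on error. */
ssize_t bf_read_cell(int fd, uint8_t *cell)
{
    assert(cell != NULL);
    return read(fd, cell, 1);
}
```

The generated machine code ignores the return value, so on end of input the cell simply keeps its old value.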

Now I recommend you focus, because I will show you how loops are implemented. This is the code for the start of a loop, which is the [ Brainfuck instruction:

	case '[':
	{
	char jit_start_loop[] =
	{
	0x80, 0x39, 0x00, // cmp byte [rcx], 0
	0x0F, 0x84, 0x00, 0x00, 0x00, 0x00 // je <relative offset>
	};
	vector_push_back(&op_instructions, jit_start_loop, sizeof(jit_start_loop));
	}
	stack_push(&relocation_table, op_instructions.size);
	break;

As you know, Brainfuck loops are really simple, but first I will explain my implementation of [. This JIT compiler is all about direct jumps. This functionality gives us a big advantage over the interpreter. The jump should be taken when the checked memory cell is 0. So we need two instructions: cmp byte [rcx], 0 compares the byte in the memory cell with 0, and je jumps to the code right after the matching ] Brainfuck instruction if they are equal. This is exactly the content of our char array. Then this array of two machine code instructions is appended to op_instructions (which is our vector data structure).
But wait... How do we compute the offset of the ] instruction?! This is the hardest part of the whole problem. First of all, you need to know that the whole machine code is generated and placed in the executable memory before it is executed - and that's why we can fill in the offset "later". To compute the offset we have to know the "start" address and the "final" address. Look at this picture:

To compute the relative offset we subtract THE LOWER address from THE HIGHER address, and the result is the distance between them. That's why, right after emitting the je, we push the current size of op_instructions onto the stack: it is the index (index == address) of the first byte after the je instruction, and the 4-byte offset field (still filled with 0x00) sits just before it. We have to save this index somewhere, and the stack is the perfect structure for that because loops can be nested.
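Why a stack and not a single variable? Because the most recently opened [ is always the first one to be closed. The sketch below shows that idea on its own, without any machine code: it pairs up brackets in a Brainfuck source. The function name match_brackets, the fixed MAX_DEPTH and the output format are assumptions made for this example, not part of the stack.h from the repository.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPTH 256 /* arbitrary nesting limit for this sketch */

/* For every position i in source, stores in match[i] the index of the
 * bracket paired with source[i], or -1 if source[i] is not a bracket.
 * match must have room for one int per source character.
 * Returns 0 on success, -1 if the brackets are unbalanced or too deep. */
int match_brackets(const char *source, int *match)
{
    assert(source != NULL && match != NULL);

    size_t stack[MAX_DEPTH];
    size_t depth = 0;

    for (size_t i = 0; source[i] != '\0'; i++) {
        match[i] = -1;
        if (source[i] == '[') {
            if (depth == MAX_DEPTH)
                return -1;
            stack[depth++] = i;          /* remember where the loop starts */
        } else if (source[i] == ']') {
            if (depth == 0)
                return -1;               /* ']' without an open '[' */
            size_t open = stack[--depth]; /* innermost open loop */
            match[open] = (int)i;
            match[i] = (int)open;
        }
    }
    return depth == 0 ? 0 : -1;
}
```

The JIT does the same thing, except that it pushes offsets into the generated code instead of positions in the source text.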

	case ']':
	{
	char jit_end_loop[] =
	{
	0x80, 0x39, 0x00, // cmp byte [rcx], 0
	0x0F, 0x85, 0x00, 0x00, 0x00, 0x00 // jne <relative offset>
	};
	vector_push_back(&op_instructions, jit_end_loop, sizeof(jit_end_loop));
	}
	stack_pop(&relocation_table, &relocation_site);
	relative_offset = op_instructions.size - relocation_site;
	vector_jmp_update(&op_instructions, op_instructions.size - 4, -relative_offset);
	vector_jmp_update(&op_instructions, relocation_site - 4, relative_offset);
	break;

These two instructions are clear - cmp byte [rcx], 0 followed by jne (relative_offset) means: when the value in the memory cell is not 0, jump back to the instruction after [. As I wrote above, we need the "start" address and the "final" address to compute the relative offset (the number of bytes to jump). This is what we have:

start address -> the index saved on the stack when [ was compiled, which is the end of the je instruction at the start of the loop
final address -> the current size of op_instructions, which is the end of the jne instruction emitted for ]

The start address is popped from relocation_table (our stack structure, which plays the role of a relocation table) into relocation_site, and the final address is op_instructions.size. The number of bytes to jump is:

relative_offset = final_address - start_address;

So in our program it is:

relative_offset = op_instructions.size - relocation_site;

Once we know how many bytes to jump, we can simply write the relative offsets into the machine code of both jump instructions: the je gets +relative_offset (forward, past the loop) and the jne gets -relative_offset (backward, to the loop body). After this, the processor will execute our conditional jumps correctly.

	vector_jmp_update(&op_instructions, op_instructions.size - 4, -relative_offset);
	vector_jmp_update(&op_instructions, relocation_site - 4, relative_offset);

The above calls to vector_jmp_update from the vector.c file are responsible for updating the relative offsets. Each relative offset has to be stored in memory as a 32-bit little-endian value. I will describe the rest of the files from my GitHub repository in the next article.
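Since vector_jmp_update itself is not shown here, the following is a sketch of what such patching can look like on a plain byte array. The names patch_rel32 and patch_loop are hypothetical. The offset is written byte by byte, least significant byte first, so the result is little endian no matter which machine runs the compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stores a signed 32-bit offset at code[pos..pos+3], lowest byte first. */
void patch_rel32(uint8_t *code, size_t pos, int32_t value)
{
    uint32_t bits = (uint32_t)value;
    for (int i = 0; i < 4; i++)
        code[pos + i] = (uint8_t)(bits >> (8 * i));
}

/* Patches both jumps of one loop.
 * site: buffer size right after the je for '[' was emitted
 * end:  buffer size right after the jne for ']' was emitted */
void patch_loop(uint8_t *code, size_t site, size_t end)
{
    assert(code != NULL && site >= 4 && end >= site + 4);

    int32_t relative_offset = (int32_t)(end - site);
    patch_rel32(code, end - 4, -relative_offset); /* jne: back to the loop body */
    patch_rel32(code, site - 4, relative_offset); /* je: forward, past the loop */
}
```

As a worked example, take the program [-]. The prologue is 14 bytes, the code for [ is 9 bytes, so relocation_site is 23. The - adds 2 bytes and the ] adds 9 more, so the final size is 34 and relative_offset is 34 - 23 = 11. The je gets the bytes 0b 00 00 00 and the jne gets f5 ff ff ff, which is -11.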

The epilogue (jit_epilogue in the jit function above) restores the registers to their state from before the execution of the generated code, removes the stack frame and returns to the caller through the return address. It is the exit from the JIT-compiled code.

	void *jit_mem = mmap(NULL, op_instructions.size, PROT_READ | PROT_WRITE | PROT_EXEC, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	assert(jit_mem != MAP_FAILED); // mmap reports failure with MAP_FAILED, not NULL
	memcpy(jit_mem, op_instructions.data, op_instructions.size);
	void (*execute_func)(struct bf_state *) = jit_mem;

	execute_func(state);

We have all the machine code in the op_instructions data structure, but this is just raw data without any behavior. So we must "tell" the processor to execute these instructions. The processor executes instructions only from EXECUTABLE memory. The above lines of code are responsible for:

1) Creating the executable memory
2) Copying the machine code from raw memory to executable memory
3) Executing it

When we call execute_func (the pointer to the executable memory cast to a function pointer) the whole machine code is executed... And this is a JIT compiler!

	munmap(jit_mem, op_instructions.size);
	vector_delete(&op_instructions);

The last instructions release the memory that we allocated earlier: munmap frees the executable mapping and vector_delete frees the vector. This protects us from a memory leak.

And this is the big difference between the linked-list based Brainfuck interpreter and the JIT compiler based on direct jumps:

Link to the whole code of the program: GitHub
