Monday, October 22, 2012

Exception Driven "Debugging": Getting behind anti debugging tricks.

Of course, all debugging is exception driven, if only because a breakpoint generates a debug exception which is passed to the debugger. In this article, however, I will refer to regular exceptions.

There are tens if not hundreds of software protectors used by software vendors around the globe. Some are good, some are less good; in either case, vendors rarely use them properly, thinking that simply enabling the anti-debugging features provided by the protector of their choice is enough. I have seen it myself - a widely known commercial application, protected using Themida (which is one of the most complicated protectors), remains SOOO unprotected that Themida is not even noticed during the extraction of relatively sensitive information using the application itself.

However, the purpose of this article is not to discuss pros and cons of Themida or any other protector, nor do I have any intention to disgrace any of the software vendors. The purpose is to describe a relatively easy way of bypassing common anti debugging tricks (including Windows DRM protection)  with DLL injection.

As the term "anti debugging" states - such methods target modern debuggers. There are several commonly known tricks:
  1. IsDebuggerPresent() - you would be surprised to know how many vendors rely on this API alone;
  2. Additional methods of debugger presence detection;
  3. IAT modification - which is not really worth trying;
  4. Redirection of debugging API (e.g. to an infinite loop).
  5. And some more.
Point #4 does not let you implement your own debugger in the hope that it would not be noticed by the victim program (many beginners give up at this point).

Point #3 - how much can you modify the IAT? I mean, system loader has to still be able to parse it, thus, if system loader can - everyone can.

Point #1 is not even worth further mentioning here.

In this article I am going to describe a simple way (although, some may cry and say it is a hard way) to get around most of anti debugging tricks without even noticing their presence by implementing a simple pseudo debugger dll, which is to be injected into the target process.


Step #1. Preparations

In order to use any debugger, you have to know where to set your breakpoints. Otherwise, the whole process is meaningless. But how can you define proper locations if the executable on disk is encrypted (e.g. with Themida) and you still cannot attach a debugger to see what is going on inside?

The solution is quite simple. Simple indeed. Windows provides us with all the instruments to read the memory of another process (given that you have sufficient access rights) with OpenProcess(), ReadProcessMemory() and NtQueryInformationProcess() API functions. Using those, you can simply dump the decrypted executable and any of its modules (DLLs) to a separate file on disk.

NtQueryInformationProcess() provides you with the address of the PEB (see this post for more information on PEB) of the target process. Then you simply parse the linked list of loaded modules, get the base address (module handle) and the image size for each, then use ReadProcessMemory to copy the image to a file. One complication, though, you will have to use ReadProcessMemory in order to access the PEB of the remote process.

Once you have dumped the target image to a file, such file can be easily loaded into IDA Pro, disassembled and researched statically.


Step #2. Injector and DLL

I do not see any reason to describe the DLL injection process here, as it has been described many times, even in this blog. You are free to use standard injection method, advanced DLL injection method or use this method if you have problems with the two previously mentioned.


DllMain()

It is suggested not to perform any heavy action in this function, however, we do not really have a choice (although, you can launch a separate thread). First thing to do is to suspend all running threads (except the current one of course). The problem is that Windows has no API function that would allow you to enumerate threads of a single process, instead, it lets you go through all the threads in the system. See MSDN pages for Thread32First and Thread32Next - there should be a perfect example of getting threads of the current process. Once all the threads are suspended, you are ready to proceed.


Installation of breakpoints 

No, we are not going to use regular 0xCC software breakpoints, neither are we going to make any use of hardware breakpoints here. Instead, we are going to place an instruction that raises an exception at the location of the desired breakpoint. To keep such an instruction short and to avoid changing the values of the registers, 'AAM 0' seems to be a perfect candidate. It only takes two bytes, 0xD4 0x00, and raises the EXCEPTION_INT_DIVIDE_BY_ZERO exception (exception code 0xC0000094).

Use the VirtualProtect() function to change the access rights of the target address, so you can alter its content, back up the original two bytes from that address and overwrite them with the word 0x00D4 (stored in memory as the bytes 0xD4 0x00):

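VirtualProtect((LPVOID)target, 2, PAGE_EXECUTE_READWRITE, (PDWORD)&prevProtect); // affects every page that holds one of the two bytes
original = *((unsigned short*)target); // back up the original two bytes ('original' is an unsigned short of your own)
*((unsigned short*)target) = 0x00D4;   // AAM 0
VirtualProtect((LPVOID)target, 2, prevProtect, (PDWORD)&prevProtect);
FlushInstructionCache(GetCurrentProcess(), (LPCVOID)target, 2);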
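
The backup and overwrite step can be kept in a small record per breakpoint, so that the original bytes are at hand later. The following is only a sketch on a plain byte buffer: the pseudo_bp_t type and both function names are made up for illustration, and the memory is assumed to be writable already (in the injected DLL that is what the VirtualProtect() calls above are for).

```c
#include <assert.h>
#include <string.h>

/* Hypothetical bookkeeping record for one pseudo breakpoint. */
typedef struct
{
   unsigned char* address;     /* where the two bytes were replaced */
   unsigned char  original[2]; /* the bytes that used to be there */
} pseudo_bp_t;

/* Encoding of 'AAM 0'. */
static const unsigned char AAM_0[2] = { 0xD4, 0x00 };

/* Save the two original bytes at 'address' and overwrite them with 'AAM 0'.
   The caller is assumed to have made the memory writable beforehand. */
void pseudo_bp_set(pseudo_bp_t* bp, unsigned char* address)
{
   assert(bp != NULL && address != NULL);
   bp->address = address;
   memcpy(bp->original, address, sizeof bp->original);
   memcpy(address, AAM_0, sizeof AAM_0);
}

/* Put the saved bytes back. */
void pseudo_bp_remove(const pseudo_bp_t* bp)
{
   assert(bp != NULL && bp->address != NULL);
   memcpy(bp->address, bp->original, sizeof bp->original);
}
```

Keeping the original bytes in such a record also helps in the handler, where the replaced instruction has to be emulated.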

Now the victim process is almost ready to be continued. One thing left - exception handler. We will use vectored exception handling mechanism as it allows our handler to be (at least among) the first to handle an exception. Once the handler has been added with AddVectoredExceptionHandler(), you may resume the suspended threads of the process.



Handler

One important thing to do once your handler gets control, is to check for the address where the exception occurred and for the exception code, as we have no intention to deal with irrelevant exceptions:

LONG CALLBACK handler(PEXCEPTION_POINTERS ep)
{
   // Anything that is not our pseudo breakpoint is passed on to the next handler
   if(ep->ExceptionRecord->ExceptionCode != EXCEPTION_INT_DIVIDE_BY_ZERO ||
      ep->ContextRecord->Eip != target)
   {
      // Optionally log other exceptions
      return EXCEPTION_CONTINUE_SEARCH;
   }

   // Do your stuff here
   return EXCEPTION_CONTINUE_EXECUTION;
}


Your Stuff

One of the parameters you get with your handler is the pointer to the CONTEXT structure, which provides you with the content of all the registers at the time of the exception. Needless to mention, that you have the access to the process' memory as well. Just as you were in a debugger with the only difference - you have to implement the routine that would show you the data you are interested in. Do not forget to emulate the original instruction replaced by the pseudo breakpoint and advance the Eip accordingly before returning from handler.

One more thing to mention - it may be a good idea to suspend all other threads of the victim process while in the 'your stuff' portion of the handler.


Stability

I am not claiming this method to be bullet proof and I am more than sure (I simply know) that there are ways to defeat it; however, personally, I have not yet met such software. In addition, this method is tested and stable.


Hope this article was helpful. See you at the next.

P.S. Lazy guys, nerds, etc., do not cry for sources. This method is really simple. Besides, if copy/paste is the only programming technique you are aware of, then, probably, this blog is not the right place for you.

Thursday, October 18, 2012

Method of Computer Virus Detection. Sad story of a patent application

It was quite a long time ago (an epoch ago in terms of software development). Around the end of 2005 and beginning of 2006. I was then working for Aladdin Knowledge Systems' eSafe unit as a computer virus researcher (my first formal RE job). Detection methods were quite poor at that time, even heuristic ones (not that they are THAT good these days). There was quite a lot of noise about the Morphine scrambler at that time and I was responsible for finding a proper solution for that issue by developing a reliable detection method.

I have to admit - Morphine was quite an advanced scrambler at that time. A masterpiece, I should say. Standard methods, at least those used by eSafe at that time, did not work and required some changes to be made to the engine.

As this was about the only task assigned to me at that period, I decided to play a bit more with Morphine while waiting for the aforementioned changes to be made. 

It was so easy to identify Morphine's code by eye, but, somehow I could not fit the pattern into any programmatic method (of those used at the time, as I said). Well, there are plenty of expert systems and neural networks that mimic the path of decision making as it happens in our mind and there were such systems at the time. However, I was not yet aware of those and those I heard about looked quite complicated.

My decision was to try and build a simple system capable of recognizing logic patterns in the code. It is quite obvious that different implementations of the same algorithm share the same logic, which appears in the form of at least opcode sequences, although the overall binary representation may be different even if you replace one register with another. This led me to the simple system described below.

It is important to mention that all of the following information is publicly available, so I do not violate any NDA or whatsoever.

Code Generalization
Our mind generalizes the disassembled code by extracting the relevant logic information. But how to do that in software? The solution is easier than I initially expected. I simply had to sort the opcodes by categories, assigning a numeric value to each category. For example, let's take three categories - stack, bitwise and flow control operations.  The following example shows two pieces of code, that are different on a binary level, but are completely identical logically:

Code #1              Code #2                     Generalized form
push  eax            push  edx                   0x0001
xor   eax, eax       xor   edx, edx              0x0002
pop   ebx            pop   ecx                   0x0001
ret                  jmp   dword[esp]            0x0003

As you can see - the two code snippets are identical logically, but are quite different if you try to compare them in compiled form. However, if you generalize those snippets, you will get the same result from both.

This is really a basic explanation of the system. Besides, it has evolved since that time.
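
To make the idea a bit more tangible, here is a minimal sketch of such a generalization step. It is not the original engine: it works on text mnemonics rather than decoded opcodes, the category table is a tiny hypothetical one covering only the three categories from the example, and the names categorize() and generalize() are made up. Each line is assumed to start with the mnemonic (no leading whitespace).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical category codes, matching the three-category example above. */
enum { CAT_UNKNOWN = 0x0000, CAT_STACK = 0x0001, CAT_BITWISE = 0x0002, CAT_FLOW = 0x0003 };

/* Hypothetical lookup table: mnemonic -> category. A real system would
   classify decoded opcodes and cover the whole instruction set. */
static const struct { const char* mnemonic; unsigned short category; } category_table[] = {
   { "push", CAT_STACK },   { "pop", CAT_STACK },
   { "xor",  CAT_BITWISE }, { "and", CAT_BITWISE }, { "or",   CAT_BITWISE },
   { "ret",  CAT_FLOW },    { "jmp", CAT_FLOW },    { "call", CAT_FLOW },
};

/* Return the category of one line of disassembly.
   Only the mnemonic (the first word) is inspected; operands are ignored. */
unsigned short categorize(const char* line)
{
   size_t len, i;
   assert(line != NULL);
   len = strcspn(line, " \t"); /* length of the mnemonic */
   for(i = 0; i < sizeof category_table / sizeof category_table[0]; i++)
   {
      if(strlen(category_table[i].mnemonic) == len &&
         strncmp(line, category_table[i].mnemonic, len) == 0)
         return category_table[i].category;
   }
   return CAT_UNKNOWN;
}

/* Generalize 'count' lines of disassembly into 'out', one category per line. */
void generalize(const char* const* lines, size_t count, unsigned short* out)
{
   size_t i;
   for(i = 0; i < count; i++)
      out[i] = categorize(lines[i]);
}
```

Feeding the two snippets from the table above through generalize() yields the same array of categories for both, because the operands (and therefore the registers) never take part in the classification.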

Automatic Signature Generation
The most pleasant thing about this system was its ability to extract signatures automatically. At that time, only two samples of the same malware were needed, right now - one is more than enough. However, let me concentrate on the method as it was initially presented.

As I mentioned - there was a need for two samples of the same malware. Their executable content was then generalized using the system described above into a couple of arrays of extracted categories which were compared one to another and all similarities were put into a separate list of potential signatures. Why potential? Just because at that stage any of them could be a signature of "legal" logic which might be found in any executable (e.g. library routine).

In order to eliminate such "false" signatures, the list was applied to a set of "clean" files and each potential signature found in any clean file was removed.
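
The two stages described above (collect what both samples share, then drop whatever also shows up in clean files) can be sketched roughly as follows. This is a simplification under my own assumptions, not the patented comparison algorithm: signatures are fixed-length windows of SIG_LEN categories, there is a single "clean" sequence instead of a set of files, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SIG_LEN 4 /* hypothetical signature length, in categories */

/* Return 1 if the SIG_LEN-long window 'sig' occurs anywhere in 'seq'. */
int contains(const unsigned short* seq, size_t seq_len, const unsigned short* sig)
{
   size_t i;
   assert(seq != NULL && sig != NULL);
   for(i = 0; i + SIG_LEN <= seq_len; i++)
      if(memcmp(seq + i, sig, SIG_LEN * sizeof *sig) == 0)
         return 1;
   return 0;
}

/* Store in 'out' the offsets into sample 'a' of every window that also
   occurs in sample 'b' but does not occur in the clean sequence.
   Returns the number of offsets stored (at most 'max_out'). */
size_t extract_signatures(const unsigned short* a, size_t a_len,
                          const unsigned short* b, size_t b_len,
                          const unsigned short* clean, size_t clean_len,
                          size_t* out, size_t max_out)
{
   size_t i, n = 0;
   for(i = 0; i + SIG_LEN <= a_len && n < max_out; i++)
   {
      if(!contains(b, b_len, a + i))
         continue; /* not shared by both samples */
      if(contains(clean, clean_len, a + i))
         continue; /* "legal" logic, e.g. a library routine */
      out[n++] = i;
   }
   return n;
}
```

A window that repeats inside the first sample is reported once per occurrence; a real implementation would also deduplicate and would run the elimination against every clean file.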

Efficiency
The very first test results showed that Morphine, a masterpiece of polymorphism, may be recognized with a single logic signature (and I tested it on thousands of files scrambled with Morphine). Needless to say, the efficiency was as good for at least 95% of malware known at that time. Basically, that meant that the database of several tens of megabytes could be replaced with a list of several kilobytes.

What's sad about this?
My employer at that time - Aladdin Knowledge Systems applied for a patent. Several years later (I was working for some other firm already) I came to know that the application was denied by USPTO.  The reason was quite surprising... As I had a chance to read the correspondence of the patent attorney and the examiner, I discovered that the application was denied based on the comparison algorithms used to compare the sequences of categories, which had TOTALLY NOTHING to do with the idea itself, which was about the preprocessing of data (extraction of logic patterns)... Somehow, this method (despite all the excitement) never got implemented in the product either...

For those interested, the application may be found here.









Tuesday, September 4, 2012

Time Series Analysis and Forecasting. Programming Approach - thoughts

"Certain things are impossible... 
Until an ignoramus appears, who is not aware of that".



Time Series - a sequence of data points, measured typically at successive time instants spaced at uniform time intervals. 

There are quite a lot of things that may fit this definition. For example, air temperature changes throughout the day (let's say, hourly measured), distance from the Earth to the Moon (which changes slightly throughout the lunar month). Even which political party holds the presidential chair after the elections (which depends on the "history" of the previous president, etc.). We can go on with the list of examples until the server's storage is full. As you may see, the examples above have a cyclic nature, but so does everything (or at least almost everything) related to time series (of course, within certain deviations).

It is in the nature of mankind to want to know the future (although, sometimes it is better not to know). Attempts are being made to predict, or let's use a more politically correct term - forecast, where certain series would go in the future. The best example may be shamans predicting rain or drought. These days there are complex (and not so complex) algorithms to forecast time series (e.g. noise reduction in digital signal processing). But the most scandalous and loud argument is going on about the stock market analysis and forecasting. Many of you may have heard about William Gann - some say genius, some say charlatan. I personally tend to take the first side, although, there may be facts that I am not aware of.

Mr. Gann died almost 60 years ago. Quite a long period of time. Imagine how many time series forecasting (read stock market forecasting) techniques have been born and how many have vanished. Since the chaos theory, more and more people tend to say that "stock market forecasting is impossible due to its fractal nature". Which makes sense if you look at the problem from the chaos theory's perspective. However, do not forget that chaos theory is accepted as the one that fits the situation the best, not as the one that fully explains it. In my perception, this tiny difference leaves a tiny space for hope ;-)

Well, we've had enough of science this far. Let us get to practice. Let me try to simplify things as much as possible, to demonstrate a simpler, yet effective approach from a developer's point of view.


Software

From software perspective, there's not too much needed for successful forecasts - an expert system. Smart people use different software packages and programming languages targeted at expert systems development, but being an ignoramus (as I decided to be for this article), I decided to use what I have and what I know - C language, GCC and Geany text editor as an IDE.


Data

There are several (graphical) ways to represent stock/forex market data. The best known one is candlesticks: a sequence of simple graphic figures, each of which represents the variation of the price over a certain period of time (open, high, low and close values). We, however, are not going to consider any of them, simply because we do not need that. Instead, we are going to concentrate on the raw row of numbers for a given period (let's say one year) measured hourly, which gives us a sequence of more than 8000 items (we are only paying attention to one value - either open, high, low or close).

If you try to plot this sequence (e.g. in Excel) you will get a curvy line. Take another look at it and you will notice that there are similar segments (within certain deviations, of course), just like a set of similar images, which brings up one of the best approaches for image recognition - Artificial Neural Networks (especially perceptrons). There is nothing new in using ANN for stock/forex market analysis. There are tons of commercial software products that provide the end user with different indicators telling him/her whether to buy, sell or hold the current position. However, I personally have not seen many attempts to actually make long term (e.g. 24 hours for an hourly measured sequence) forecasts. There is also a lot of uncertainty as to what data should be used as the ANN's input and how much data should be fed in each time. Unfortunately, no one has the exact answer to this question. It is a matter of trial and error. The same applies to the number of hidden neurons in the ANN.

Another big question is how the data should be preprocessed - prepared for the ANN. Some use complex algorithms (Fourier transform, for example), others tend to use more simplistic ones. The idea is that data should be in the range of 0.0 - 1.0 and it should be as varied as possible. But remember - if you feed the ANN with garbage, you get garbage in response. Meaning that you have to carefully select your algorithm for data preprocessing (normalization). I tend to use a custom normalization algorithm, which is quite simple. Sorry to disappoint you, but I am not going to give it here for now as it is still not completely defined (although, it already produces good results).

The bottom line for this paragraph - data preprocessing is not very important, it is the MOST important.
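
For readers who want something to start with, here is the most basic way of squeezing a series into the 0.0 - 1.0 range: plain min-max scaling. To be clear, this is a generic textbook technique and NOT the custom normalization mentioned above; the function name is made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Scale 'count' values from 'in' linearly into [0.0, 1.0] and store them in 'out'.
   Returns 0 on success, -1 if the series is flat (max == min) and cannot be scaled. */
int normalize_min_max(const double* in, double* out, size_t count)
{
   double min, max;
   size_t i;
   assert(in != NULL && out != NULL && count > 0);
   min = max = in[0];
   for(i = 1; i < count; i++)
   {
      if(in[i] < min) min = in[i];
      if(in[i] > max) max = in[i];
   }
   if(max == min)
      return -1;
   for(i = 0; i < count; i++)
      out[i] = (in[i] - min) / (max - min);
   return 0;
}
```

Keep min and max around if you normalize this way: they are needed to convert the network's output back into a price.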


Instruments

My programming solution for this problem is quite simple - a console program that reads the input (the whole sequence of price values for the specified period), trains an artificial neural network (in my case the topology was 8x24x1 - 8 inputs, 24 hidden neurons and one output neuron), and then produces a long term forecast (at least 7 entries into the future), where each step of the forecast is made using the previously generated values.

The ANN is a simple multilayer perceptron with 8 inputs, 24 hidden neurons and 1 output neuron. Basically saying - we do not perform many calculations ourselves, if at all. ANN is a perfect implementation of a learning paradigm, able to find hidden dependencies and rules. Therefore, if you ask me - there is no better solution than utilizing ANNs for time series forecasting.
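
The "each step uses the previously generated values" part is just a sliding window. A rough sketch of that loop is given below; the predictor_t type, forecast() and the linear_stub() stand-in are all hypothetical names, and the stub merely continues the last step so the loop can be exercised without a trained network.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define WINDOW 8 /* number of ANN inputs, as in the 8x24x1 topology */

/* Hypothetical predictor type: maps the last WINDOW values to the next one.
   In the article this role is played by the trained ANN. */
typedef double (*predictor_t)(const double window[WINDOW]);

/* Produce 'steps' forecast values in 'out'. Only the last WINDOW values
   of 'history' are used; every new value is fed back into the window. */
void forecast(predictor_t predict, const double* history, size_t history_len,
              double* out, size_t steps)
{
   double window[WINDOW];
   size_t i;
   assert(predict != NULL && history != NULL && out != NULL);
   assert(history_len >= WINDOW);
   memcpy(window, history + history_len - WINDOW, sizeof window);
   for(i = 0; i < steps; i++)
   {
      out[i] = predict(window);
      /* drop the oldest value and append the one just generated */
      memmove(window, window + 1, (WINDOW - 1) * sizeof window[0]);
      window[WINDOW - 1] = out[i];
   }
}

/* Stand-in predictor used only to exercise the loop: repeats the last step. */
double linear_stub(const double window[WINDOW])
{
   return 2.0 * window[WINDOW - 1] - window[WINDOW - 2];
}
```

Since every forecast value becomes an input for the next one, errors accumulate with each step, which is one reason long term forecasts are harder than single-step ones.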


Test

So, I implemented an ANN (in C this time, not in Assembly) and got the dataset (EUR/USD price values for every hour of the past year). The next move was to give it a try and test in run time. I decided to do that during the weekend as I was not sure about how much time would be required to train the network. Surprisingly, I got a good error after only about 30,000 epochs (several minutes). The following picture shows what I got:

[Figure: EUR/USD forecast]

Test set - data not included in the ANN training process. Used as a pattern for error calculation.
Test forecast - forecast on data from the past, which was not included in the training set.
Real forecast - forecast of the future values. This was done on Saturday at least 24 hours before the opening of the next trading session.
Real data - real values obtained Monday early morning after the new trading session began.

As you can see, such a simple system was even able to forecast the gap between the two sessions.


P.S. Although, this article contains no source code, no description of any interesting programming technique or whatsoever, it comes to show, that each problem has a (not necessarily complicated) solution. Most of the time, the most important thing is to take a look at a problem from another angle.




Friday, August 31, 2012

Emulation of Hardware. CPU & Memory

There are tens of hardware platforms (although, some people would say that there is only one - computer ;-) ). Each one has its own advantages over others and disadvantages as well. For example, Intel is the most used platform for desktops, ARM and MIPS are widely used in embedded systems and so on. Sometimes, a need may arise to test/debug executable code written for a platform other than the one you have access to. For example, what if you have to run ARM code while using an Intel based desktop? In most cases, this is not a problem at all due to the large number of available platform emulators (e.g. QEMU and many others). However, even though QEMU is quite a powerful tool, there are certain cases when it is not helpful (at least not without certain modifications).

Note for nerds:
Yes, there are such cases - the fact that you have not seen one does not mean they do not exist.
The code in this article is for demonstration purposes only - checks for errors may be omitted. It may be unoptimized.
Yes, there may be better ways.

Either forced by current needs or just for fun, you may want to write your own emulator for any existing (or not existing) platform. You may check this article to see how a simplistic CPU may be designed and implemented. However, CPU is only a tiny (although, important) part of your emulator. There are many other things that you would have to take care of, such as memory, IO devices, etc. Of course, the complexity of the implementation depends on how isolated you want your emulator to be. 

As you may understand from the title of this article, we are going to concentrate on the CPU to Memory (RAM) interface. It may be a good idea to define in advance how much memory your emulator should support (define the width of the address line). For example, if you are going to support at most 64 kB, then 16 bit addressing mode would be enough. In such case, you may simply allocate a contiguous memory area and access it directly. However, what if you plan to support 1 or 2 or even more gigabytes? It would not necessarily all be used at once, but your architecture may imply this. You definitely would not want to make such a huge allocation. Especially not if the software you are planning to run uses a tiny bit of memory in lower address space, a tiny bit in the upper and itself is loaded somewhere in the middle. If this is the situation, then you should implement a kind of a paging mechanism, which would only allocate pages for addresses which are actually being used.


Paging

Let's make some definitions to deal with pages:

#define PAGE_SIZE 0x1000 // You may choose to use other size
#define PAGE_MASK 0x0FFF // This depends on the value of PAGE_SIZE

typedef struct _page_t
{
   struct _page_t   *previous, *next; // Both are pointers
   unsigned long     base;  // Address in the emulated memory represented by this page
   unsigned int      flags; // Whatever flags you want your pages to have
   unsigned char*    mem;   // Pointer to the actual allocated memory
}page_t;

The mechanism is quite similar to the actual paging mechanism used today, except that you do not have to use page tables as most of the time a simple linked list of pages is enough and that you are not mapping virtual memory to physical, but mapping emulated memory to the virtual memory which is accessible for the emulator.

previous and next - pointers to other page_t structures in the linked list of pages;
base - lower address of the emulated memory represented by this page;
flags - any attributes you would like your pages to have (e.g. is it writable or executable, etc.);
mem - pointer to the memory area actually allocated by the emulator.

Using such mechanism will reduce the overall memory usage as you would have to allocate only those memory areas used by the software you are running on your emulator.


Page Management

It is, of course, up to you how to manage this kind of paging, but, as it seems to me, it may be a good idea to implement a set of functions to manage the sorted (by base) linked list of pages:

page_t* memory_page_alloc(void);

This function would simply return a pointer to an allocated page_t structure. Don't forget to allocate a real memory area of PAGE_SIZE bytes and store a pointer to it in page_t->mem.

void  memory_page_release(page_t** pg);

This function releases all the resources allocated for a page. This includes the memory which actually represents the page and is pointed to by page_t->mem and the page_t structure itself.

int  memory_page_add(page_t** page_list, unsigned long base);

This function is responsible for allocation of a new page, which would represent memory starting at base and its insertion into the sorted linked list of pages.
*page_list - pointer to the first page in the linked list of pages;
base - beginning address of the emulated memory of size PAGE_SIZE.
Its return value should tell you whether a page has been added or an error occurred during memory allocations.
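
One possible shape of the three functions described above is sketched here. Treat it as an assumption-laden example rather than the reference implementation: the return codes of memory_page_add (0 - added, 1 - page already present, -1 - allocation failure) are my own choice, pages are zero-filled on allocation, and memory_page_release does not unlink the page from the list.

```c
#include <assert.h>
#include <stdlib.h>

#define PAGE_SIZE 0x1000
#define PAGE_MASK 0x0FFF

typedef struct _page_t
{
   struct _page_t *previous, *next;
   unsigned long   base;
   unsigned int    flags;
   unsigned char*  mem;
} page_t;

/* Allocate a zero-filled page descriptor together with its PAGE_SIZE backing memory. */
page_t* memory_page_alloc(void)
{
   page_t* pg = calloc(1, sizeof *pg);
   if(pg == NULL)
      return NULL;
   pg->mem = calloc(PAGE_SIZE, 1);
   if(pg->mem == NULL)
   {
      free(pg);
      return NULL;
   }
   return pg;
}

/* Free the backing memory and the descriptor, then clear the caller's pointer.
   The page is assumed to be unlinked from the list already. */
void memory_page_release(page_t** pg)
{
   if(pg == NULL || *pg == NULL)
      return;
   free((*pg)->mem);
   free(*pg);
   *pg = NULL;
}

/* Insert a page covering 'base' into the list, keeping it sorted by base.
   Returns 0 if added, 1 if such a page already exists, -1 on allocation failure. */
int memory_page_add(page_t** page_list, unsigned long base)
{
   page_t *prev = NULL, *cur, *pg;
   assert(page_list != NULL);
   base &= ~(unsigned long)PAGE_MASK; /* align down to a page boundary */
   for(cur = *page_list; cur != NULL && cur->base < base; cur = cur->next)
      prev = cur;
   if(cur != NULL && cur->base == base)
      return 1;
   pg = memory_page_alloc();
   if(pg == NULL)
      return -1;
   pg->base     = base;
   pg->previous = prev;
   pg->next     = cur;
   if(cur != NULL)
      cur->previous = pg;
   if(prev != NULL)
      prev->next = pg;
   else
      *page_list = pg; /* the new page became the head of the list */
   return 0;
}
```

Passing page_t** rather than page_t* matters here: a page with a lower base than the current head replaces the head, and the caller has to see that change.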


Memory Access Emulation

Due to the fact that we are not talking about one consistent array, but rather several separated memory areas (from the emulator's point of view) it makes sense to write a couple of functions that would perform read/write operations from/to the emulated memory.

int memory_read_byte(page_t* pg_list, unsigned long address, unsigned char* byte);

This function is responsible for reading a single byte from the emulated memory pointed by address. The read byte is returned into location pointed by byte. It walks the linked list of pages looking for a page where page_t->base <= address && (page_t->base + PAGE_SIZE) > address. If there is no such page, then it either allocates and adds it to the list of pages, then performs the read operation or simply returns error (this may be helpful in order to emulate memory access violations). It is up to you to define the behavior of this function in such situation. In fact, you may define an internal flag to enable/disable automatic page allocations.

int memory_write_byte(page_t* pg_list, unsigned long address, unsigned char byte);

This function is almost identical to the one above, except that it writes a single byte to the emulated memory. Its behavior should be the same as memory_read_byte.


It is definitely not that good to only be able to transfer one byte at a time, so you are more than welcome to implement functions for larger transfers. However, you will need to be careful in those cases when such a transfer involves two pages and check that both pages are allocated (meaning accessible).
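
Below is a small sketch of the byte accessors together with one larger transfer built on top of them. Several things in it are assumptions of mine: the functions return 0 on success and -1 for an unmapped address (no automatic allocation; that variant would need a page_t** parameter, because a newly added page may become the head of the list), memory_page_find() and memory_read_dword() are hypothetical helpers, and the emulated CPU is taken to be little-endian.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 0x1000

typedef struct _page_t
{
   struct _page_t *previous, *next;
   unsigned long   base;
   unsigned int    flags;
   unsigned char*  mem;
} page_t;

/* Walk the list and return the page that contains 'address', or NULL. */
page_t* memory_page_find(page_t* pg_list, unsigned long address)
{
   for(; pg_list != NULL; pg_list = pg_list->next)
      if(pg_list->base <= address && address - pg_list->base < PAGE_SIZE)
         return pg_list;
   return NULL;
}

/* Read one byte; returns 0 on success, -1 if the address is not mapped. */
int memory_read_byte(page_t* pg_list, unsigned long address, unsigned char* byte)
{
   page_t* pg = memory_page_find(pg_list, address);
   assert(byte != NULL);
   if(pg == NULL)
      return -1; /* emulated access violation */
   *byte = pg->mem[address - pg->base];
   return 0;
}

/* Write one byte; same return convention as memory_read_byte. */
int memory_write_byte(page_t* pg_list, unsigned long address, unsigned char byte)
{
   page_t* pg = memory_page_find(pg_list, address);
   if(pg == NULL)
      return -1;
   pg->mem[address - pg->base] = byte;
   return 0;
}

/* Read a 32 bit little-endian value that may span two pages.
   '*value' is only written if all four bytes are accessible. */
int memory_read_dword(page_t* pg_list, unsigned long address, unsigned long* value)
{
   unsigned long result = 0;
   unsigned char b;
   int i;
   assert(value != NULL);
   for(i = 0; i < 4; i++)
   {
      if(memory_read_byte(pg_list, address + i, &b) != 0)
         return -1;
      result |= (unsigned long)b << (8 * i);
   }
   *value = result;
   return 0;
}
```

Going byte by byte is the slow but safe way to cross a page boundary. A faster version would look up the page once and fall back to this path only when the transfer does not fit into the remainder of the page.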


Of course, there are many more things to emulate, such as IO devices and possibly network adapters, but memory is the most important, and the rest goes far beyond the scope of this article.

Hope this article was informative. See you at the next.




Wednesday, May 30, 2012

CreateRemoteThread. Bypass Windows 7 Session Separation

The Internet is full of programmers' forums and those forums are full of questions about the CreateRemoteThread Windows API function not working on Windows 7 (when trying to inject a DLL). Those posts made by lucky people, somehow, redirect you to the MSDN page dedicated to this API, which says: "Terminal Services isolates each terminal session by design. Therefore, CreateRemoteThread fails if the target process is in a different session than the calling process." and, basically, means - start the process from your injector as suspended, inject your DLL and then resume the process' main thread. This works... Most of the time... But sometimes you really need to inject your code into a running process. Isn't there a way to do that? Well, there is. As a matter of fact, it is so easy, that I decided not to attach my source code to this article (mainly, because I am too lazy to make it look readable :) ). It appears that I am not the only lazy one here :), so I have uploaded the source code.

Let me start as usual, with a note for nerds in order to avoid meaningless comments and stupid discussions. 
The code provided within the article is for example purposes only. Error checks have been omitted on purpose. Yes, there may be another, probably even better, way of doing this. No, manual DLL mapping is not better unless you have plenty of time and nothing to do with it.

All others, let's get to business :)


Opening the Victim Process

This is the easiest part. At this stage you will see whether you are able to inject your code or not (in case of a system process, for example). Nothing unusual here - you simply invoke the good old OpenProcess API:

HANDLE WINAPI OpenProcess(
       DWORD dwDesiredAccess, /* in our case PROCESS_ALL_ACCESS */
       BOOL  bInheritHandle, /* no need, so FALSE */
       DWORD dwProcessId /* self explanatory enough */
);

which opens the process specified by dwProcessId and returns a handle to that process, unless you lack sufficient rights to access it.


Reading the Shellcode

What you usually see in the examples of shellcode over the internet, is an unsigned char array of hexadecimal values somewhere in the C code. Helps to keep the amount of files smaller, but is not really comfortable to deal with. I decided to store the shellcode in a separate binary file, produced with FASM (Flat Assembler):

use32
   ; offset of the LoadLibraryA address within the shellcode
   dd    func
   ; save all registers
   push  eax ebx ecx edx ebp edi esi
   ; get your EIP
   call  next
next:
   pop   eax
   mov   ebx, eax
   ; get the address of the DLL name
   mov   eax, string - next
   ; do this to avoid possible negative values (due to sign extend)
   movzx eax, al
   add   eax, ebx
   ; pass it to the LoadLibraryA API
   push  eax
   ; get the address of the LoadLibraryA function
   mov   eax, func - next
   movzx eax, al
   add   eax, ebx
   mov   eax, [eax]
   ; call LoadLibraryA
   call  eax
   ; restore registers
   pop   esi edi ebp edx ecx ebx eax
   ; return
   ret
func     dd 0x12345678 ; placeholder for the address
string:

Compiling this code with FASM.EXE will produce a raw binary file, where all offsets are 0-based. There are some parts in the code above that may require some additional explanation (for example, why it does not end with ExitThread()). I am aware of this and I will provide you with the explanation a little bit later.

For now, allocate an unsigned char buffer for your shellcode. Make this buffer large enough to contain the shellcode and the name of the DLL with its terminating zero (my assumption is that you passed that name as a command line parameter to your injector).

Once you have read the shellcode into that buffer - append the name of the DLL (which may be a full path to the DLL) to the end of the shellcode with, for example, the memcpy() function. Half done with it. Now we still have to "tell" the shellcode where the LoadLibraryA API function is located in memory. Fortunately, the load address randomization in Windows is far from being perfect (addresses of loaded modules may vary between subsequent reboots, but are the same for all processes). This means that, just as in usual DLL injection, we obtain the address of this API in our process by calling good old GetProcAddress(GetModuleHandleA("kernel32.dll"), "LoadLibraryA") and save it to the "func" variable of the shellcode. Due to the fact that our shellcode may vary in size from time to time (that depends on the needs), we saved the offset to that variable in the first four bytes of the shellcode, which eliminates the need to hardcode the offset. Simply do the following:

*(unsigned int*)(shellcode_ptr + *(int*)(shellcode_ptr)) = (unsigned int)LoadLibraryA_address;

Our shellcode is ready now.
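
The buffer preparation described in this section boils down to three copies. The sketch below puts them in one helper; the function name and parameters are hypothetical, the address is assumed to be 32 bit (as in the shellcode above), and memcpy() is used instead of pointer casts to stay clear of alignment issues.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Build [shellcode][dll path + terminating zero] in a new buffer and store the
   32 bit address at the offset found in the first four bytes of the shellcode.
   Returns a malloc'ed buffer (the caller frees it) or NULL on error. */
unsigned char* build_injection_buffer(const unsigned char* code, size_t code_size,
                                      const char* dll_path, uint32_t load_library_addr,
                                      size_t* total_size)
{
   uint32_t func_offset;
   size_t path_size;
   unsigned char* buf;
   assert(code != NULL && dll_path != NULL && total_size != NULL);
   if(code_size < sizeof func_offset)
      return NULL;
   memcpy(&func_offset, code, sizeof func_offset);
   if(func_offset > code_size - sizeof load_library_addr)
      return NULL; /* the offset points outside of the shellcode */
   path_size = strlen(dll_path) + 1; /* include the terminating zero */
   buf = malloc(code_size + path_size);
   if(buf == NULL)
      return NULL;
   memcpy(buf, code, code_size);
   memcpy(buf + code_size, dll_path, path_size);
   memcpy(buf + func_offset, &load_library_addr, sizeof load_library_addr);
   *total_size = code_size + path_size;
   return buf;
}
```

The value returned in total_size is the size to pass on when the buffer is later copied into the other process.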


"Create remote thread" without CreateRemoteThread()

As the title of this paragraph suggests - we are not going to use the CreateRemoteThread(). In fact, we are not going to create any thread in the victim process (well, the injected DLL may, but the shellcode won't).


Code Injection

Surely, we need to move our shellcode into the victim process' address space in order to load our library. We do it in the same manner as we would copy the name of the DLL in the regular DLL injection procedure:
  1. Allocate memory in the remote process with
    LPVOID WINAPI VirtualAllocEx(
       HANDLE hProcess, /* the handle we obtained with OpenProcess */
       LPVOID lpAddress, /* preferred address; may be NULL */
       SIZE_T dwSize, /* size of the allocation in bytes */
       DWORD  flAllocationType, /* MEM_COMMIT */
       DWORD  flProtect /* PAGE_EXECUTE_READWRITE */
    );
    This function returns the address of the allocation in the address space of the victim process or NULL if it fails.
  2. Copy the shellcode into the buffer we've just allocated in the address space of the victim process:
    BOOL WINAPI WriteProcessMemory(
       HANDLE   hProcess, /* same handle as above */
       LPVOID   lpBaseAddress, /* address of the allocation */
       LPCVOID  lpBuffer, /* address of the local buffer with the shellcode */
       SIZE_T   nSize, /* size of the shellcode together with the appended NULL-terminated string */
       SIZE_T   *lpNumberOfBytesWritten /* if this is zero - check your code */
    );
    If the return value of this function is non zero - we have successfully copied our shellcode into the victim process' address space. It may also be a good idea to check the value returned in the lpNumberOfBytesWritten.

Make It Run
So, we have copied our shellcode. The only thing left is to make it run, but we cannot use the CreateRemoteThread() API... The solution is a bit more complicated.

First of all, we have to suspend all threads of the victim process. In general, suspending only one thread is enough, but, as we cannot know for sure what is going on there, we should suspend them all. There is no specific API that would provide us with the list of threads of a specified process. Instead, we have to create a snapshot with CreateToolhelp32Snapshot, which provides us with the list of all threads of all processes currently running in the system:

HANDLE WINAPI CreateToolhelp32Snapshot(
   DWORD dwFlags, /* TH32CS_SNAPTHREAD = 0x00000004 */
   DWORD th32ProcessID /* in this case may be 0 */
);

This function returns the handle to the snapshot, which contains information on all present threads. Once we have this, we "iterate through the list" with Thread32First and Thread32Next API functions:

BOOL WINAPI Thread32First(
   HANDLE hSnapshot, /* the handle to the snapshot */
   LPTHREADENTRY32 lpte /* pointer to the THREADENTRY32 structure */
);

The Thread32Next has the same prototype as Thread32First.

typedef struct tagTHREADENTRY32{
   DWORD dwSize; /* size of this struct; you have to initialize this field before use */
   DWORD cntUsage; 
   DWORD th32ThreadID; /* use this value to open thread for suspension */
   DWORD th32OwnerProcessID; /* compare this value against the PID of the victim 
                              to filter out threads of other processes */
   LONG  tpBasePri;
   LONG  tpDeltaPri;
   DWORD dwFlags;
} THREADENTRY32, *PTHREADENTRY32;

For each THREADENTRY32 with matching th32OwnerProcessID, open it with OpenThread() and suspend with SuspendThread:

HANDLE WINAPI OpenThread(
   DWORD dwDesiredAccess, /* THREAD_ALL_ACCESS */
   BOOL  bInheritHandle, /* FALSE */
   DWORD dwThreadId /* th32ThreadID field of THREADENTRY32 structure */
);

and

DWORD WINAPI SuspendThread(
   HANDLE hThread /* Obtained by OpenThread() */
);

Don't forget to CloseHandle(openedThread) :)

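Take the first thread once it is opened and suspended (actually, you can do that with any thread that belongs to the victim process) and get its CONTEXT (see "Community Additions" here) using the GetThreadContext API. Remember to set the ContextFlags field of the CONTEXT structure (CONTEXT_FULL, for example) before the call, as it tells the function which parts of the context to retrieve: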

BOOL WINAPI GetThreadContext(
   HANDLE    hThread, /* handle to the thread */
   LPCONTEXT lpContext /* pointer to the CONTEXT structure */
);

Now that all the threads of the victim process are suspended, we may do our job. The idea is to redirect the execution flow of this thread to our shellcode, but to do it in such a way that the shellcode returns to where the suspended thread currently is. This is not a problem at all, as we have the CONTEXT of the thread. The following code does that just fine:

/* "push" current EIP of the thread onto its stack, so that the ret instruction in the shellcode returns the execution flow to this address (often somewhere inside a wait function, such as WaitForSingleObject, for suspended threads) */
ctx.Esp -= sizeof(unsigned int);
WriteProcessMemory(victimProcessHandle, 
                   (LPVOID)ctx.Esp, 
                   (LPCVOID)&ctx.Eip,
                   sizeof(unsigned int),
                   &bytesWritten);
/* Set the EIP to our injected shellcode; do not forget to skip the first four bytes */
ctx.Eip = (DWORD)remoteAddress + sizeof(unsigned int);
/* Apply the modified context; without this call the changes above have no effect */
SetThreadContext(openedThread, &ctx);
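
To make the arithmetic above easier to follow, here is a small portable model of it. It is only a sketch under stated assumptions: MockContext stands in for the two CONTEXT fields involved, a local byte array stands in for the victim's stack, and the function names are invented for this illustration. No Windows API is called here.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for the two CONTEXT fields the redirect touches (assumption). */
typedef struct {
    uint32_t Esp;
    uint32_t Eip;
} MockContext;

/* Model of the redirect: 'stack' stands in for victim memory that starts at
   address 'stack_base'. Returns 0 on success, -1 if the push would not fit. */
int redirect_to_shellcode(MockContext *ctx, unsigned char *stack,
                          uint32_t stack_base, uint32_t stack_size,
                          uint32_t remote_address)
{
    if (ctx->Esp < stack_base + sizeof(uint32_t) ||
        ctx->Esp > stack_base + stack_size)
        return -1;

    /* "push" the current EIP: make room, then store the return address */
    ctx->Esp -= sizeof(uint32_t);
    memcpy(stack + (ctx->Esp - stack_base), &ctx->Eip, sizeof(uint32_t));

    /* continue at the shellcode, skipping its four-byte offset header */
    ctx->Eip = remote_address + sizeof(uint32_t);
    return 0;
}

/* Model of the final 'ret' in the shellcode: pop the saved address into EIP. */
void simulate_ret(MockContext *ctx, const unsigned char *stack,
                  uint32_t stack_base)
{
    memcpy(&ctx->Eip, stack + (ctx->Esp - stack_base), sizeof(uint32_t));
    ctx->Esp += sizeof(uint32_t);
}
```

After redirect_to_shellcode() followed by simulate_ret(), both Esp and Eip hold their original values, which is exactly why the hijacked thread can continue as if nothing happened.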

Almost there. All we have to do now is resume the previously suspended threads with ResumeThread() in the same manner (iterating with Thread32First and Thread32Next over the same snapshot handle).

Don't forget to close the victim process' handle with CloseHandle() ;)


Shellcode

After all this, the execution flow in the selected thread of the victim process reaches our shellcode, whose source code should be pretty clear now. It simply calls the LoadLibraryA() API function with the name/path of the DLL we want to inject.

One important note - it is a bad practice to do anything "serious" inside the DllMain() function. My suggestion is to create a new thread in DllMain() and do all the work there, so that DllMain() may return safely.

Hope this article was helpful.

Have fun injecting and see you at the next.




Wednesday, May 23, 2012

Passing Events to a Virtual Machine

The source code for this article may be found here.

Virtual machines and software frameworks are an integral part of our digital life. There are complex VMs and simple software frameworks. These two articles (Simple Virtual Machine and Simple Runtime Framework by Example) show how easy it may be to implement one yourself. I did my best to describe the way VM code may interact with native code and the Operating System; however, the interaction in the opposite direction is still left unexplained. This article is going to fix this omission.

As usual - note for nerds:
The source code given in this article is for example purposes only. I know that this framework is far from being perfect, therefore, this article is not a howto or tutorial - just an explanation of principle. Error checks are omitted on purpose. You want to implement a real framework - do it yourself, including error checks.
By saying VM's code I do not refer to the implementation of the virtual machine, but to the pseudo code that runs inside it.


Architecture Overview
Needless to say, the ability to pass events/signals to code executed by the virtual machine implies a more complex VM architecture. While all previous examples were based on a single function responsible for the execution, adding events means not only adding another function, but also introducing threads to our implementation.

At least two threads are needed:
  1. Actual VM - this thread is responsible for the execution of the VM's executable code and events queue dispatch (processor);
  2. Event Listener - this thread is responsible for collection of relevant events from the Operating System and adding them to the VM's event queue (listener).

Fig.1 VM Architecture with Event Listener

You may see that the Core() function, in the attached source code, creates an additional thread.


Event Listener

This thread collects events from the Operating System (mouse move, key up/down, etc) and adds a new entry to the list of EVENT structures.

typedef struct _EVENT
{
   struct _EVENT* next_event; // Pointer to the next event in the queue
   int            code;       // Code of the event
   unsigned int   data;       // Either unsigned int data or the address of the buffer
                              // containing information to be passed to the handler
}EVENT;

The code for the listener is quite simple:

while(WAIT_TIMEOUT == WaitForSingleObject(processor_thread, 1))
{
   // Check for events from the OS
   if(event_present)
   {
      EnterCriticalSection(&cs);
      event = (EVENT*)malloc(sizeof(EVENT));
      event->code = whatever_code_is_needed;
      event->data = whatever_data_is_relevant;
      event->next_event = NULL; // terminate the node before linking it into the queue
      add_event(event_list, event);
      LeaveCriticalSection(&cs);
   }
}

The code is self-explanatory. First of all, it checks for available events (this part is omitted and replaced by a comment). If there is a new event to pass to the VM, it adds it to the queue. While in this example event collection is implemented as a loop, in real life you may implement it in the form of callbacks and use the loop above just to wait for the processor thread to exit.
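
The add_event() function itself is not shown in the article. A minimal sketch of how such a queue might look is given below. The signatures (a pointer to the head pointer is passed, so that an empty queue can be handled) and the helpers new_event() and take_event() are assumptions made for this illustration and may differ from the attached source code. Locking is left to the caller, as in the listener loop above.

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct _EVENT
{
   struct _EVENT* next_event; // Pointer to the next event in the queue
   int            code;       // Code of the event
   unsigned int   data;       // Data or address of a buffer with the data
}EVENT;

// Hypothetical helper: allocate and fill a single event (NULL on failure)
EVENT* new_event(int code, unsigned int data)
{
   EVENT* event = (EVENT*)malloc(sizeof(EVENT));
   if(NULL != event)
   {
      event->next_event = NULL;
      event->code = code;
      event->data = data;
   }
   return event;
}

// Append the event to the tail of the queue, so that events are
// delivered in the order of arrival. The caller holds the lock.
void add_event(EVENT** head, EVENT* event)
{
   EVENT** link = head;
   if(NULL == event)
      return;
   event->next_event = NULL;
   while(NULL != *link)
      link = &(*link)->next_event;
   *link = event;
}

// Detach and return the oldest event, or NULL if the queue is empty.
// The caller holds the lock and frees the returned event.
EVENT* take_event(EVENT** head)
{
   EVENT* event = *head;
   if(NULL != event)
   {
      *head = event->next_event;
      event->next_event = NULL;
   }
   return event;
}
```

Walking the list on every append is fine for a handful of pending events; a real framework would probably keep a tail pointer as well.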


Processor

Obviously, the "processor" thread is going to be a bit more complicated than in the previous article (Simple Runtime Framework by Example), as in addition to running the run_opcode(CPU**) function, it has to check for pending events and pass the control flow to the associated handler in the VM code.

typedef struct _EVENT_HANDLER
{
   struct _EVENT_HANDLER* next_handler; // Pointer to the next handler
   int                    event_code;   // Code of the event
   unsigned int           handler_base; // Address of the handler in the VM's code
}EVENT_HANDLER;

DWORD WINAPI RunningThread(void* param)
{
   CPU*            cpu = (CPU*)param;
   EVENT*          event;
   EVENT_HANDLER*  handler;

   do{
      EnterCriticalSection(&cs);
      if(NULL != events)
      {
         event = events;
         events = events->next_event;

         handler = handlers;
         while(NULL != handler && event->code != handler->event_code)
               handler = handler->next_handler;

         // Dispatch only if a handler is registered for this event;
         // otherwise the event is silently dropped
         if(NULL != handler)
         {
            // Save current context by pushing VM registers to VM's stack

            cpu->regs[REG_A] = (unsigned int)event->code;
            cpu->regs[REG_B] = event->data;
            cpu->regs[REG_IP] = handler->handler_base;
         }

         free(event);
      }
      LeaveCriticalSection(&cs);

   }while(0 != run_opcode(&cpu));
   return cpu->regs[REG_A];
}

We are almost done. Our framework already knows how to pass events to the correct handler in the VM's code. Two more things are not covered yet - registering a handler and returning from a handler.


Returning from Handler

Due to the fact that an event handler is not a regular routine, we cannot return from it using the regular RET instruction. Instead, let's introduce another instruction - IRET. As an event actually interrupts the execution flow of the program, IRET (interrupt return) is exactly what we need. The source code that handles this instruction is so simple that there is no need to give it here in the text of the article. All it does is restore the context of the VM's code by popping the registers previously pushed onto the stack.
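
For readers who do not have the source code at hand, the save/restore pair might look roughly like the sketch below. The CPU layout here (register set, a word-addressed stack that grows downward, REG_SP as an index into it) is an assumption made for this illustration only; the real structure comes from the previous articles and may differ.

```c
// Assumed register set; only REG_A, REG_B and REG_IP are named in the article
enum { REG_A, REG_B, REG_C, REG_D, REG_SP, REG_IP, REG_COUNT };

#define VM_STACK_WORDS 256
#define SAVED_WORDS    ((unsigned int)(REG_COUNT - 1)) // everything except SP

typedef struct _CPU
{
   unsigned int regs[REG_COUNT];
   unsigned int stack[VM_STACK_WORDS]; // REG_SP indexes this array
}CPU;

// Called before a handler is entered: push every register except SP.
// Returns 0 on success, -1 if there is not enough room on the VM stack.
int save_context(CPU* cpu)
{
   int r;
   if(cpu->regs[REG_SP] < SAVED_WORDS || cpu->regs[REG_SP] > VM_STACK_WORDS)
      return -1;
   for(r = 0; r < REG_COUNT; r++)
   {
      if(REG_SP == r)
         continue;
      cpu->stack[--cpu->regs[REG_SP]] = cpu->regs[r];
   }
   return 0;
}

// IRET: pop the registers in reverse order, which also restores REG_IP,
// so the interrupted code continues where it was stopped.
int restore_context(CPU* cpu)
{
   int r;
   if(cpu->regs[REG_SP] > VM_STACK_WORDS - SAVED_WORDS)
      return -1;
   for(r = REG_COUNT - 1; r >= 0; r--)
   {
      if(REG_SP == r)
         continue;
      cpu->regs[r] = cpu->stack[cpu->regs[REG_SP]++];
   }
   return 0;
}
```

Since the same number of words is pushed and popped, the stack pointer does not need to be saved: it is back at its original value once restore_context() returns.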


Registering an Event Handler

The last thing left is to "teach" the program written in pseudo assembly to register a handler for a given event type. In order to do this, we need to add one simple system call - SYS_ADD_LISTENER. This system call accepts two parameters:
  1. Code of the event to handle;
  2. Address of the handler function.

loadi  A, 0             ;Code of the event
loadi  B, handler       ;Address of the handler subroutine
_int   sys_add_listener ;Register the handler
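
On the native side, the system call has to put the pair (event code, handler address) into the list of EVENT_HANDLER structures that the processor thread walks through. The sketch below shows one possible way to do that. The function names register_handler() and find_handler() are hypothetical, and replacing an already registered handler for the same event code is a design choice of this example, not necessarily what the attached source code does.

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct _EVENT_HANDLER
{
   struct _EVENT_HANDLER* next_handler; // Pointer to the next handler
   int                    event_code;   // Code of the event
   unsigned int           handler_base; // Address of the handler in the VM's code
}EVENT_HANDLER;

// Return the handler registered for the given event code, or NULL
EVENT_HANDLER* find_handler(EVENT_HANDLER* list, int event_code)
{
   while(NULL != list && event_code != list->event_code)
      list = list->next_handler;
   return list;
}

// What SYS_ADD_LISTENER might do with the values of registers A and B.
// Returns 0 on success, -1 if memory allocation fails.
int register_handler(EVENT_HANDLER** list, int event_code, unsigned int handler_base)
{
   EVENT_HANDLER* handler = find_handler(*list, event_code);
   if(NULL != handler)
   {
      handler->handler_base = handler_base; // replace the existing handler
      return 0;
   }
   handler = (EVENT_HANDLER*)malloc(sizeof(EVENT_HANDLER));
   if(NULL == handler)
      return -1;
   handler->event_code   = event_code;
   handler->handler_base = handler_base;
   handler->next_handler = *list;           // link at the head of the list
   *list = handler;
   return 0;
}
```

As the list is shared with the processor thread, a real implementation would modify it while holding the same critical section that protects the event queue.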


Example Code

The example code attached to this article is the implementation of all of the above. It does the following:
  1. Registers event handler;
  2. Enters an infinite loop printing out '.' every several milliseconds;
  3. The first thread waits a bit and generates an event;
  4. Event handler terminates the infinite loop and returns;
  5. The program prints out a message and exits.


I hope this post was helpful or, at least, interesting.

See you at the next.