Identifying Function, Arguments, and Variables

This is my first essay on disassembling: a rather lengthy document written for newbies in dead listing. It is not an all-in-one document; here I only tackle function identification. I first got the idea to write it when I saw The Sandman's essay about dead listing. If you haven't read it, I advise you to read it first. Although this essay is aimed at dead listing, some of the information also applies to the live approach. I hope it will help you in your disassembling sessions, whatever they are for ;) And if you have any comments or suggestions, or if you catch a mistake in this essay, I'd really appreciate it if you mailed me. That way, I can revise the essay so it serves us all better.

To keep it simple, I focus on C programs that run in the Win32 environment. There are two reasons for this. First, a Win32 program written in C is the most common kind of program we will find and use. Second, by focusing on the 32-bit memory model, I can avoid the complexity of the segmented memory model.

Before we discuss how a high-level function relates to its assembly listing, it doesn't hurt to talk about the stack and the call instruction first. After all, I write this tut for beginners in dead listing. If you already know enough about these two, you can skip this part and jump to Identifying Functions.

The Stack

The stack is a memory area used by a program to store temporary values. When you start a program, the operating system sets up a stack for it: it puts the stack segment's address in SS and sets (E)SP to point to the first byte beyond the stack area. High-level language compilers also use this area to hold function arguments and local variables.

The stack has the properties of a stack of plates. That is, the last one you put on is the first one you take off. But unlike a stack of plates, which is held nicely by gravity, the top of the stack has a lower address than the base of the stack. In other words: the stack grows downward. When a value is added to the stack, the stack pointer is decremented.

You might also want to know that there are two ways a CPU can store a multibyte value in memory. Intel CPUs are Little Endian (only). An easy way to remember it is "Little End In": the little end (lowest byte) of a multibyte value is stored first, at the lowest address. For example, the word 0x1234 is stored in memory as the bytes 34 12. Similarly, the dword 0x5678ABCD is stored as the bytes CD AB 78 56. This rule also applies to the stack. Normally this won't be a problem. But since we little crackers dump memory inside our debugger, we need to know this and mentally straighten the values we see in the memory dump view.
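If you want to check the byte order without a debugger, the short Python sketch below does the conversion in both directions. It is only an illustration: the helper names little_endian_bytes and read_dword are made up for this essay, and it assumes unsigned values.

```python
import struct

def little_endian_bytes(value, size):
    """Return the bytes of `value` in the order an x86 CPU stores them in memory."""
    return value.to_bytes(size, "little")

def read_dword(dump):
    """Turn four bytes copied from a memory dump back into the dword they encode."""
    return struct.unpack("<I", bytes(dump))[0]
```

For instance, little_endian_bytes(0x5678ABCD, 4) yields the bytes CD AB 78 56, and read_dword applied to those four bytes gives 0x5678ABCD again.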

Stack Related Registers

SS	Points to the segment of the program stack.
(E)SP	Stack Pointer. Points to the current top-of-stack value. Implicitly changed by push, pop, call, and ret.
(E)BP	Base Pointer. Usually points to the current stack frame of a procedure. An optimizing compiler may sometimes use (E)BP as a general-purpose register.

Stack Instructions

The following instructions (not a complete list) use the stack to store or retrieve data: PUSH, PUSHF (push flags), PUSHA (push all word registers), PUSHAD (push all double word registers), POP, POPF (pop flags), POPA (pop all word registers), and POPAD (pop all double word registers). I'll only cover the two basic instructions: PUSH and POP.

The syntax for PUSH is:

    PUSH source

The PUSH instruction decrements the value of (E)SP by the size of source, and then copies the value of source to SS:[(E)SP], overwriting whatever was there before. In a 16-bit program, source must be a word value. In a 32-bit program, source can be a word or a dword. The source can be a register (like EAX), a memory location, or an immediate value.

The syntax for POP is:

    POP destination

The POP instruction copies (not moves) the value at SS:[(E)SP] to destination, and then increments (E)SP by the size of destination. The popped value stays in memory until something overwrites it. The destination can be a register or a memory location.

Let's see how the two instructions work. Suppose we have these instructions:

    push 1
    push esi
    push edi
    push 2
    ...
    pop  eax
    pop  edi
    pop  esi

After all the PUSH instructions have executed, the stack will look like this:

    higher memory value
                          1
                          esi
                          edi
                ESP ->    2
    lower memory value

And after all the POP instructions have executed,

    higher memory value
                ESP ->    1
                          esi
                          edi
                          2
    lower memory value

with,

    eax = 2
    esi = pushed esi
    edi = pushed edi

I hope you can follow this quick introduction to the stack.
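If you prefer to watch the stack pointer move, here is a toy model of the example above in Python. It is a sketch, not how any real tool works: the class name Stack32, the function demo, and the starting ESP value are all invented for the illustration, and every value is assumed to be a dword.

```python
class Stack32:
    """Toy model of a 32-bit stack: ESP falls on push and rises on pop."""

    def __init__(self, esp=0x0012FFC0):  # the starting ESP is an arbitrary, made-up value
        self.esp = esp
        self.mem = {}  # address -> stored dword

    def push(self, value):
        self.esp -= 4              # decrement first ...
        self.mem[self.esp] = value  # ... then store at the new top

    def pop(self):
        value = self.mem[self.esp]  # the value is copied; memory is left untouched
        self.esp += 4               # then ESP moves back up
        return value

def demo():
    """Replay push 1 / push esi / push edi / push 2, then three pops."""
    s = Stack32()
    start = s.esp
    for v in (1, "esi", "edi", 2):
        s.push(v)
    lowest = s.esp
    eax = s.pop()
    edi = s.pop()
    esi = s.pop()
    return start, lowest, s.esp, eax, edi, esi, s.mem[s.esp]
```

After the four pushes ESP is 16 bytes below where it started; after the three pops it sits 4 bytes below the start and points at the value 1, just like the second diagram.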

Identifying Functions

Although modern disassemblers like W32DASM and IDA Pro will mark functions for us, there are times when the disassembler won't be able to find one. A function that is only reached through a pointer is such an example. Therefore, we still need to know how the compiler translates a function from a high-level language into its assembly language equivalent.

The CALL and RET Instructions

At its most basic, the call instruction is simply a means to transfer control to another part of a program, just like jmp or a conditional jump. But unlike a jump, which is a permanent transfer of control, call stores return information, so when the called routine terminates with a ret instruction, the program can go back to the caller. Knowing this, it's clear that the success of a call instruction depends on the mechanism used to store and retrieve the return information. Unless the call target is a TSS (task state segment) or a task gate, the mechanism is fairly simple.

The called routine can return to its caller because the location of the caller's next instruction has been saved. The call instruction does this by saving the value of the (E)IP register before it jumps into the called routine. As you probably know, (E)IP points to the next instruction that the CPU is going to execute. Simply by changing the (E)IP register, you change the way a program executes. The call instruction saves this value by pushing it onto the stack. Since (E)IP is advanced as soon as an instruction is fetched, the value pushed onto the stack is the address of the instruction following the call. In a flat-model 32-bit program, practically every call is a near call. In a 16-bit program, where an intersegment transfer is sometimes needed, the program needs to use a far call. In a far call, the value of CS:(E)IP needs to be saved. The CS register is pushed first, then the value of (E)IP. When the called routine terminates, the ret instruction pops the saved value back into (E)IP (and into CS for a far return).

The call syntax is very simple. Regardless of the target, the syntax is the same:

    call address

A problem occurs when the call uses an indirect address, for example call edi or call [ebx+0Dh]. In the first case, we can search backward from the call for the instruction that loads an address into edi. Usually this will be the address of a Windows API function. In the second case, you can use a live approach (using a debugger): set a breakpoint there, and find out the value of ebx. Or, a better way, equip yourself with the necessary knowledge to find it. Fravia+ already wrote an excellent essay about call relocation tables. Read it. All I can add is that with IDA Pro, you can use its Search in Core function to search for the call relocation table. There are times, of course, when a debugger is the only way to find it. It is almost impossible to find where the address refers to in a C++ program that has tons of indirect addresses. This is where W32DASM has an advantage over IDA Pro: with its built-in debugger, finding the address is a snap.

If a call instruction identifies a function entry point, then it makes sense that a ret instruction identifies a function return point. Basically, the ret instruction returns to the caller by restoring the appropriate (E)IP (and CS for a far return). Depending on the type of the call (and the target), ret can be retn (near return), retf (far return), or iret (interrupt return, which may involve task switching). A ret instruction can only appear without the n or f suffix if it's written inside a proc directive. A good disassembler like IDA Pro, which disassembles a program into properly coded assembly source, will usually put a ret instruction inside a proc directive. Then it's the type of the proc (near or far) that defines the type of the ret.

The ret instruction can take a numeric operand that tells how many bytes should be removed from the stack after the return address has been popped. Suppose you see code like this:

    some_procedure proc near
        ...
        ...
        pop  edi
        pop  esi
        add  esp, 10h
        pop  ebp
        ret  8
    some_procedure endp

Assuming that some_procedure correctly pops every value that it pushed onto the stack, the situation in the caller will be like this:

    ...                            ; esp points to xxxxxxxx
    push 32bit_variable            ; esp points to 32bit_variable
    push another_32bit_variable    ; esp points to another_32bit_variable
    call some_procedure
    inc  edi                       ; ret instruction will bring us here
                                   ; and esp points back to xxxxxxxx
    ...
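The ESP comments in the listing above can be checked with a few lines of Python. This is only a sketch under simple assumptions: a flat 32-bit program, near calls, dword pushes, and a called routine that balances its own pushes and pops. The function name trace_call and the ESP value used with it are invented for the illustration.

```python
def trace_call(esp, arg_count, ret_imm):
    """Follow ESP through `push` x arg_count, `call`, and `ret ret_imm`."""
    trace = [("before", esp)]
    for _ in range(arg_count):
        esp -= 4                  # each push moves ESP down by one dword
        trace.append(("push", esp))
    esp -= 4                      # call pushes the return address
    trace.append(("call", esp))
    esp += 4 + ret_imm            # ret pops the return address, then drops ret_imm bytes
    trace.append(("ret", esp))
    return trace
```

With two pushed arguments and ret 8, the last entry of the trace shows ESP back at its starting value. With a plain ret (ret_imm of 0), ESP ends up 8 bytes too low, and somebody else has to fix it.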

Determining a function's exit can be trickier. If the compiler's optimizer is turned on, there may be several places where the function does a ret to its caller. Usually, wherever one function ends, another function starts. So even if a function has multiple ret instructions, we can verify where it ends by looking at the instructions that follow each ret. When the disassembler marks the next function properly, this probably won't be a problem. But in case it isn't marked properly, we have to search for something that looks like a function prologue.

Function Prologue and Epilogue

The standard prologue generated by a compiler will be some variation of these steps:

Save the original (E)BP on the stack.
Copy (E)SP (the stack pointer) into the (E)BP register.
Decrement the stack pointer to make room for local variables.
Save the calling function's register variables on the stack.

Expressed in assembly language, it will look like this (32-bit):

    push ebp         ; Save caller's EBP frame
    mov  ebp, esp    ; Set up new EBP frame
    sub  esp, xxxx   ; xxxx is the number of bytes needed for local variable
    push esi         ; Save caller's register
    push edi
    ...

In C/C++ programs, the called function must preserve certain registers. In Win32 code these are EBX, ESI, EDI, EBP, and ESP, plus the segment registers CS, DS, and SS (16-bit compilers typically require SI, DI, BP, SP, and the same segment registers). That is, the called function must return with these registers holding the values they had on entry. If the called function needs these registers (ESI and EDI are the most commonly used), it must save them on the stack first. However, the called function can use EAX, ECX, and EDX (and, in 16-bit code, usually BX and ES as well) freely, without saving them first. The compiler does the saving in the prologue.

Note about ENTER and LEAVE:

Two special instructions, ENTER and LEAVE, were added to the 80186/80286 and later processors to accommodate high-level languages that require a stack frame when calling subroutines. If 80286 or better code generation is enabled, the compiler may opt to use these instructions. Look for:

    enter xxxx, 0h   ; xxxx is the number of bytes needed for local variable
    push  esi
    push  edi
    ...

The number 0h in the ENTER instruction is the nesting level. At level 0, ENTER creates a stack frame that is equivalent to these steps:

    push ebp
    mov  ebp, esp
    sub  esp, xxxx

If the level is above 0, ENTER also copies frame pointers into the new frame: first the parent's (E)BP as a back link, then the (E)BP of each higher level in order, ending with the current (E)BP. This makes it easy for a program to reach a higher level's variables. When the function accesses a higher level's variable, look for code similar to this:

    mov esi, [ebp-4]        ; Get the topmost level's (E)BP.
                            ; Next level (E)BP is [ebp-8].
    mov eax, ss:[esi-8]     ; Get the first variable of the topmost level.
                            ; If the first variable is a 32-bit value, the second
                            ; variable will be at [esi-0Ch]

It is pretty rare to find a program that uses ENTER at a level above 0. A program compiled with Clarion is probably an exception to this. As you can see from the description above, ENTER with a level above 0 requires more space in the stack frame.

The LEAVE instruction simply removes the current stack frame from the stack by restoring the previous (E)SP and (E)BP.

It is essential that every pushed value is cleaned up before the function returns to its caller. Otherwise, the ret instruction will pop a wrong value into (E)IP. For this reason, a function epilogue looks much like the prologue code, only this time it does the reverse of what the prologue did. For our example above, the code might look like this:

    ....
    pop  edi
    pop  esi
    add  esp, xxxx
    pop  ebp
    ret

or, using LEAVE instruction:

    ....
    pop  edi
    pop  esi
    leave
    ret

The code above is a full-blown prologue and epilogue. In the real world, some of it may be missing. If the function doesn't need any local variables at all, the compiler may not bother to set up a stack frame. In a 32-bit program, even when a function uses local variables, the compiler still might not set up an EBP stack frame. The 32-bit addressing modes allow the compiler to use ESP to address arguments and local variables. And, of course, there's always the possibility that the program was written in assembly language. With assembly, all bets are off. The programmer is in complete control.
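When the disassembler has missed a function, searching the raw code bytes for the standard prologue is one quick way to get candidates. The Python sketch below assumes 32-bit code and looks for push ebp (55h) followed by either common encoding of mov ebp, esp (8Bh ECh or 89h E5h). The name find_prologues is made up, and the result is only a list of candidates: the same bytes can occur inside data or in the middle of another instruction, and functions without an EBP frame will not be found at all.

```python
# push ebp followed by mov ebp, esp, in the two encodings compilers emit
PROLOGUES = (b"\x55\x8b\xec", b"\x55\x89\xe5")

def find_prologues(code, base=0):
    """Return the addresses in `code` (loaded at `base`) where a standard prologue starts."""
    hits = []
    for pattern in PROLOGUES:
        pos = code.find(pattern)
        while pos != -1:
            hits.append(base + pos)
            pos = code.find(pattern, pos + 1)
    return sorted(hits)
```

Feed it the bytes of the code section and its load address, then check each hit by eye: a real function start is usually preceded by a ret, a jmp, or padding.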

Functions Called through MFC Message Map Macros

This material is probably not for someone who has just begun their venture into dead listing. However, since many programs these days use MFC, and several compilers already support it (including Watcom and Symantec), I think it's still an important thing to know. You can skip this if you want to.

There are several ways to find the information: using a hex editor, using an exe dumper to dump the .rdata section, or using IDA Pro's Search In Core function. It will also help if you understand what a message map is. I won't discuss it here. If you are not an MFC programmer, you can read about it in the Visual C++ Developer Journal article written by George Shepherd and Scot Wingo.

The message map entry that we're after is AFX_MSGMAP_ENTRY. It is defined as:

    struct AFX_MSGMAP_ENTRY
    {
        UINT nMessage;
        UINT nCode;
        UINT nID;
        UINT nLastID;
        UINT nSig;
        AFX_PMSG pfn;
    };

The first field identifies the Windows message that is coming from the system. The message definitions are the same as in the SDK. The most important message for our purpose is WM_COMMAND, which is defined as 0111h and is sent by Windows every time we click a menu item or a button. The second field holds the control notification code (or the WM_NOTIFY code). The third field is the starting control ID, and the fourth field is the last control ID. If the controls are in a series (e.g. a radio button group or a cascading menu), the first item will be in nID and the last item in nLastID. The fifth field is the signature of the function that handles the message. And the last field is a pointer to the function that handles the message. Knowing this, we can now locate the function for a particular button or menu item.

The first thing you want to do is extract the resources from the executable that you are disassembling. I use Visual C++'s resource editor to do that. Locate the resource that you are interested in, and take note of its ID. If you want to follow the example, you have to disassemble an MFC program. Any MFC program will do.

Say, for example, you want to know which function a program uses for its Save menu item. Looking through your resource editor, you find that the target's Save menu ID is 57603.

In IDA Pro, go to the .rdata section (View, Segment, select .rdata from the list). Then use IDA's "Search for Text in Core..." function (Alt+B). Make sure that it will search "down". If it doesn't, click Cancel, and use the TAB key to toggle the search direction. Type 57603 for the search string, click decimal, and click OK. IDA will stop at something that looks similar to this:

    ....
    0045C738                db    3  <- cursor here
    0045C739                db 0E1h
    0045C73A                db    0
    0045C73B                db    0
    0045C73C                db    3
    0045C73D                db 0E1h
    0045C73E                db    0
    0045C73F                db    0
    0045C740                db  0Ch
    0045C741                db    0
    0045C742                db    0
    0045C743                db    0
    0045C744                db  80h
    0045C745                db  32h
    0045C746                db  42h
    0045C747                db    0
    ....

Convert the value at the cursor location to a decimal dword. The fastest way to do it is to press "o" followed by "h" (o converts it to a dword offset, h converts the offset to a decimal dword. Not the correct way, but it works, and it's faster too :). You should see 57603. If you don't, continue searching. If you do, convert the values before and after it in the same way. You should see something similar to this:

    ....
    0045C730                dd 111h
    0045C734                dd 0
    0045C738                dd 57603
    0045C73C                dd 57603
    0045C740                dd 0Ch
    0045C744                dd offset loc_423280
    ....

At offset 423280, you will find the function that handles this message. But that's not the end of it. Since the same ID might be used several times by different classes, you have to continue your search until you're not in the .rdata segment anymore.
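If you have dumped the .rdata section to a file, the same search can be scripted. The Python sketch below is only an illustration under stated assumptions: it assumes the 32-bit layout of AFX_MSGMAP_ENTRY shown earlier (six 4-byte fields, entries dword aligned), and the names ENTRY and find_handlers are made up. Like the manual search, it can produce false hits, so check the surrounding entries.

```python
import struct

# nMessage, nCode, nID, nLastID, nSig, pfn: six little-endian dwords (assumed layout)
ENTRY = struct.Struct("<6I")
WM_COMMAND = 0x0111

def find_handlers(rdata, control_id):
    """Return (offset, handler address) for WM_COMMAND entries that cover control_id."""
    hits = []
    for off in range(0, len(rdata) - ENTRY.size + 1, 4):   # dword-aligned scan
        msg, code, first, last, sig, pfn = ENTRY.unpack_from(rdata, off)
        if msg == WM_COMMAND and first <= control_id <= last and pfn != 0:
            hits.append((off, pfn))
    return hits
```

Called with the section bytes and 57603, it would report the entry from the listing above together with its handler address, 423280h. The offsets it returns are relative to the start of the dump, so add the section's address to find the entry in your disassembler.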

After you have searched for all occurrences of this ID, you can verify that you are looking at the correct place by examining the surrounding byte sequence. Convert it in the same way. If you are looking for a button in a dialog, you will see the IDs of the other buttons in the same dialog around it. When searching for a menu item, remember that if a program has several classes derived from CView, there will probably be multiple places where the same menu ID appears. Experiment with it. Maybe someday, somewhere, a tool like FRMSPY or EXE2DPR will be developed for this purpose (by you, my dear friend? ;). But until that day, this is the fastest way that I know of. It's even faster than using a debugger. You could use Winice's stack command to search for it, but you would still have to compare several stack results to find the exact location. I'm still a beginner in MFC though. So, if you know another way, please mail me, so we can include it in this tut.

Function Return Value

There's not much to say about a function's return value. In a 32-bit program, functions return their value in EAX. A 16-bit program uses AX for a 16-bit value, and the combination DX:AX for a 32-bit value. If, however, the program is written in assembly language, a function can return its value anywhere. One common assembler convention is that a boolean function sets or clears the carry flag (CF) as appropriate. If this is the case, look at the code following the call for a JC or JNC.

Function Arguments

Before we jump into a discussion of how to recognize a function's arguments, we should know some of the calling conventions out there. With the exception of the fastcall calling convention, the compiler passes arguments to a called function on the stack. Knowing the different calling conventions will help us figure out where the arguments are. A calling convention dictates how, in assembly language, arguments are passed into a function and how the stack is cleaned up when the function returns. Below is a table of some calling conventions.

Calling Convention	Argument Passing	Stack Maintenance	Name Decoration (C only)	Notes
__pascal	Left to right.	Called function pops its own arguments from the stack.	mail me	Used for almost all exported functions in 16-bit Windows.
__cdecl (C Calling Convention)	Right to left.	Calling function pops arguments from the stack.	Underscore prefixed to function names. Ex: _Foo.	Used in the CRT (C runtime library).
__stdcall	Right to left.	Called function pops its own arguments from the stack.	Underscore prefixed to function name, @ appended followed by the number of bytes in the argument list, in decimal. Ex: _Foo@12.	Used for almost all exported functions in Win32.
__fastcall	First two DWORD arguments are passed in ECX and EDX, the rest are passed right to left.	Called function pops its own arguments from the stack.	A @ is prefixed to the name, @ appended followed by the number of bytes in the argument list, in decimal. Ex: @Foo@12.	Because it uses specific registers, it only applies to Intel CPUs. Borland compilers (incl. Delphi) have a similar register convention, but it uses EAX, EDX, and ECX.
thiscall	this pointer put in ECX, arguments passed right to left.	Called function pops its own arguments from the stack (as generated by Microsoft compilers).	None.	Used automatically by C++ code.
naked	Right to left.	Calling function pops arguments from the stack.	None.	Only used by VxDs.

Note: This table is taken from John Robbins' article in MSJ. Although the "Penguin guy" rarely makes a mistake, I still checked the results with my compiler. If you want to check them with a C++ compiler, use extern "C" for all the functions to prevent C++ name mangling. My compiler (MSVC 4.2) also limits the calling conventions that I can use. There's no way for me to declare a Pascal-style function. If you know what the Pascal-style function name decoration is, then please mail me.
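The Name Decoration column is mechanical enough to write down as code. The Python sketch below covers only the three conventions whose decoration is described in the table. The function names decorate and stdcall_arg_count are made up, and the argument count assumes that every argument occupies one dword on the stack.

```python
def decorate(name, convention, arg_bytes=0):
    """Build the C decorated name for `name` under the given calling convention."""
    if convention == "cdecl":
        return "_" + name
    if convention == "stdcall":
        return "_%s@%d" % (name, arg_bytes)
    if convention == "fastcall":
        return "@%s@%d" % (name, arg_bytes)
    raise ValueError("decoration not covered by this sketch: " + convention)

def stdcall_arg_count(decorated):
    """Read the byte count after the last @ and convert it to a dword count."""
    size = decorated.rpartition("@")[2]
    return int(size) // 4
```

Going the other way is the useful direction when reading an import or export list: a name like _Foo@12 tells you that Foo is stdcall and takes 12 bytes, that is, three dword arguments.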

I'll explain the two most important columns for our purpose: Argument Passing and Stack Maintenance. Besides, the other columns are already written "in English" :)

The Argument Passing column tells us how the arguments are pushed onto the stack before the program calls the function. If the arguments are pushed from left to right, then for a function that is called as:

    some_function (0x1000, 0x2000, 0x3000);

it would look like:

    push 1000h
    push 2000h
    push 3000h
    call some_function
    ...

Similarly, if the arguments are pushed from right to left, it would look like:

    push 3000h
    push 2000h
    push 1000h
    call some_function
    ...

If some of the arguments are passed in registers (e.g. fastcall style):

    push 3000h
    mov  edx, 2000h
    mov  ecx, 1000h
    call some_function
    ...

You shouldn't expect the pushes to always come together one after another. Say, for example, you want to allocate memory on the heap in a Win32 program. To do that, you can use the HeapAlloc function. But rather than creating another heap, you want to use the process's heap instead. You might code it as:

    LPVOID lpMem = HeapAlloc(GetProcessHeap(), 0, 1024);

And in the disassembly, it might show up as:

    push 400h               ; HeapAlloc argument (dwBytes = 1024)
    push 0                  ; HeapAlloc argument (dwFlags)
    call ds:GetProcessHeap  ; GetProcessHeap takes no arguments, so the two
                            ; values pushed before are left alone by the function
    push eax                ; Push the heap HANDLE returned by GetProcessHeap.
    call ds:HeapAlloc       ; Now HeapAlloc's arguments are complete, call it
    mov  [ebp-20], eax      ; save the returned LPVOID pointer in a local variable
    ...
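One way to sort out interleaved pushes is to keep a list of pending values and let each call take as many as it needs from the top. The Python sketch below does that for the listing above. It assumes you already know how many stack arguments each function takes and that each call consumes its own arguments (as stdcall functions do). The name match_arguments and the tuple format of the listing are invented for the illustration.

```python
def match_arguments(listing, arg_counts):
    """Assign pushed operands to calls.

    listing    : list of ("push", operand) or ("call", function name)
    arg_counts : function name -> number of stack arguments
    Returns a list of (function name, [arguments in declaration order]).
    """
    pending = []   # values pushed but not yet claimed by a call
    calls = []
    for op, operand in listing:
        if op == "push":
            pending.append(operand)
        elif op == "call":
            count = arg_counts[operand]
            # the last value pushed is the first (leftmost) parameter
            args = [pending.pop() for _ in range(count)]
            calls.append((operand, args))
    return calls
```

For the listing above, GetProcessHeap claims nothing, and HeapAlloc ends up with eax, 0, and 400h, in that order, which matches HeapAlloc(GetProcessHeap(), 0, 1024).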

Now, for the Stack Maintenance column. Do you still remember our discussion about the ret instruction, where (E)SP is put back to the value it had before all the argument pushing occurred? Well, simply said, that is what stack maintenance is: who is responsible for restoring the value of (E)SP.

If the stack is maintained by the called function, we can look at the end of the function for code like this (in this example the function is a 32-bit function and has three arguments passed to it, 32 bits each):

    ...                    ; this is a 32-bit code
    pop  edi
    pop  esi
    add  esp, 20h
    ret  0Ch               ; clean up the stack, pop 12 bytes from it

If it's the caller that is responsible for stack maintenance, the caller's code might look like this:

    ...                    ; this is a 32-bit code
    push 3000h
    push 2000h
    push 1000h
    call some_function
    add  esp, 0Ch          ; clean up the stack, pop 12 bytes from it
    ...

But you shouldn't expect add esp, xxxx after every call. Sometimes, when the called function has only one or two arguments, the compiler might just pop the stack into an unneeded register, such as:

    ...                    ; this is a 32-bit code
    push 1000h             ; only one argument
    call little_function
    pop  edx               ; clean up the stack, pop 4 bytes from it.
                           ; edx now contains an unneeded value
    mov  edx, [ebp-20]     ; load edx for another purpose
    ...

Stack maintenance helps us figure out how many arguments the function we are examining takes (remember that the pushes don't always come one after another). If a 32-bit function removes 10h bytes from the stack, then it must have 4 dword arguments. One point worth remembering: virtually all Win32 API function arguments are 32 bits wide.
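The arithmetic is simple, but hex immediates are easy to misread, so here is a small Python sketch that turns a clean-up instruction into an argument count. It only understands ret N and add esp, N, it assumes every argument is one dword, and the names parse_imm and stack_arg_count are made up.

```python
def parse_imm(text):
    """Parse an assembler immediate such as '0Ch', '10h' or '12'."""
    text = text.strip().lower()
    if text.endswith("h"):
        return int(text[:-1], 16)
    return int(text)

def stack_arg_count(instruction, slot=4):
    """Number of stack arguments implied by 'ret N' or 'add esp, N'."""
    mnemonic, _, operands = instruction.strip().partition(" ")
    if mnemonic.lower() not in ("ret", "retn", "add"):
        raise ValueError("not a clean-up instruction: " + instruction)
    imm = operands.split(",")[-1].strip()   # the immediate is the last operand
    size = parse_imm(imm) if imm else 0     # a bare ret removes nothing extra
    return size // slot
```

So ret 0Ch and add esp, 0Ch both mean three arguments, and add esp, 10h means four.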

If you examine the Stack Maintenance column, you will notice that the caller does the clean up in the C calling convention. One of the reasons is the fact that in a C program a function might take a variable number of arguments. The printf function is an example of this. Since it's impossible for the called function to know in advance how many arguments will be passed to it, it can't clean up the stack itself. Therefore the burden of stack maintenance falls on the caller. This is also the reason why some Win32 API functions use the C calling convention instead of stdcall. Take a look at a program of yours that calls wsprintf or a similar function.

When you are disassembling a program and want to know where the arguments to a function are placed, the first thing to do is figure out which calling convention the function uses. After that, the rest is quite a snap. Let's look at a real-world example (you are probably getting tired of push 1000h by now :). A dialog procedure in Win32:

    LRESULT CALLBACK AboutProc(HWND hDlg, UINT msg,
                               WPARAM wParam, LPARAM lParam)

After the function has executed its prologue code (i.e. after mov ebp, esp has executed), the stack will look like this:

Stack content	Location	Description
`lParam`	`[EBP+14h]`	Pushed by the caller
`wParam`	`[EBP+10h]`	Pushed by the caller
`msg`	`[EBP+0Ch]`	Pushed by the caller
`hDlg`	`[EBP+08h]`	Pushed by the caller
`return EIP`	`[EBP+04h]`	Pushed by CALL instruction
`Previous EBP`	`[EBP+00h]`	Pushed by the prologue code

If we find code in our function that uses the address [EBP+08h], we know that it is using hDlg. Therefore, we can replace [EBP+08h] with [hDlg]. Similarly, we can replace [EBP+0Ch] with [msg]. The key point here is (you should remember this):

When a function is using an (E)BP stack frame, the function's arguments will have a positive offset from (E)BP

This is true for 16-bit code too. Only, in 16-bit code, arguments are 16 bits wide, and if it is a far call, the return CS is also pushed onto the stack. CS is pushed first (ending up at [BP+04h]), followed by IP (at [BP+02h]).
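The offsets in the table follow a simple rule, which the Python sketch below writes out. It assumes a standard (E)BP frame, arguments pushed right to left, a near call, and one stack slot per argument. The function name ebp_argument_map is invented.

```python
def ebp_argument_map(param_names, slot=4):
    """Map parameter names (in declaration order) to their offsets from (E)BP.

    Two slots are skipped: the saved (E)BP at offset 0 and the return
    address of a near call just above it.
    """
    first = 2 * slot
    return {name: first + i * slot for i, name in enumerate(param_names)}
```

For AboutProc this gives 08h, 0Ch, 10h, and 14h for hDlg, msg, wParam, and lParam. With slot=2 it gives the 16-bit near-call layout, where the first argument is at [BP+04h]; for a far call you would have to add one more slot for the saved CS.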

In identifying arguments (and variables too), knowing some Win32 programming practice is also immensely helpful. For example, programmers probably won't cast a handle. If the calling routine called Win32's GetParent() function and passed the result to the function that we are examining, it will usually stay an hwndParent throughout the function. Similarly, a handle returned by the CreateFile() function will stay a file handle. If you disassemble for cracking, and the calling function passes hwndParent just to set up the registration dialog box, you probably won't need to examine how it is used. In practice, this saves our eyes from examining some of the functions that are called with this argument. On the other hand, a pointer can be cast to another type inside the function. This won't show up in our disassembly. In assembly language, 32 bits is 32 bits, whether it is a pointer to a string or a pointer to a structure. So, you probably want to be careful about renaming it.

Alas, sometimes it is not that simple. In a 32-bit program, the compiler can use ESP to address arguments, such as [ESP+20h]. When you find this, you should be very careful about renaming it, because the value of ESP changes whenever a value is pushed onto or popped from the stack. A string pointer that is found at [ESP+20h] might be at [ESP+28h] later, after the function has pushed two DWORDs onto the stack. Fortunately for us, there are disassemblers like IDA Pro. In IDA Pro, arguments and local variables can be renamed inside the disassembler. And whether they are addressed with EBP or ESP, IDA Pro will rename the right one.
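If your disassembler doesn't do this bookkeeping for you, you have to do it yourself. The Python sketch below shows the idea for the simplest case: it only understands dword push and pop, ignores everything else (including sub esp, add esp, and calls), and the name track_esp_operand is made up.

```python
def track_esp_operand(offset, instructions):
    """Follow one stack slot, initially at [ESP+offset], through a list of instructions.

    Returns (instruction, offset) pairs, where offset is the ESP-relative
    offset that reaches the same slot after that instruction has executed.
    """
    history = []
    for ins in instructions:
        mnemonic = ins.split()[0].lower()
        if mnemonic == "push":
            offset += 4   # ESP moved down, so the slot is now further away
        elif mnemonic == "pop":
            offset -= 4   # ESP moved up, so the slot is closer again
        history.append((ins, offset))
    return history
```

Starting from [ESP+20h], two pushes move the same slot to [ESP+28h], exactly as described above, and a later pop brings it back to [ESP+24h].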

Identifying Variables

Local Variables

A function's local variables are also stored on the stack. Sometimes the compiler will use registers heavily to store local variables. However, because some registers are used implicitly by certain operations (like EDI in MOVSD, ECX in instructions that require a counter, EAX in IDIV), keeping local variables in registers throughout a function is quite rare.

One important thing to remember is:

When a function is using an (E)BP stack frame, the function's local variables will have a negative offset from (E)BP

You must remember, though, that if the function uses ESP to address its arguments, it will also use ESP to address its local variables. When the function uses ESP, both arguments and local variables have a positive offset from it. When examining local variables, the calling convention used by the function makes no difference. The exception to this is the naked calling convention, where it's the programmers themselves who write the prologue and epilogue in assembly language.

Determining the type of a local variable is not as easy as determining the type of an argument. When working with arguments, we know whether one is 16 bits or 32 bits wide by examining the push instruction. To determine a local variable's type, we have to look at how that particular variable is used. I'll tell you how we can examine the type of local variables later in this essay. We can, of course, determine how many bytes the function uses for local variables. When we find the instruction sub esp, xxxx in the function prologue, xxxx is the number of bytes the function requires for local variables.

Global Variables

Determining the global variables in a function is quite easy. Global variables don't need the assistance of EBP or ESP to address them. Thus, if we find an instruction such as:

    mov eax, [00421EB0]

we know that the function is addressing a global variable. However, determining the type of the variable is just like determining the type of a local variable: you have to find out how the function uses it. Even when your disassembler tries to convert it to the correct type (based on the instructions that the disassembler found where the global variable is addressed), you shouldn't always trust it. Your disassembler is just trying to help. It won't replace our eyes, brain, and experience :) When examining a 16-bit program, you should be careful about renaming a global variable. You have to make sure first that the segment register (DS or sometimes ES) does indeed point to the correct segment.
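The three rules of thumb (positive EBP offset, negative EBP offset, absolute address) can be put together in one small classifier. The Python sketch below is only that, a rule of thumb in code: it assumes 32-bit code with an EBP frame, treats all offsets as hex, and the name classify_operand and its result strings are made up. ESP-relative and register-indirect operands are reported as unknown, because they need the extra care described earlier.

```python
import re

def classify_operand(operand):
    """Guess what a memory operand refers to, using the rules of thumb above."""
    text = operand.strip().lower().replace(" ", "")
    m = re.fullmatch(r"\[ebp([+-])([0-9a-f]+)h?\]", text)
    if m:
        if m.group(1) == "-":
            return "local"
        # [ebp+0] is the saved EBP and [ebp+4] the return address
        return "argument" if int(m.group(2), 16) >= 8 else "frame data"
    if re.fullmatch(r"\[[0-9][0-9a-f]*h?\]", text):
        return "global"   # a plain absolute address
    return "unknown"
```

It calls [EBP+08h] an argument, [ebp-5B0h] a local, and [00421EB0] a global. What type each one has is still for you to find out.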

Finding Out the Variables Type

I already mentioned how useful the Win32 API is when determining the type of function arguments. Well, the same holds true for determining the type of local variables. When we examine a function's use of local variables, Win32 API calls should be the very first thing we search for in the function. Sometimes, just by renaming the variables that the function passes to the Win32 API, our disassembly listing will look much clearer. Consider the following snippet:

    ...
    push    offset 00423440
    lea     edx, [ebp-5B0h]
    push    edx
    push    offset 0041EC2C
    push    offset 0041EC34
    call    ds:WritePrivateProfileStringA
    ...

Turning to our trusty API documentation, we find that WritePrivateProfileString is declared as:

    BOOL WritePrivateProfileString(
        LPCTSTR lpAppName,    // pointer to section name
        LPCTSTR lpKeyName,    // pointer to key name
        LPCTSTR lpString,     // pointer to string to add
        LPCTSTR lpFileName    // pointer to initialization filename
        );

Remember that the arguments are pushed from right to left, so the first push is the last parameter. Now we know that at offset 00423440 we will find a string literal for the .INI filename, and that the local variable at [EBP-5B0h] is a string buffer holding the value to write for the key. Similarly, we can rename 0041EC2C and 0041EC34 to szKey and szSection respectively.

If the arguments can be used to identify a variable's type, so can the return value. Take a look at this example:
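The matching of pushes to parameters can be done mechanically. Below is a small sketch in C, assuming the arguments were pushed right to left as in the call above; the helper name pushes_to_params is hypothetical and not part of any tool.

```c
#include <assert.h>
#include <stddef.h>

/* Given the operands in the order they were pushed, fill params[] in
   declaration order. With right-to-left pushing, the last operand
   pushed is the first parameter. */
void pushes_to_params(const char *const *pushes, const char **params, size_t n)
{
    assert(pushes != NULL && params != NULL);
    for (size_t i = 0; i < n; i++)
        params[i] = pushes[n - 1 - i];
}
```

Applied to the four pushes above, this gives 0041EC34, 0041EC2C, [ebp-5B0h] and 00423440 for lpAppName, lpKeyName, lpString and lpFileName, in that order.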

    ...
    call    ds:_hread
    mov     [ebp-40h], eax
    ...

From the API documentation, we know that _hread returns the number of bytes read. From our earlier discussion, we know the return value will be in EAX. So [EBP-40h] must be a dword value that contains the count of bytes read. You will probably want to rename it dwRead. Many variables are used as temporaries, though, and how one is used may change in different parts of the program. Therefore, you should always check how the particular variable you renamed is used throughout the function. If it is used consistently in that manner, then your guess is probably right.

Recently, disassemblers like IDA Pro have taken this a step further. Not only will IDA Pro show us the Win32 API calls; with its FLIRT technology, it will also show us when the function we examine uses the CRT (C runtime library), MFC classes, Delphi functions, or Borland's VCL functions, even when they are statically linked. This is a wealth of information, of course. If you work with Delphi programs a lot, you might want to download the FLIRT library for Delphi from Peter Sawatzki's download page.

When the function is not using a known API, variables might not be easily recognized. One variable type that is easy to recognize is a boolean variable. Suppose the code you are examining contains instructions like:

    ...
    mov  eax, [ebp-40h]
    test eax, eax
    jz   loc_41453D
    ...

At [EBP-40h] you will probably find a variable of type BOOL. Another type that is easily recognized is the counter in a for loop. In C, a for loop is usually coded as:

    for (i = 0; i < 1024; i++) {
        // processing some variables
    }

Please examine the following snippet carefully. Pay attention to how [EBP-164h] is used:

        ...
        mov    dword ptr [ebp-164h], 0    ; Init [ebp-164h]
        jmp    short loc_41453D           ; Start from loc_41453D

loc_41452E:
        mov    ecx, [ebp-164h]            ; The [EBP-164h] again
        inc    ecx                        ; increment
        mov    [ebp-164h], ecx            ; and saved

loc_41453D:
        cmp    dword ptr [ebp-164h], 400h ; Is [ebp-164h] >= 1024 ?
        jnb    short loc_414567           ; If not below (unsigned), jump
        ...                               ; Some code that works with
        ...                               ; other variables
        ...                               ; Snipped for brevity
        jmp    short loc_41452E           ; jump back

loc_414567:
        ...
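For comparison, here is one way to read this listing back into C, label by label. It is only a sketch: the function names and the body_runs counter are hypothetical stand-ins for the snipped loop body, and the counter is declared unsigned because jnb is an unsigned comparison.

```c
#include <assert.h>

/* Follows the listing jump by jump, using goto for each label. */
unsigned loop_as_listed(unsigned limit)
{
    unsigned i;              /* plays the role of [ebp-164h]     */
    unsigned body_runs = 0;  /* stand-in for the snipped body    */

    i = 0;                   /* mov dword ptr [ebp-164h], 0      */
    goto check;              /* jmp short loc_41453D             */
increment:                   /* loc_41452E                       */
    i++;                     /* mov / inc / mov                  */
check:                       /* loc_41453D                       */
    if (i >= limit)          /* cmp ..., 400h / jnb loc_414567   */
        goto done;
    body_runs++;             /* the loop body                    */
    goto increment;          /* jmp short loc_41452E             */
done:                        /* loc_414567                       */
    return body_runs;
}

/* The same control flow written the way a C programmer would. */
unsigned loop_as_source(unsigned limit)
{
    unsigned body_runs = 0;
    for (unsigned i = 0; i < limit; i++)
        body_runs++;
    return body_runs;
}

/* Sanity check for this sketch: both forms run the body equally often. */
void check_loops_agree(unsigned limit)
{
    assert(loop_as_listed(limit) == loop_as_source(limit));
}
```

Notice that the condition is tested before the first pass through the body, which is why the listing jumps over the increment on entry.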

Do you see how it works? You're right, [EBP-164h] is the counter. If you're not a C programmer, you probably don't know one funny thing: a C programmer rarely uses a counter for a different purpose. If you find a for statement in your disassembly, you can often bet that the same variable will be used as a counter again when the function does another for loop. The one other thing that programmers will use it for is probably an index into an array. The code below is snipped from the same routine:

        mov    eax, [ebp-164h]
        mov    edx, ds:00423194[eax*4]

In the code above, [EBP-164h] contains an index into a global array at offset ds:00423194h. The array's element type is a 32-bit integer. That's why the index is scaled by four, the size of one element in bytes (the [eax*4] part). If the array is of type CHAR (1 byte per element), the code will look like this:

        mov    eax, [ebp-164h]
        mov    edx, ds:00423194[eax]

If both the index and the array are local variables, the code might not be as clear as the code above. An array of type DWORD will probably disassemble to something like:

        mov    eax, [ebp-164h]
        mov    edx, [ebp+eax*4-40h]

In this code, we find an array of type DWORD starting at [EBP-40h].

I guess this will be enough for now. There are many ways in which a compiler might generate assembly code, and there's no way for me to write down all that I have found. Keep practicing. That's the only sure way. Disassembling is imprecise. You shouldn't expect to recognize every variable that a function uses. Although I myself prefer dead listing to the live approach, we should always remember that disassembly is just one of the options we can use. It makes no sense to wait for the disassembler to disassemble a 3 MB program if we can find the information in seconds using a debugger. Keep a steady eye on the reason why you disassembled this program in the first place. It's too easy to get lured by a function that is not important to us, just like you get lured by a function seen inside your debugger: instead of sticking with your breakpoint, you examine the function, only to find out that it's a call to GetLastError (do you still remember that experience? ;).
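The address arithmetic behind all three forms is the same: base plus index times element size plus a displacement. A small sketch in C follows; the function name effective_address is hypothetical, and any concrete register values you feed it are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Computes base + index*scale + disp, the way the processor resolves an
   operand such as [ebp+eax*4-40h]. The scale is the element size:
   1 for CHAR, 4 for DWORD. */
uintptr_t effective_address(uintptr_t base, uintptr_t index,
                            unsigned scale, intptr_t disp)
{
    /* x86 addressing only allows these four scale factors. */
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    return base + index * scale + (uintptr_t)disp;
}
```

For instance, with a made-up EBP of 1000h and an index of 2, [ebp+eax*4-40h] resolves to 1000h + 8 - 40h = 0FC8h, which is the third DWORD of the array that starts at [EBP-40h]. So when you meet a scaled index, the scale tells you the element size and the displacement tells you where the array begins.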

by Rhayader - September 30th, 1998