May 06, 1994


Technical information "heuristic scanning"






Combatting viruses heuristically
================================

Generally speaking, there are two basic methods to detect viruses: specific 
and generic.  Specific virus detection requires the anti-virus program to have 
some pre-defined information about a specific virus (like a scan string).  The 
anti-virus program must be updated frequently so that it detects new viruses 
as they appear.  Generic detection methods, however, are based on generic 
characteristics of viruses, so in theory they are able to detect every virus, 
including new and unknown ones.
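
As a rough illustration of specific detection, the sketch below searches a 
file image for known scan strings.  The signature names and byte values are 
invented for this example (they are not real virus signatures), and a real 
scanner is considerably more elaborate.

```python
from typing import Dict, List

# Hypothetical scan strings: invented byte patterns, NOT real virus signatures.
SIGNATURES: Dict[str, bytes] = {
    "Example.A": bytes.fromhex("deadbeef0102"),
    "Example.B": bytes.fromhex("cafe90909090"),
}


def scan_specific(data: bytes, signatures: Dict[str, bytes] = SIGNATURES) -> List[str]:
    """Return the names of all signatures whose scan string occurs in data."""
    return [name for name, pattern in signatures.items() if pattern in data]


# A synthetic "infected" image and a mutant with one byte of the pattern changed.
infected = b"\x00" * 16 + bytes.fromhex("deadbeef0102") + b"\x00" * 16
mutant = infected.replace(b"\xbe\xef", b"\xbe\xee")

print(scan_specific(infected))
print(scan_specific(mutant))
```

The second call shows the weakness of the specific approach: a one-byte change 
inside the scan string is enough to defeat the match until the signature is 
updated.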

Why is generic detection gaining importance?  There are four reasons:
1)  The number of viruses increases rapidly.  Studies indicate that the total 
number of viruses doubles roughly every nine months.  The amount of work for 
the virus researcher increases, and so does the chance that someone will be 
hit by one of these new, not yet recognizable viruses.

2)  The number of virus mutants increases.  Virus source codes are widely 
spread and many people can't resist the temptation to experiment with them, 
creating many slightly modified viruses.  These modified viruses may or may 
not be recognized by the anti-virus product.  Sometimes they are, but 
unfortunately often they are not.

3)  The development of polymorphic viruses.  Polymorphic viruses like MTE and 
TPE are more difficult to detect with virus scanners.  It often takes months 
after a polymorphic virus has been discovered before a reliable detection 
algorithm is developed.  In the meantime many users have an increased chance 
of being infected by that virus.

4)  Viruses directed at a specific organization or company.  It is possible 
for individuals to utilize viruses as weapons.  By creating a virus that only 
works on machines owned by a specific organization or company it is very 
unlikely that the virus will spread outside of the organization.  Thus it is 
very unlikely that any virus scanner will be able to detect the virus before 
the payload of the virus does its destructive work and reveals itself.

Each of these scenarios demonstrates the fact that virus scanners can not 
recognize a virus until the virus has been discovered and analyzed by an 
anti-virus vendor.


These scenarios do not apply to generic detectors, which is why many people 
are becoming more interested in generic anti-virus products.  Of the many 
generic detection methods, heuristic scanning is currently becoming the most 
important.


Heuristic scanning
------------------

One of the most time-consuming tasks that a virus researcher faces is the 
examination of files.  People often send files to researchers because they 
believe the files are infected by a new virus.  Sometimes these files are 
indeed infected, sometimes not.  Every researcher is able to determine very 
quickly what is going on by loading the suspected file into a debugger.  A few 
seconds is often enough, and many researchers must have asked themselves: "How 
can I determine this so quickly?"

I once demonstrated this effect to the audience at an international 
conference.  I showed the first page of the assembly listing of an 
MTE-infected file, and within about a second Vesselin Bontchev came up with 
the correct answer.

How is this possible?

Artificial intelligence

One of the many differences between viruses and normal programs is that 
normal programs typically start by searching the command line for options, 
clearing the screen, etc.  Viruses, however, never search for command line 
options or clear the screen.  Instead they start by searching for other 
executable files, by writing to the disk, or by decrypting themselves.

A researcher who has loaded the suspected file into a debugger can notice this 
difference at a glance.  Heuristic scanning is an attempt to put this 
experience and knowledge into a virus scanner.

The word 'heuristic' means (according to a Dutch dictionary) 'the self 
finding' and 'the knowledge to determine something in a methodic way'.

A heuristic scanner is a type of automatic debugger or disassembler. The 
instructions are disassembled and their purposes are determined. If a program 
starts with the sequence
        MOV AH,5
        INT 13h
which is a disk format call to the BIOS, this is highly suspicious, especially 
if the program does not process any command line options or interact with the 
user.
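
A minimal sketch of this check, assuming a flat 16-bit code image whose entry 
point is at offset 0.  The byte values are the standard 8086 encodings of the 
two instructions (MOV AH,imm8 is B4 ib, INT imm8 is CD ib); the function names 
are made up for this example.  A real heuristic scanner disassembles the code 
and tracks register contents, so it would also catch equivalent sequences that 
this naive byte comparison misses.

```python
# Machine-code bytes for the two instructions quoted above (16-bit x86):
#   MOV AH,5  ->  B4 05
#   INT 13h   ->  CD 13
FORMAT_CALL = bytes([0xB4, 0x05, 0xCD, 0x13])


def starts_with_format_call(code: bytes) -> bool:
    """True if the code image begins with MOV AH,5 / INT 13h."""
    return code.startswith(FORMAT_CALL)


def format_call_near_entry(code: bytes, window: int = 32) -> bool:
    """True if the sequence occurs within the first `window` bytes.

    `window` is an arbitrary illustrative value, not taken from any scanner.
    """
    return FORMAT_CALL in code[:window]


sample = FORMAT_CALL + b"\xCD\x20"        # followed by INT 20h (terminate)
harmless = b"\xB4\x09\xCD\x21\xCD\x20"    # MOV AH,9 / INT 21h / INT 20h

print(starts_with_format_call(sample))
print(starts_with_format_call(harmless))
```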



Suspected abilities

In reality, heuristics is much more complicated.  The heuristic scanners that 
I am familiar with are able to detect suspicious instruction sequences that 
reveal abilities such as formatting a disk, searching for other executables, 
remaining resident in memory, issuing non-standard or undocumented system 
calls, etc.  Each of these abilities has a value assigned to it.  The value 
assigned to a suspicious ability depends on various factors.  A disk format 
routine doesn't appear in many normal programs, but often in viruses, so it 
gets a high value.  The ability to remain resident in memory is found in many 
normal programs, so despite the fact that it also appears in many viruses it 
doesn't get a high value.  If the total of the values for one program exceeds 
a predefined threshold, the scanner yells "Virus!".  A single suspected 
ability is never enough to trigger the alarm.  It is always the combination of 
suspected abilities that convinces the scanner that the file is a virus.
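
The scoring scheme described above can be sketched as follows.  The ability 
names, the weights and the threshold are assumptions chosen only for 
illustration; they are not the values used by any particular scanner.

```python
from typing import Dict, Set

# Hypothetical weights: rare-in-normal-programs abilities score higher.
WEIGHTS: Dict[str, int] = {
    "format_disk": 6,
    "search_executables": 4,
    "undocumented_call": 3,
    "stay_resident": 2,
}
THRESHOLD = 6  # hypothetical; the total must EXCEED this value


def is_suspicious(abilities: Set[str],
                  weights: Dict[str, int] = WEIGHTS,
                  threshold: int = THRESHOLD) -> bool:
    """Alarm only on a combination of abilities whose total exceeds the threshold."""
    if len(abilities) < 2:
        # A single suspected ability is never enough.
        return False
    total = sum(weights.get(name, 0) for name in abilities)
    return total > threshold


print(is_suspicious({"format_disk"}))
print(is_suspicious({"stay_resident", "undocumented_call"}))
print(is_suspicious({"format_disk", "stay_resident"}))
```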

Heuristic flags

Some scanners set a flag for each suspected ability which has been found in 
the file being analyzed.  This makes it easier to explain to the user what has 
been found.

TbScan for instance recognizes many suspected instruction sequences. Every 
suspected instruction sequence has a flag assigned to it:

Flag  Description
----  -----------
F   = Suspicious file access.  Might be able to infect a file.
R   = Relocator. Program code will be relocated in a suspicious way.
A   = Suspicious Memory Allocation.  The program uses a non-standard                 
      way to search for, and/or allocate memory.
N   = Wrong name extension.  Extension conflicts with program structure.
S   = Contains a routine to search for executable (.COM or .EXE) files.
#   = Found an instruction decryption routine.  This is common
      for viruses but also for some protected software.
E   = Flexible Entry-point.  The code seems to be designed to be         
      linked on any location within an executable file.  Common for 
      viruses.
L   = The program traps the loading of software.  Might be a
      virus that intercepts program load to infect the software.
D   = Disk write access.  The program writes to disk without using 
      DOS.
M   = Memory resident code.  This program is designed to stay in         
      memory.


!   = Invalid opcode (non-8088 instructions) or out-of-range             
      branch.
T   = Incorrect timestamp. Some viruses use this to mark                
      infected files.
J   = Suspicious jump construct.  Entry point via chained or             
      indirect jumps.  This is unusual for normal software but common 
      for viruses.
?   = Inconsistent exe-header. Might be a virus but can also be a bug.
G   = Garbage instructions.  Contains code that seems to have no         
      purpose other than encryption or avoiding recognition by virus 
      scanners.
U   = Undocumented interrupt/DOS call.  The program might be just        
      tricky but can also be a virus using a non-standard way to detect 
      itself.
Z   = EXE/COM determination.  The program tries to check whether a 
      file is a COM or EXE file.  Viruses need to do this to infect a 
      program.
O   = Found code that can be used to overwrite/move a program in memory.
B   = Back to entry point.  Contains code to re-start the program after 
      modifications at the entry-point are made.  Very common for viruses.
K   = Unusual stack. The program has a suspicious stack or an odd 
      stack.

TbScan would for instance output the following flags:
     Virus                      Heuristic flags
     -----                      ---------------
     Jerusalem/PLO:             FRLMUZ
     Backfont:                  FRALDMUZK
     Minsk_Ghost:               FELDTGUZB
     Murphy:                    FSLDMTUZO
     Ninja:                     FEDMTUZOBK
     Tolbuhin:                  ASEDMUOB
     Yankee_Doodle:             FN#ELMUZB

The more flags that are triggered by a file, the more likely it is that the 
file is infected by a virus.  Normal programs rarely trigger even one flag, 
while at least two flags are required to trigger the alarm.  To make it more 
complicated, not all flags carry the same 'weight'.
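
A small sketch of how a flag string from the table above could be turned back 
into an explanation for the user.  The flag letters and (shortened) 
descriptions are taken from the list above; the function name is made up, and 
the per-flag weights are deliberately left out because they are not given here.

```python
from typing import Dict, List

FLAGS: Dict[str, str] = {
    "F": "Suspicious file access",
    "R": "Relocator",
    "A": "Suspicious memory allocation",
    "N": "Wrong name extension",
    "S": "Searches for executable (.COM or .EXE) files",
    "#": "Instruction decryption routine",
    "E": "Flexible entry-point",
    "L": "Traps the loading of software",
    "D": "Disk write access without using DOS",
    "M": "Memory resident code",
    "!": "Invalid opcode or out-of-range branch",
    "T": "Incorrect timestamp",
    "J": "Suspicious jump construct",
    "?": "Inconsistent exe-header",
    "G": "Garbage instructions",
    "U": "Undocumented interrupt/DOS call",
    "Z": "EXE/COM determination",
    "O": "Code to overwrite/move a program in memory",
    "B": "Back to entry point",
    "K": "Unusual stack",
}


def explain(flag_string: str) -> List[str]:
    """Expand a flag string such as 'FRLMUZ' into one line per flag."""
    return [f"{flag} = {FLAGS.get(flag, 'unknown flag')}" for flag in flag_string]


for line in explain("FRLMUZ"):   # the flags listed for Jerusalem/PLO
    print(line)
```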

False positives
---------------

Just like all other generic detection techniques, heuristic scanners sometimes 
blame innocent programs for being contaminated by a virus. This is called a 
"False Positive" or "False Alarm".





The reason for this is simple.  Some programs happen to have several suspected 
abilities.  For instance, the LOADHI.COM file of QEMM has the following 
suspected abilities (according to an older, now obsolete version of TbScan):

A   = Suspicious Memory Allocation.  The program uses a                  
      non-standard way to search for, and/or allocate memory.
M   = Memory resident code.  This program may be a TSR but also a virus.
U   = Undocumented interrupt/DOS call.  The program might be just tricky 
      but can also be a virus using a non-standard way to detect itself.
Z   = EXE/COM determination.  The program tries to check whether a 
      file is a COM or EXE file.  Viruses need to do this to infect a 
      program.
O   = Found code that can be used to overwrite/move a program in         
      memory.

All of these abilities are available in LoadHi, and the flags are enough to 
trigger the heuristic alarm.  As LoadHi is supposed to allocate upper memory, 
load resident programs in memory, move them to upper memory, etc., all these 
suspected abilities can easily be explained and verified.  However, the 
scanner is not able to know the intended purpose of the program, and as most 
of these suspected abilities are often found in viruses, it just describes the 
LoadHi program as "a possible virus".

How serious is the issue of false alarms?

If a heuristic scanner pops up with a message saying: "This program is able to 
format a disk and it stays resident in memory", and the program is a resident 
disk format utility, is this really a false alarm?  Actually, the scanner is 
right.  A resident format utility obviously contains code to format a disk, 
and it contains code to stay resident in memory.  The heuristic scanner is 
therefore completely right!  You could call it a false suspicion, but not a 
false positive.  The only problem is that the scanner says the program might 
be a virus.  If you take that to mean the scanner has found a virus, it turns 
out to be a false alarm.  However, if you take the information as it is, 
saying 'OK, the facts you reported are true for this program, I can verify 
this, so it is not a virus', I wouldn't count it as a false alarm.  The 
scanner just tells the truth.  The main problem is the person who has to make 
decisions with the information supplied by the scanner.  If that person is a 
novice user, it is a problem.  More about that later.










Avoiding false positives

Whether we call it a false positive or a false suspicion doesn't matter.  We 
do not want the scanner to yell every time we scan, so we need to avoid this 
situation.  How do we achieve this?

1)  Definition of (combinations of) suspicious abilities
    The scanner does not issue an alarm unless at least two            
    separate suspected program abilities have been found.
2)  Recognition of common program codes
    Some known compiler codes or run time compression or               
    decryption routines can cause false alarms. These specific compression 
    or decryption codes can be recognized by the scanner to avoid false 
    alarms.
3)  Recognition of specific programs
    Some programs which normally cause a problem (like the LoadHi
    program used in the example) can be recognized by the              
    heuristic scanner.
4)  Assumption that the machine is initially not infected
    Some heuristic scanners have a 'learn' mode, i.e. they are able to 
    learn that a file causing a false alarm is not a virus.
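
A minimal sketch of such a 'learn' mode.  It assumes that learned files are 
remembered by a fingerprint of their contents, so that a file which changes 
later (for instance because it becomes infected) is no longer exempt.  The 
class name and the use of SHA-256 are choices made for this example only.

```python
import hashlib
from typing import Set


class LearnedExceptions:
    """Remembers files that a knowledgeable user has confirmed as clean."""

    def __init__(self) -> None:
        self._known_clean: Set[str] = set()

    @staticmethod
    def _fingerprint(contents: bytes) -> str:
        return hashlib.sha256(contents).hexdigest()

    def learn(self, contents: bytes) -> None:
        """Record that this exact file content causes a known false alarm."""
        self._known_clean.add(self._fingerprint(contents))

    def should_alarm(self, contents: bytes, heuristic_hit: bool) -> bool:
        """Suppress the alarm only for unchanged, previously learned files."""
        if not heuristic_hit:
            return False
        return self._fingerprint(contents) not in self._known_clean


exceptions = LearnedExceptions()
loadhi = b"stand-in bytes for a program that triggers the heuristics"
exceptions.learn(loadhi)

print(exceptions.should_alarm(loadhi, heuristic_hit=True))
print(exceptions.should_alarm(loadhi + b"appended code", heuristic_hit=True))
```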

Dealing with false positives

Some false positives are not easily avoided.  So the user has to deal with a 
certain number of false alarms, and must make the final decision as to whether 
a file is infected or not.

OK, you may say, but how do we know whether a suspicious program is a virus or 
innocent?  Most people believe there is no way to find out.  Actually there is 
a way, but it depends on the scanner.

The scanner has to explain to the user the reasons why the program is suspect. 
'This file might contain a virus' actually doesn't say much to the user. It is 
always right. Every file MIGHT contain a virus, but MAY also be clean. We 
actually use a scanner to find out! What is the user supposed to do with this 
information?

However, if the scanner says that some program is able to remain resident in 
memory and able to format a disk, the user can more easily figure out what is 
going on.  If a word processor gives such an alarm, it is extremely likely 
that the program carries a virus, because word processors generally are not 
able to format disks and remain resident in memory.  However, if the suspected 
file is a resident disk formatting utility, then all of the suspected 
abilities can be explained by the intended purpose of the program.




        Reason for suspicion: memory resident and disk formatting
        abilities.

        Program                         Probably
        ------------------------        --------
        Resident disk formatter         No Virus (innocent)
        Word processor                  Malicious (virus)

        Both programs cause the same heuristic alarms, but the
        final conclusion is different.


Naturally, it requires an advanced user to answer the question "infected or 
not?".  However, my opinion is that judging the results of any scanner 
(conventional scanners included) is a task for an advanced user only.  If the 
scanner has a 'learn' mode, i.e. is able to remember which programs cause a 
false alarm, the initial scan should be performed by an advanced user, but the 
subsequent scans (when the possible false positives have been eliminated) can 
be performed by a novice user.  This is already common practice in most 
organizations.

Anyway, it isn't as bad as it seems, as all other detection methods (including 
signature scanning) are known to cause some false alarms as well.  Heuristics 
however has the advantage that it is able to supply you with enough 
information to establish for yourself whether a suspected file is likely a 
virus or not.


How does heuristic scanning perform?
------------------------------------

Heuristics is a relatively new technique and still under development.  It is, 
however, gaining importance rapidly.  This is not surprising, as heuristic 
scanners are able to detect over 90% of the viruses without using any 
predefined information like signatures or checksum values.  The rate of false 
positives depends on the scanner, but a figure as low as 0.1% can be reached 
easily.

TbScan 6.02 used on the large virus collection of Vesselin Bontchev showed
the following results:

        Scanning        7210   detection
        method          files  percentage
        -------------   -----  ----------
        Conventional    7056    97.86%
        Heuristics      6465    89.67%




A false positive test, however, is more difficult to perform, so there are no 
independent results available.


Combination of conventional and heuristic scanning
--------------------------------------------------

Some people think heuristic scanning is a replacement for conventional 
scanning.  In my opinion it is not.  Heuristic scanning serves a very useful 
purpose when used in combination with conventional scanning.  The results of 
both scanning methods can be validated by each other, thereby reducing false 
positives and also false negatives.

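        Combined result of analysis:

        Heuristic    Conventional    Probability
        ---------    ------------    -------------------------
        clean        clean           very probably clean
        clean        virus           might be a false positive
        virus        clean           might be a false negative
        virus        virus           very probably infected

        fn: 10%      fn: 1%          combined false negatives: 0.1%
        fp: 0.1%     fp: 0.001%      combined false positives: 0.000001%

        (fn = false negatives, fp = false positives.  The combined figures
        are the products of the individual rates, which assumes that the
        two methods fail independently of each other.)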
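
The table above can be sketched in a few lines.  The sketch assumes that the 
two methods fail independently, so that the chance of both making the same 
error is the product of the individual rates; the function names are made up 
for this example.

```python
def combined_verdict(heuristic_hit: bool, conventional_hit: bool) -> str:
    """Map the two scan results onto the four rows of the table."""
    if heuristic_hit and conventional_hit:
        return "very probably infected"
    if conventional_hit:
        return "might be a false positive"
    if heuristic_hit:
        return "might be a false negative"
    return "very probably clean"


def both_fail(rate_a: float, rate_b: float) -> float:
    """Chance that two independent methods both make the same kind of error."""
    return rate_a * rate_b


combined_fn = both_fail(0.10, 0.01)        # fn: 10% and 1%
combined_fp = both_fail(0.001, 0.00001)    # fp: 0.1% and 0.001%

print(f"combined false negatives: {combined_fn:.1%}")
print(f"combined false positives: {combined_fp:.6%}")
```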

The chance of both the heuristic scanner and the conventional scanner failing 
is minimal.  If both scanning methods give the same result, that result is 
almost certain.  In the few cases where the results disagree, additional 
analysis is required.  TbScan 6.02 used on the large virus collection of 
Vesselin Bontchev showed the following results:

        Scanning        7210   detection
        method          files  percentage
        -------------   -----  ----------
        Conventional    7056    97.86%
        Heuristics      6465    89.67%
        Combined        7194    99.78%
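
The percentages in this table follow directly from the file counts.  The short 
check below recomputes them; it also derives two further numbers under the 
assumption (not stated explicitly above) that "Combined" counts every file 
detected by at least one of the two methods.

```python
TOTAL_FILES = 7210
DETECTED = {"Conventional": 7056, "Heuristics": 6465, "Combined": 7194}


def percentage(hits: int, total: int = TOTAL_FILES) -> float:
    """Detection rate in percent, rounded to two decimals as in the table."""
    return round(100.0 * hits / total, 2)


for method, hits in DETECTED.items():
    print(f"{method:<13} {hits:>5}  {percentage(hits):6.2f}%")

# Assumption: "Combined" = detected by at least one of the two methods.
only_heuristics = DETECTED["Combined"] - DETECTED["Conventional"]
missed_by_both = TOTAL_FILES - DETECTED["Combined"]
print("found by heuristics only:", only_heuristics)
print("missed by both methods:  ", missed_by_both)
```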














What can be expected from it in the future?
-------------------------------------------

The development continues

Most anti-virus developers still do not supply a ready-to-use heuristic 
analyzer.  Those who already have heuristics available are still improving it. 
It is, however, unlikely that the detection rate will ever reach 100% without 
a certain number of false positives.  On the other hand, it is unlikely that 
the rate of false positives will ever reach 0%.

Maybe you wonder why it isn't possible to achieve 100% correct results.  There 
is a large grey area between viruses and non-viruses.  Even for humans it is 
hard to describe what is a virus and what is not.  An often used definition of 
a computer virus is this: "A virus is a program that is able to copy itself". 
According to this definition the DiskCopy.Com program is a virus...


Reaction of virus writers

An important issue is the effect on virus writers.  It is likely that they 
will try to avoid detection by heuristic scanners.  Until now the goal was to 
avoid detection by signature scanners, and this was very easy to do, as it was 
sufficient to modify only a small part of an existing virus.  Teenagers with 
some basic understanding of programming could do so easily.  Avoiding 
heuristic scanners, however, requires a lot more knowledge, if it is possible 
at all.

Fortunately, this detection-avoiding style of programming makes detection by 
conventional anti-virus products easier, because it means that the programmer 
cannot use tight, straightforward code.  The virus writer will be forced to 
write more complex viruses.


The pros and cons of heuristic scanning
---------------------------------------
-   Advantages:
        -   Can detect 'future' viruses
        -   User is less dependent on product updates
-   Disadvantages:
        -   False positives are possible
        -   Judgement of the result requires some basic knowledge









Heuristic cleaning
==================

Before we can discuss heuristic cleaning, it is important to know how a virus 
infects a program.

The basic principle is not difficult.  A virus - a program in itself - adds 
itself to the end of the program.  The size of the program increases due to 
this addition of viral code.  Appending the virus to another program is not 
enough, however: the virus code must also be executed.  To make this happen, 
the virus overwrites the first bytes of the file with a 'jump' instruction, 
which makes the processor jump to the viral code.  The virus now gains control 
when the program is invoked, and it finally passes control to the original 
program.  Since the first bytes of the file have been overwritten by the jump 
instruction, the virus has to 'repair' these bytes first.  After that the 
virus just jumps to the beginning of the original program, and most often this 
program works as usual.


    original program                      infected program

    +--------------+                      +--------------+
    | p            |                 100: |jump          |
    | r            |                      |to 2487       |
    | o            |                      | o            |
    | g            |                      | g            |
    | r            |                      | r            |
    | a            |                      | a            |
    | m            |                      | m            |
    |              |                      |              |
    | c            |                      | c            |
    | o            |                      | o            |
    | d            |                      | d            |
    | e            |                      | e            |
    |              |                      |              |
    +--------------+                      +--------------+
                                    2487: |              |
                                          |  VIRUS!    p |
                                          |            r |
                                          |jmp 100       |
                                          +--------------+

To clean an infected program, it is of vital importance to restore the bytes 
that were overwritten by the jump to the virus code.  The virus has to restore 
these bytes too, so the original bytes are stored somewhere in the virus code. 
The cleaner searches for those bytes, puts them back in their original 
location, and truncates the file to its original size.






How does a conventional cleaner work?

A conventional cleaner has to know which virus to remove.  Suppose your system 
is infected with the Jerusalem/PLO virus.  You invoke your cleaner and it 
proceeds like this:

"Hey, this file is infected with the Jerusalem/PLO virus.  OK, this virus is 
1873 bytes in size, and it overwrites the first three bytes of the original 
program with a jump to itself.  The original bytes are located at offset 483 
in the viral code.  So, I have to take those bytes, copy them to the beginning 
of the file, and I have to remove 1873 bytes of the file.  That's it!"
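
The cleaner's reasoning can be sketched as below.  The record fields and 
function name are made up for this example, the numbers are simply those 
quoted in the text above, and the "infected" file is a synthetic byte string 
built only to exercise the logic (the jump bytes are a placeholder).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VirusRecord:
    """Pre-defined knowledge a conventional cleaner needs about one virus."""
    name: str
    virus_size: int      # bytes appended to the host program
    patched_bytes: int   # bytes overwritten at the start of the host
    saved_offset: int    # where the original bytes sit inside the viral code


def clean_known(infected: bytes, rec: VirusRecord) -> bytes:
    """Restore the patched bytes and cut off the appended viral code."""
    virus_start = len(infected) - rec.virus_size
    if virus_start < rec.patched_bytes:
        raise ValueError("file is too small for this virus record")
    saved_at = virus_start + rec.saved_offset
    original_head = infected[saved_at:saved_at + rec.patched_bytes]
    return original_head + infected[rec.patched_bytes:virus_start]


RECORD = VirusRecord("Jerusalem/PLO", virus_size=1873, patched_bytes=3, saved_offset=483)

# Synthetic test data: a 50-byte "host" and an all-zero stand-in for the virus.
host = bytes(range(10, 60))
virus = bytearray(RECORD.virus_size)
virus[483:486] = host[:3]                           # the saved original bytes
infected = b"\xE9\x00\x00" + host[3:] + bytes(virus)  # placeholder jump + rest

print(clean_known(infected, RECORD) == host)
```

Everything in the record has to match the virus exactly; if the viral code is 
a few bytes shorter or longer, both the restored bytes and the truncation 
point are wrong.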


Shortcomings of conventional cleaners

The cleaner has to know the virus it has to remove.  It is impossible to 
remove an unknown virus.

The virus must be exactly the same as the virus known to the cleaner.  Imagine 
what would happen if the virus used in the example had been modified and were 
now 1869 bytes in size instead of 1873... the cleaner would remove too much! 
This is not an exception; it happens quite often, since there are so many 
mutants.  For instance, the Jerusalem/PLO family now contains more than 100 
mutants!

Many polymorphic viruses have variable lengths and keep the original 
instructions encrypted.  Most conventional cleaners are therefore unable to 
clean MTE infected programs.


The virus will remove itself before actual execution

We have seen above how a virus works. The interesting part is that when the 
virus passes control to the original program it restores the original bytes at 
the beginning of the program and jumps back to start the program. Every virus 
is able to repair the original program in order to keep it functional (except 
for overwriting viruses, but these can't be cleaned anyway).


Let the virus do the dirty work

The idea is: why not let the virus do the dirty work?  The basic principle of 
heuristic cleaning is simple.  The heuristic cleaner loads the infected file 
and starts emulating the program code.  It uses a combination of disassembly, 
emulation and sometimes execution to trace the flow of the virus, and to 
emulate what the virus normally does.  When the virus restores the original 
instructions and jumps back to the original program code, the cleaner stops 
the emulation process, and says 'thank you' to the virus for its cooperation 
in restoring the original bytes.  The now repaired start of the program is 
copied back to the program file on disk, and the part of the file that gained 
'execution' (the viral code) is chopped off.  An additional analysis of the 
cleaned program file is performed to be on the safe side.

Note that the cleaner is actually removing the unknown from the 
unknown.  No predefined information about the virus or infected file is 
necessary.
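
The principle can be illustrated with a toy emulator.  Everything in this 
sketch is an assumption made for illustration: the two-instruction machine 
(JMP and COPY) is invented and is not x86, addresses are single bytes, and the 
'infection' is built by a helper in the same sketch.  It shows only the idea: 
run the code in an emulated memory image, wait until it repairs the entry 
point and jumps back to it, then keep the repaired part and cut off the rest. 
A step limit guards against code that never returns.

```python
from typing import Optional

# Hypothetical toy instruction set (NOT x86):
#   JMP addr         2 bytes: continue at addr
#   COPY dst src n   4 bytes: copy n bytes from src to dst
#   anything else    treated as "stop" by this emulator
JMP, COPY, HALT = 0x01, 0x02, 0xFF


def toy_infect(host: bytes) -> bytes:
    """Patch a JMP over the first two bytes and append restoring code."""
    start = len(host)
    saved_at = start + 6                      # after COPY (4) and JMP (2)
    tail = bytes([COPY, 0, saved_at, 2, JMP, 0]) + host[:2]
    return bytes([JMP, start]) + host[2:] + tail


def heuristic_clean(image: bytes, max_steps: int = 1000) -> Optional[bytes]:
    """Emulate until the code repairs offset 0 and jumps back to it.

    Returns the repaired, truncated image, or None if that never happens.
    """
    mem = bytearray(image)
    pc = 0
    tail_start: Optional[int] = None          # target of the first jump
    entry_repaired = False
    for _ in range(max_steps):                # never wait forever
        if pc + 1 >= len(mem):
            return None
        op = mem[pc]
        if op == JMP:
            target = mem[pc + 1]
            if target == 0 and entry_repaired:
                return bytes(mem[:tail_start])    # chop off the appended code
            if tail_start is None:
                tail_start = target
            pc = target
        elif op == COPY:
            if pc + 3 >= len(mem):
                return None
            dst, src, n = mem[pc + 1], mem[pc + 2], mem[pc + 3]
            if dst + n > len(mem) or src + n > len(mem):
                return None
            mem[dst:dst + n] = mem[src:src + n]
            if dst == 0 and tail_start is not None:
                entry_repaired = True
            pc += 4
        else:
            return None                       # HALT or unknown instruction
    return None


host = bytes([HALT, 0xAA, 0xBB, 0xCC])
print(heuristic_clean(toy_infect(host)) == host)
print(heuristic_clean(bytes([JMP, 0])))       # endless loop: gives up
```

Note that the cleaning function is never told the size of the appended code or 
where the original bytes are stored; it learns both by watching the emulated 
code run.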

The process of emulation is just like hitchhiking. The emulator 
convinces the viral code that it is actually executing, and it 
hitchhikes to the point where the virus passes control to the original 
program.

However, the actual process is very complicated. As with hitchhiking, many 
things can go wrong:

-   Driver takes you to the wrong place
    The virus does not intend to execute the original program, but it 
    starts doing completely different things.  As the purpose of the 
    emulation is to restore the original program, we never reach our goal.

-   Driver won't let you out
    If the viral code performs an endless loop, the original program will 
    never be restored, so the cleaner might wait forever.

-   Driver leaves the car
    A potentially dangerous situation is that the cleaner is too 
    ambitious in its task to emulate everything, and that the virus 
    gets control inside the emulated environment and finally escapes 
    from it.

-   Driver hits a tree and kills you too
    Many viruses are badly programmed.  If they crash inside the 
    emulator, chances are that the emulator crashes too.

Heuristic cleaners are so complicated that there is only one available right 
now.  However, the great potential of heuristic cleaning makes it likely that 
there will be more heuristic cleaners soon.


The pros and cons (of hitchhiking)

-   Advantages:
    -   No need to recognize mutants
    -   No problems with polymorphic viruses
    -   Can clean 'future' viruses
    -   User is less dependent on product updates
-   Disadvantages:
    -   No exact copy of the original
    -   It cleans everything: even clean files!


Being the author of the first heuristic cleaner I have received many reactions 
to it.  Most people were surprised that my cleaner was able to remove MTE 
viruses before my scanner was even able to recognize them.  This is especially 
interesting as most anti-virus products are still not able to remove MTE 
infections.

Of course everybody wants to know how many viruses can be removed this way.  I 
can't show a reliable figure, as testing a cleaner is an extremely tedious and 
time-consuming task.  However, a figure of 80% is a rough estimate.  Many 
conventional cleaners do not even come close to this percentage.

What can be expected from it in the future?

Heuristic cleaning needs additional improvements.  Some viruses use 
anti-debugger features that also make an emulator fail.  It is also still 
possible that a virus detects that it is being emulated, and it can simply 
refuse to cooperate.  The better the emulator performs, the less likely this 
is.  Major improvements however are more likely to show up after multiple 
heuristic cleaners are available and some competition occurs.

Frans Veldman,
Author of Thunderbyte Anti-Virus
Chief Executive of ESaSS B.V.