May, 06 1994 Technical information "heuristic scanning" Combatting viruses heuristically ================================ Generally speaking, there are two basic methods to detect viruses specific and generic. Specific virus detection requires the anti-virus program to have some pre-defined information about a specific virus (like a scan string). The anti-virus program must be frequently updated in order to make it detect new viruses as they appear. Generic detection methods however are based on generic characteristics of the virus, so theoretically they are able to detect every virus, including the new and unknown ones. Why is generic detection gaining importance? There are four reasons: 1) The number of viruses increases rapidly. Studies indicate that the total number of viruses doubles roughly every nine months. The amount of work for the virus researcher increases, and the chances that someone will be hit by one of these unrecognizable new viruses increases too. 2) The number of virus mutants increases. Virus source codes are widely spread and many people can't resist the temptation to experiment with them, creating many slightly modified viruses. These modified viruses may or may not be recognized by the anti-virus product. Sometimes they are, but unfortunately often they are not. 3) The development of polymorphic viruses. Polymorphic viruses like MTE and TPE are more difficult to detect with virus scanners. It is often months after a polymophic virus has been discovered before a reliable detection algorithm has been developed. In the meantime many users have an increased chance of being infected by that virus. 4) Viruses directed at a specific organization or company. It is possible for individuals to utilize viruses as weapons. By creating a virus that only works on machines owned by a specific organization or company it is very unlikely that the virus will spread outside of the organization. Thus it is very unlikely that any virus scanner will be able to detect the virus before the payload of the virus does its destructive work and reveals itself. Each of these scenarios demonstrates the fact that virus scanners can not recognize a virus until the virus has been discovered and analyzed by an anti-virus vendor. These same scenarios do not hold true for generic detectors, therefore many people are becoming more interested in generic anti-virus products. Of the many generic detection methods, heuristic scanning is currently becoming the most important. Heuristic scanning ------------------ One of the most time consuming tasks that a virus researcher faces is the examination of files. People often send files to researchers because they believe the files are infected by a new virus. Sometimes these files are indeed infected, sometimes not. Every researcher is able to determine very quickly what is going on by loading the suspected file into a debugger. A few seconds is often enough, and many researchers must have asked themselves: "How can I determine this so quickly"? One time I demonstrated this effect to the audience on an international conference. I showed the first page of the assembly listing of a MTE-infected file, and within about a second, Vesselin Bontchev came with the correct answer. How is this possible? Artificial intelligence Some of the many differences between viruses and normal programs is that normal programs typically start searching the command line for options, clearing the screen, etc. Viruses however never search for command line options or clear the screen. Instead they start with a search for other executable files, by writing to the disk, or by decrypting themselves. A researcher who has loaded the suspected file into a debugger can notice this difference in only a glance. Heuristic scanning is an attempt to put this experience and knowledge into a virus scanner. The word 'heuristic' means (according to a Dutch dictionary) 'the self finding' and 'the knowledge to determine something in a methodic way'. A heuristic scanner is a type of automatic debugger or disassembler. The instructions are disassembled and their purposes are determined. If a program starts with the sequence MOV AH,5 INT 13h which is a disk format instruction for the BIOS, this is highly suspected, especially if the program does not process any command line options or interact with the user. Suspected abilities In reality, heuristics is much more complicated. The heuristic scanners that I am familiar with are able to detect suspicious instruction sequences, like the ability to format a disk, the ability to search for other executables, the ability to remain resident in memory, the ability to issue non-standard or undocumented system calls, etc. Each of these abilities has a value assigned to it. The values assigned to the various suspicious abilities are dependant on various fact. A disk format routine doesn't appear in many normal programs, but often in viruses. So it gets a high value. The abilities to remain resident in memory are found in many normal programs, so despite of the fact that they also appear in many viruses it doesn't get a high value. If the total of the values for one program exceeds a predefined treshold, the scanner yells "Virus!". A single suspected ability is never enough to trigger the alarm. It is always the combination of the suspected abilities which convince the scanner that the file is a virus. Heuristic flags Some scanners set a flag for each suspected ability which has been found in the file being analyzed. This makes it easier to explain to the user what has been found. TbScan for instance recognizes many suspected instruction sequences. Every suspected instruction sequence has a flag assigned to it: Flag Description ---- ----------- F = Suspicious file access. Might be able to infect a file. R = Relocator. Program code will be relocated in a suspicious way. A = Suspicious Memory Allocation. The program uses a non-standard way to search for, and/or allocate memory. N = Wrong name extension. Extension conflicts with program structure. S = Contains a routine to search for executable (.COM or .EXE) files. # = Found an instruction decryption routine. This is common for viruses but also for some protected software. E = Flexible Entry-point. The code seems to be designed to be linked on any location within an executable file. Common for viruses. L = The program traps the loading of software. Might be a virus that intercepts program load to infect the software. D = Disk write access. The program writes to disk without using DOS. M = Memory resident code. This program is designed to stay in memory. ! = Invalid opcode (non-8088 instructions) or out-of-range branch. T = Incorrect timestamp. Some viruses use this to mark infected files. J = Suspicious jump construct. Entry point via chained or indirect jumps. This is unusual for normal software but common for viruses. ? = Inconsistent exe-header. Might be a virus but can also be a bug. G = Garbage instructions. Contains code that seems to have no purpose other than encryption or avoiding recognition by virus scanners. U = Undocumented interrupt/DOS call. The program might be just tricky but can also be a virus using a non-standard way to detect itself. Z = EXE/COM determination. The program tries to check whether a file is a COM or EXE file. Viruses need to do this to infect a program. O = Found code that can be used to overwrite/move a program in memory. B = Back to entry point. Contains code to re-start the program after modifications at the entry-point are made. Very usual for viruses. K = Unusual stack. The program has a suspicious stack or an odd stack. TbScan would for instance output the following flags: Virus Heuristic flags ----- --------------- Jerusalem/PLO: FRLMUZ Backfont: FRALDMUZK Minsk_Ghost: FELDTGUZB Murphy: FSLDMTUZO Ninja: FEDMTUZOBK Tolbuhin: ASEDMUOB Yankee_Doodle: FN#ELMUZB The more flags that are triggered by a file, the more likely it is that the file is infected by a virus. Normal programs rarely trigger even one flag, while at least two flags are required to trigger the alarm. To make it more complicated, not all flags carry the same 'weight'. False positives --------------- Just like all other generic detection techniques, heuristic scanners sometimes blame innocent programs for being contaminated by a virus. This is called a "False Positive" or "False Alarm". The reason for this is simple. Some programs happen to have several suspected abilities. For instance, the LOADHI.COM file of QEMM has the following suspected abilities (according to an older, yet obsolete version of TbScan): A = Suspicious Memory Allocation. The program uses a non-standard way to search for, and/or allocate memory. M = Memory resident code. This program may be a TSR but also a virus. U = Undocumented interrupt/DOS call. The program might be just tricky but can also be a virus using a non-standard way to detect itself. Z = EXE/COM determination. The program tries to check whether a file is a COM or EXE file. Viruses need to do this to infect a program. O = Found code that can be used to overwrite/move a program in memory. All of these abilities are available in LoadHi, and the flags are enough to trigger the heuristic alarm. As LoadHi is supposed to allocate upper memory, load resident programs in memory, move them to upper memory, etc., all these suspected abilities can easily be explained and verified. However, the scanner is not able to know the intended purpose of the program, and as most of these suspected abilities are often found in viruses, it just describes the LoadHi program as "a possible virus". How serious is the issue of false alarms? If a heuristic scanner pops up with a message saying: "This program is able to format a disk and it stays resident in memory", and the program is a resident disk format utility, is this really a false alarm? Actually, the scanner is right. A resident format utility obviously contains code to format a disk, and it contains code to stay resident in memory. The heuristic scanner is therfore completely right! You could name it a false suspicion, but not a false positive. The only problem here is that the scanner says that it might be a virus. If you think the scanner tells you it has found a virus, it turns out to be a false alarm. However, if you take this information as is, saying 'ok, the facts you reported are true for this program, I can verify this so it is not a virus', I wouldn't count it as a false alarm. The scanner just tells the truth. The main problem here is the person who has to make decisions with the information supplied by the scanner. If it is a novice user, it is a problem. More about that later. Avoiding false positives Whether we call it a false positive or a false suspicion doesn't matter. We do not like the scanner to yell every time we scan. So we need to avoid this situation. How do we achieve this? 1) Definition of (combinations of) suspicious abilities The scanner does not issue an alarm unless at least two separate suspected program abilities have been found. 2) Recognition of common program codes Some known compiler codes or run time compression or decryption routines can cause false alarms. These specific compression or decryption codes can be recognized by the scanner to avoid false alarms. 3) Recognition of specific programs Some programs which normally cause a problem (like the LoadHi program used in the example) can be recognized by the heuristic scanner. 4) Assumption that the machine is initially not infected Some heuristic scanners have a 'learn' mode, i.e. they are able to learn that a file causing a false alarm is not a virus. Dealing with false positives Some false positives are not easily avoided. So, the user has to deal with a certain amount of false alarms, and must make the final decision as to whether a file is infected or not. Ok, you may say, how do we know whether a suspicious program is a virus or innocent. There is no way to find out, that is what most people believe. Actually there is a way to find out, but this depends on the scanner. The scanner has to explain to the user the reasons why the program is suspect. 'This file might contain a virus' actually doesn't say much to the user. It is always right. Every file MIGHT contain a virus, but MAY also be clean. We actually use a scanner to find out! What is the user supposed to do with this information? However, if the scanner says that some program is able to remain resident in memory and able to format a disk, the user can more easily figure out what is going on. If a word processor gives such an alarm, it is extremely likely that the program carries a virus, because word processors generally are not able to format disks and remain resident in memory. However, if the suspected file is a resident disk formatting utility, then all of the suspected abilities can be explained by the intended purpose of the program. Reason for suspicion: memory resident and disk formatting abilities. Program Probably ------------------------ -------- Resident disk formatter No Virus (innocent) Word processor Malicious (virus) Both programs cause the same heuristic alarms, but the final conclusion is different. Naturally, it requires an advanced user to draw a conclusion for the question "infected or not?". However, my opinion is that judging the results of any scanner (also conventional scanners) is a task for an advanced user only. If the scanner has a 'learn' mode, i.e. is able to remember which programs cause a false alarm, the initial scan should be performed by an advanced user, but the subsequent scans (when the possible false positives have been eliminated) can be performed by a novice user. This is already common practice in most organizations. Anyway, it isn't as bad as it seems, as all other detection methods (including signature scanning) are known to cause some false alarms as well. Heuristics however has the advantage that it is able to supply you with enough information to establish for yourself whether a suspected file is likely a virus or not. How does heuristic scanning perform? ------------------------------------ Heuristics is a relatively new technique and still under development. It is however gaining importance rapidly. This is not surprising as heuristic scanners are able to detect over 90% of the viruses without using any predefined information like signatures or checksum values. The amount of false positives depends on the scanner, but a figure as low as 0.1% can be reached easily. TbScan 6.02 used on the large virus collection of Vesselin Bontchev showed the following results: Scanning 7210 detection method files percentage ------------- ----- ---------- Conventional 7056 97.86% Heuristics 6465 89.67% A false positive test however is more difficult to perform so there are no independent results available. Combination of conventional and heuristic scanning -------------------------------------------------- Some people think heuristic scanning is a replacement for conventional scanning. In my opinion it is not. Heuristic scanning serves a very useful purpose when used in combination with conventional scanning. The results of both scanning methods can be validated by each other, thereby reducing false positives and also false negatives. Combined result of analysis: Heuristic Conventional Probability clean clean very probably clean clean virus might be a false positive virus clean might be a false negative virus virus very probably infected fn: 10% fn: 1% combined false negatives: 0.1% fp: 0.1% fp: 0.001% combined false positives: 0.00001% The chances of both the heuristic scanner and the conventional scanner failing is minimal. If both scanning methods have the same results, the result is almost certain. In the few cases that the results don't agree with each other additional analysis is required. TbScan 6.02 used on the large virus collection of Vesselin Bontchev showed the following results: Scanning 7210 detection method files percentage ------------- ----- ---------- Conventional 7056 97.86% Heuristics 6465 89.67% Combined 7194 99.78% What can be expected from it in the future? ------------------------------------------- The development continues Most anti-virus developers still do not supply a ready-to-use heuristic analyzer. Those who have heuristics already available are still improving it. It is however unlikely that the detection rate will ever reach 100% without a certain amount of false positives. On the other hand it is unlikely that the amount of false positives will ever reach 0%. Maybe you wonder why it isn't possible to achieve 100% correct results. There is a large grey area between viruses and non-viruses. Even for humans it is hard to describe what a virus is or not, an often used definition of a computer virus is this: "A virus is a program that is able to copy itself". According to this definition the DiskCopy.Com program is a virus... Reaction of virus writers An important issue is the effect on virus writers. It is likely that they will try to avoid detection by heuristic scanners. Until now the goal was to avoid detection by signature scanners, and this was very easy to do, as it was sufficient to modify only a small part of an existing virus. Teenagers with some basic understanding of programming could do so easily . Avoiding heuristic scanners however requires a lot more knowledge, if even possible at all. Fortunately, this detection-avoiding method of programming makes detection by conventional anti-virus products easier because it means that the programmer can not use very tight and straight code. The virus writer will be forced to write more complex viruses. The pro's and con's of heuristic scanning ----------------------------------------- - Advantages: - Can detect 'future' viruses - User is less dependant on product updates - Disadvantages: - False positives are possible - Judgement of the result requires some basic knowledge Heuristic cleaning ================== Before we can discuss heuristic cleaning, it is important to know how a virus infects a program. The basic principle is not difficult. A virus - a program by itself - adds itself to the end of the program. The size of the program increases due to this addition of the viral code. Appending a virus program to another program is however not enough, the virus code should also be executed. To make this happen, the virus overwrites the first bytes of the file with a 'jump' instruction, which makes the processor jump to the viral code. The virus now gains control when the program is invoked, and it will finally pass control to the original program. Since the first bytes of the file are overwritten by the jump instruction, the virus has to 'repair' these bytes first. After that the virus just jumps to the beginning of the original program, and most often this program works as usual. original program infected program +--------------+ +--------------+ | p | 100: |jump | | r | |to 2487 | | o | | o | | g | | g | | r | | r | | a | | a | | m | | m | | | | | | c | | c | | o | | o | | d | | d | | e | | e | | | | | +--------------+ +--------------+ 2487: | | | VIRUS! p | | r | |jmp 100 | +--------------+ To clean an infected program, it is of vital importance to restore the bytes being overwritten by the jump to the virus code. The virus has to restore these bytes also, so somewhere in this virus code these original bytes are stored. The cleaner searches for those bytes, puts them back in their original location, and truncates the file to the original size. How does a conventional cleaner work? A conventional cleaner has to know which virus to remove. Suppose your system is infected with the Jerusalem/PLO virus. You invoke your cleaner and it proceeds like this: "Hey, this file is infected with the Jerusalem/PLO virus. OK, this virus is 1873 bytes in size, and it overwrites the first three bytes of the original program with a jump to itself. The original bytes are located at offset 483 in the viral code. So, I have to take those bytes, copy them to the beginning of the file, and I have to remove 1873 bytes of the file. That's it!" Shortcomings of conventional cleaners The cleaner has to know the virus it has to remove. It is impossible to remove an unknown virus. The virus should be the same as the virus known to the cleaner. Imagine what whould happen if the virus used in the example was modified and now 1869 bytes in size instead of 1873... the cleaner would remove too much! This is not an exception, but it happens quite often since there are so many mutants. For instance, the Jerusalem/PLO family now contains more than 100 mutants! Many polymorphic viruses have variable lengths and maintain the original instructions encrypted. Most conventional cleaners are therefore unable to clean MTE infected programs. The virus will remove itself before actual execution We have seen above how a virus works. The interesting part is that when the virus passes control to the original program it restores the original bytes at the beginning of the program and jumps back to start the program. Every virus is able to repair the original program in order to keep it functional (except for overwriting viruses, but these can't be cleaned anyway). Let the virus do the dirty work The idea is: why not let do the virus the dirty work? The basic principle of heuristic cleaning is simple. The heuristic cleaner loads the infected file and starts emulating the program code. It uses a combination of disassembly, emulation and sometimes execution to trace the flow of the virus, and to emulate what the virus is normally doing. When the virus restores the original instructions and jumps back to the original program code, the cleaner stops the emulation process, and says 'thank you' to the virus for its cooperation in restoring the original bytes. The now repaired start of the program is copied back to the program file on disk, and the part of the program that gained 'execution' will be chopped off. An additional analysis of the cleaned program file will be performed to be on the safe side. Note that the cleaner is actually removing the unknown from the unknown. No predefined information about the virus or infected file is necessary. The process of emulation is just like hitchhiking. The emulator convinces the viral code that it is actually executing, and it hitchhikes to the point where the virus passes control to the original program. However, the actual process is very complicated. As with hitchhiking, many things can go wrong: - Driver takes you to the wrong place The virus does not intend to execute the original program, but it starts doing completely different things. As the purpose of the emulation is to restore the original program, we never reach our goal. - Driver won't let you out If the viral code performs an endless loop, the origial program will never be restored so the cleaner might wait forever. - Driver leaves the car A potentially dangerous situation is that the cleaner is too ambitious in its task to emulate everything, and that the virus gets control inside the emulated environment and finally escapes from it. - Driver hits a tree and kills you too Many viruses are badly programmed. If they crash inside the emulator, chances are that the emulator crashes too. Heuristic cleaners are so complicated that there is only one available right now. However, the great potential of heuristic cleaning make it likely that there will be more heuristic cleaners soon. The pro's and con's (of Hitchhiking) - Advantages: - no need to recognize mutants - no problems with polymorphic viruses - can clean 'future' viruses - user is less dependant on product updates - Disadvantages: - No exact copy of the original - It cleans everything: even clean files! Being the author of the first heuristic cleaner I have received many reactions to it. Most people were surprised that my cleaner was able to remove MTE viruses before my scanner was even able to recognize them. This is especially interesting as most anti-virus products are still not able to remove MTE infections. Of course everybody wants to know how many viruses can be removed this way. I can't show a reliable figure, as testing a cleaner is extremely tedious and time consuming task. However, a figure of 80% is a rough estimate. Many conventional cleaners do not even come close to this percentage. What can be expected from it in the future? Heuristic cleaning needs additional improvements. Some viruses use anti-debugger features that also make an emulator fail. It is also still possible that a virus detects that it is being emulated, and it can simply refuse to cooperate. The better the emulator performs, the less likely this is. Major improvements however are more likely to show up after multiple heuristic cleaners are available and some competition occurs. Frans Veldman, Author of Thunderbyte Anti-Virus Chief Executive of ESaSS B.V.