- Newsgroups: comp.software-eng
- Path: sparky!uunet!spool.mu.edu!sgiblab!munnari.oz.au!metro!usage!syacus!paulb
- From: paulb@syacus.acus.oz.au (Paul Bandler)
- Subject: Value of Code Coverage Analysis Metrics - Summary
- Message-ID: <1992Dec21.061509.25208@syacus.acus.oz.au>
- Organization: ACUS Australian Centre for Unisys Software, Sydney
- Date: Mon, 21 Dec 1992 06:15:09 GMT
- Lines: 1115
-
-
- Sender: news@lmpsbbs.comm.mot.com (Net News)
- Organization: Motorola Land Mobile Products Sector
- Lines: 55
- Nntp-Posting-Host: 145.16.3.73
-
- In article <1992Dec14.072812.13689@syacus.acus.oz.au>, paulb@syacus.acus.oz.au (Paul Bandler) writes:
-
- <preamble deleted>
-
- |> I believe we have a tool to measure the %BFA 'Branch Flow Analysis' but
- |> of course the engineers are responsible for producing the test cases to
- |> exercise the code.
- |>
- |> I have 3 questions:-
- |>
- |> 1) Do people think that this is a valuable metric?
- |> 2) Is it a cost effective exercise to get engineers to achieve a particular
- |>    %BFA as a completion criterion?
- |> 3) What is a realistic %BFA to aim for?
- |> Paul Bandler
-
- BFA is a valuable tool for testing. %BFA, like all unitless numbers, is
- of a more dubious nature.
-
- %BFA does not tell the person looking at it which paths were not tested
- and why. This is important because engineers could, subconsciously or
- otherwise, use their quota of untested branches to avoid testing the more
- complex/niggly areas of code.
-
- The nature of the testing is also important - executing every line of code
- is useful by itself only if you are programming in an interpreted language
- where syntax must be checked by execution. (You can run through every branch
- of a square root function but it won't test what happens with a negative
- parameter. There is always a temptation to say that a line 'i++' passes
- 'because it increments i' without checking the cases where it shouldn't.)
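-
- To illustrate with a made-up C fragment (not code from any of the projects
- discussed here), a square root routine can reach full branch coverage without
- its behaviour for a negative parameter ever being examined:
-
-     #include <math.h>
-
-     /* Newton-Raphson square root.  Every branch in this function is
-        covered by tests with positive inputs alone. */
-     double my_sqrt(double x)
-     {
-         double guess = (x > 1.0) ? x / 2.0 : 1.0;
-         int i;
-
-         for (i = 0; i < 50; i++)
-             guess = (guess + x / guess) / 2.0;
-         return guess;
-     }
-
-     /* my_sqrt(4.0) and my_sqrt(0.25) between them take every branch,
-        yet say nothing about my_sqrt(-1.0), which quietly returns a
-        meaningless value.  Coverage cannot ask for the test that the
-        specification needs. */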
-
- As to cost effectiveness, this is so dependent upon the available tools,
- maintenance costs, whether or not testability was built into the code, etc.
- that any global statement would be rash.
-
- A realistic %BFA target is again dependent upon specific circumstances.
- I don't know if anyone has done any work to find out the percentage of
- defects found against testing coverage and time taken, but I would
- be interested to find out. The benefits of higher percentages are not
- linear and certainly peak.
-
- Setting %BFA as a part of completion criteria is probably worthwhile, but
- getting dogmatic about it is probably not. BFA is very useful for helping
- engineers see where more testing is required; it is not that useful as an
- exam mark for the testing.
-
- -- David
-
- ========= All opinions are mine and not necessarily Motorola's ============
- = @mail : David Alexander, Channel Tunnel Software, Motorola, Lyon Way, =
- = Camberley (ZUK20), Surrey GU15 3QG, U.K. =
- = Email : (Internet) davidal@comm.mot.com Motorola X400-gateway : CDA004 =
- = Telephone : (office) +44 (0)276-413340 (home)+44 (0)276-24249 =
- ===========================================================================
-
- Subject: >>> Value of High Code Coverage Metrics in Testing - Request for Opinion
- Sender: usenet@news.eng.convex.com (news access account)
- Message-ID: <ssimmons.724336081@convex.convex.com>
- Date: Mon, 14 Dec 1992 12:28:01 GMT
- Nntp-Posting-Host: pixel.convex.com
- Organization: Engineering, CONVEX Computer Corp., Richardson, Tx., USA
- X-Disclaimer: This message was written by a user at CONVEX Computer
- Corp. The opinions expressed are those of the user and
- not necessarily those of CONVEX.
- Lines: 27
-
-
- Performing test coverage analysis on code is more a matter of good attitude than
- a silver bullet for removing bugs. There is an old management saying,
- "What gets inspected is what gets done".
-
- > 1) Do people think that this is a valuable metric?
-
- Again, it improves the code but it does not remove all bugs. Design flaws -
- cases that are simply not handled - cannot be found. Also, any code that uses
- data-driven tables (e.g. finite state parsers) cannot be measured effectively.
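-
- To make the data-driven point concrete (a hypothetical sketch, not CONVEX code):
- a table-driven recognizer can have all of its dispatch code covered while most
- of its table entries are never exercised.
-
-     /* Transitions live in data, not in branches, so a coverage tool
-        cannot see which of them have been tried. */
-     enum { START, IN_NUM, DONE, NSTATES };
-     enum { DIGIT, SPACE, OTHER, NCLASSES };
-
-     static const int next_state[NSTATES][NCLASSES] = {
-         /* START  */ { IN_NUM, START, DONE },
-         /* IN_NUM */ { IN_NUM, DONE,  DONE },
-         /* DONE   */ { DONE,   DONE,  DONE },
-     };
-
-     static int classify(char c)
-     {
-         if (c >= '0' && c <= '9') return DIGIT;
-         if (c == ' ')             return SPACE;
-         return OTHER;
-     }
-
-     int scan(const char *s)
-     {
-         int state = START;
-
-         while (*s && state != DONE)
-             state = next_state[state][classify(*s++)];
-         return state;
-     }
-
-     /* The single input "1 " executes every line of scan() and both
-        outcomes of its loop test, yet uses only 2 of the 9 table
-        entries.  The coverage report says "done"; the table is not. */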
-
- > 2) Is it a cost effective exercise to get engineers to achieve a particular
- > %BFA as a completion criterion?
-
- Sure... if you have the time in the schedule and effective tools to do it.
- Usually, it is best to have people do coverage analysis on their own code
- and have people who don't know the code test it for unanticipated conditions.
-
- > 3) What is a realistic %BFA to aim for?
-
- A fairly low percentage - 50% is usually the maximum achievable value, since much
- code is assertion testing. However, every condition should be accounted for.
-
- Thank you.
-
-
- Steve Simmons
-
- From comp.software-eng Wed Dec 16 08:41:18 1992
- Newsgroups: comp.software-eng
- Path: syacus!usage!metro!munnari.oz.au!spool.mu.edu!sdd.hp.com!zaphod.mps.ohio-state.edu!menudo.uh.edu!sugar!claird
- From: claird@NeoSoft.com (Cameron Laird)
- Subject: Re: >>> Value of High Code Coverage Metrics in Testing - Request for Opinion
- Organization: NeoSoft Communications Services -- (713) 684-5900
- Date: Mon, 14 Dec 1992 14:16:41 GMT
- Message-ID: <Bz96Bu.EBw@NeoSoft.com>
- References: <ssimmons.724336081@convex.convex.com>
- Lines: 31
-
- In article <ssimmons.724336081@convex.convex.com> ssimmons@convex.com (Steve Simmons) writes:
- .
- .
- .
- >> 2) Is it a cost effective exercise to get engineers to achieve a particular
- >> %BFA as a completion criterion?
- >
- >Sure... if you have the time in the schedule and effective tools to do it.
- PARTICULARLY if you don't have time in the schedule.
- Try it; in my experience, people learn a lot from their
- first coverage exercises. It's far better to learn
- those things *before* shipping the product.
- >Usually, it is best to have people do coverage analysis on their own code
- >and have people who don't know the code test it for unanticipated conditions.
- >
- >> 3) What is a realistic %BFA to aim for?
- >
- >A fairly low percentage - 50% is usually the maximum achievable value, since much
- >code is assertion testing. However, every condition should be accounted for.
- 85%. Serious.
- .
- .
- .
- If you're lucky, Brian Marick will tune in to this
- conversation; he's the gentleman with the most experience
- and insight on this topic.
- --
-
- Cameron Laird
- claird@Neosoft.com (claird%Neosoft.com@uunet.uu.net) +1 713 267 7966
- claird@litwin.com (claird%litwin.com@uunet.uu.net) +1 713 996 8546
-
- To: paulb@syacus.acus.oz.au
- Subject: Re: Value of High Code Coverage Metrics in Testing - Request for Opinion
- Newsgroups: comp.software-eng
- References: <1992Dec14.072812.13689@syacus.acus.oz.au>
- Status: RO
-
- In comp.software-eng you write:
- >I have 3 questions:-
-
- >1) Do people think that this is a valuable metric?
- >2) Is it a cost effective exercise to get engineers to achieve a particular
- >   %BFA as a completion criterion?
- >3) What is a realistic %BFA to aim for?
- >Paul Bandler
-
- Brian Marick has written a couple of papers that answer your questions;
- they are "Experience with the Cost of Test Suite Coverage Measures" and
- "Three Ways to Improve Your Testing". They discuss various types of code
- coverage, and how useful they are. Both are available by anonymous ftp
- from <something>.cs.uiuc.edu. Unfortunately, I don't remember what the
- <something> is; however, you could mail him at marick@cs.uiuc.edu, and
- I'm sure he'd be glad to tell you how to get them.
-
- --Samuel Bates
- samuel@cs.wisc.edu
-
- Date: Mon, 14 Dec 92 09:25:53 PST
- From: Todd Huffman <huffman@yoko.STAT.ORST.EDU>
- Message-Id: <9212141725.AA21682@yoko>
- To: paulb%syacus.acus.OZ.AU
- Subject: Re: Value of High Code Coverage Metrics in Testing - Request for Opinion
- Newsgroups: comp.software-eng
- In-Reply-To: <1992Dec14.072812.13689@syacus.acus.oz.au>
- Organization: Oregon State University Math Department
- Cc:
- Status: RO
-
- Branch coverage metrics are quite useful, and they are accepted by the
- software engineering community at large as being useful. The only
- consideration to work out is efficiency of getting these numbers and
- cost vs. benefit for the organization.
-
- I heard a good comment from a seminar by Tsun Chow (he's at AT&T,
- Naperville, Ill.). If your line coverage is not 100% then you
- are shipping code that has never been executed. Sounds risky!
- If your branch coverage is not 100% then you are shipping
- branches that have never been taken. Also risky.
-
- My experience is that this sort of metric collection must be automated
- if it is to be cost effective. It must be done at the UNIT test stage.
- Something in the 80-100% range would be useful. If programmers do not have to
- get 80% coverage, then you will have some of them release barely tested
- modules when the schedule gets tight.
-
- This sort of metric must be put in perspective with the whole QA program.
- Even with 100% coverage there will remain bugs. I think it is important
- to track bugs discovered by the integration/system test group per module.
- Then you will know which programmers have released buggy code to test.
- The 80% (or whatever you choose) branch coverage level should prevent
- the very bad code from being released, and the subsequent code churn.
-
- Here is a reference where test coverage metrics are used quite well--
- "Experience in Testing the Motif Interface", Jason Su, Paul Ritter
- (they're at Hewlett-Packard). March 1991, IEEE Software.
- That whole issue is devoted to testing--other articles are also good.
-
- That's all for my 2 cents' worth.
- Todd Huffman
-
-
- Received: from hotel.mitre.org.w151_sparc by milner.mitre.org.w151_sparc (4.1/SMI-4.1)
- id AA16924; Mon, 14 Dec 92 12:06:24 EST
- Date: Mon, 14 Dec 92 12:06:24 EST
- From: drodman@milner.mitre.org (David B Rodman)
- Message-Id: <9212141706.AA16924@milner.mitre.org.w151_sparc>
- To: paulb@syacus.acus.oz.au
- Subject: A Brian Marick testing case study (long)
- Status: RO
-
-
- ------- Forwarded Message
-
- [Brian Marick's case study, "A CASE STUDY IN COVERAGE TESTING", was forwarded
- here in full.  The identical text, with a later note on memory-checking tools,
- appears in Marick's own reply further down in this summary.]
-
- ------- End of Forwarded Message
-
-
- From drodman@milner.mitre.org@usage.csd.unsw.oz Tue Dec 15 04:05:25 1992
- From: drodman@milner.mitre.org (David B Rodman)
- Message-Id: <9212141702.AA16882@milner.mitre.org.w151_sparc>
- To: paulb@syacus.acus.oz.au
- Subject: Value of High Code Coverage Metrics in Testing
- Cc: drodman@milner.mitre.org
- Status: RO
-
- >I have 3 questions:-
-
- >1) Do people think that this is a valuable metric?
- >2) Is it a cost effective exercise to get engineers to achieve a particular
- >   %BFA as a completion criterion?
- >3) What is a realistic %BFA to aim for?
-
- 1) (is a valuable metric?)
- My personal opinion is that it is.  However, it must be taken within the
- context of a comprehensive test program and not used as the sole metric.  Also, the
- metric should not be abused by the development engineers as a substitute for thought!
- What I mean by this is that the drive to achieve high branch coverage sometimes causes
- engineers to design simple tests which get high coverage but miss errors.
-
- example pseudo code:
-      if value greater than 0.0 then
-          next value = 1 / sqr(value)    -- error: should be sqroot()
-      end_if
-
- Simple test cases would be 1, -1 and maybe 0.0. However, all of these test cases
- produce the same result as the correct code, leaving what could be a very hard
- to find error to s/w integration.
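-
- Written out in C (hypothetical code), a coverage-driven test takes the branch
- without distinguishing the wrong operation from the right one; a value such as
- 4.0 would:
-
-     #include <math.h>
-
-     double next_value(double value)
-     {
-         if (value > 0.0)
-             return 1.0 / (value * value);  /* error: should be 1.0 / sqrt(value) */
-         return 0.0;
-     }
-
-     /* Tests with 1.0, -1.0 and 0.0 cover both directions of the branch,
-        but on all three the buggy and the correct code agree: 1/(1*1)
-        equals 1/sqrt(1), and the branch is not taken for -1.0 or 0.0.
-        A test such as next_value(4.0) - expecting 0.5 but getting
-        0.0625 - exposes the error. */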
-
- 2) (Is it a cost effective exercise?)
- I do not have any numbers available, but I have a copy of Brian Marick's
- (marick@cs.uiuc.edu) "A CASE STUDY IN COVERAGE TESTING" (previously posted).
- He has done some measurements based on his own programming habits.  I will forward it as
- a separate posting.
-
- 3) (What is a realistic %BFA to aim for?)
- I would make it 100%, then make exceptions on a case-by-case basis.
- Exception handling code, spurious error paths and other conditions that generally
- cannot be produced are good candidates for exceptions.
-
-
- I have some interest in other opinions/answers to your questions. Could you
- please return/post a summary of the responses to your questions.
- Thanks
-
-
- David Rodman
- MITRE Corporation
- Mclean, VA
- drodman@mitre.org
-
- From Michael_P._Kirby.roch803@xerox.com@usage.csd.unsw.oz Tue Dec 15 04:02:29 1992
- Date: Mon, 14 Dec 1992 08:04:38 PST
- From: Michael_P._Kirby.roch803@xerox.com
- Subject: Re: Value of High Code Coverage Metrics in Tes
- To: paulb@syacus.acus.oz.au
- Cc: kirby.roch803@xerox.com
- Reply-To: Michael_P._Kirby.roch803@xerox.com
- Message-Id: <"14-Dec-92 11:04:38 EST".*.Michael_P._Kirby.roch803@Xerox.com>
- Status: RO
-
- Received: by milo.isdl (4.1/SMI-4.0) id AA07499; Mon, 14 Dec 92 11:04:33 EST
-
- Paul,
-
- Do you guys already practice software inspections? From my experience it is
- very good practice to hold formal software inspections. There are many good
- articles that describe both the process and the results:
-
- Russell, "Experience with Inspection in Ultralarge-Scale Developments," IEEE Software,
- January 1991.
-
- Fagan, "Design and code inspections to reduce errors in program development," IBM
- Systems Journal, No. 3, 1976, pp. 184-211.
-
- Fagan, "Advances in Software Inspections," IEEE Transactions on Software Engineering,
- July 1986, pp. 744-751.
-
- These are just a couple.
-
- I have talked to several people who say that once good inspections are in place,
- unit testing becomes unnecessary. System testing, of course, is still very important.
-
- As for BFA, I don't have any experience with it, but I'm of the philosophy that
- testing by exhaustion is not feasible.  Therefore BFA only tells you how close you are
- to a reasonable test coverage.  Perhaps an alternative approach is to apply some kind
- of reliability growth modeling.  (John Musa has written several books on the subject.)
- Here the idea is that we test based on a customer operational profile (i.e. we
- test features, not code paths).  We then model the number of "failures" that a customer
- will see.  From this we can set a threshold of failures/unit time that is "acceptable"
- to the customer.  At this point the product is ready to ship.
-
- After shipping the product we can do reliability modeling to determine how
- good our integration testing really was.
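-
- (For a flavour of what such a model looks like - quoting Musa's basic
- execution-time model from memory, so check his books for the details - the
- failure intensity after an amount tau of execution time under the operational
- profile is roughly
-
-     lambda(tau) = lambda0 * exp(-(lambda0 / nu0) * tau)
-
- where lambda0 is the initial failure intensity and nu0 the total number of
- failures expected over the software's life; testing continues until lambda
- falls below the failures-per-unit-time objective agreed with the customer.)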
-
- These are just other ideas.
-
- Mike Kirby
- Xerox Corp
- E-mail: kirby.roch803@xerox.com
-
- From lawrence@uk.ate.slb.com@usage.csd.unsw.oz Wed Dec 16 04:07:02 1992
- Date: Tue, 15 Dec 92 14:11:24 GMT
- From: lawrence%ukfca1@sj.ate.slb.com
- Message-Id: <9212151411.AA07293@juniper.uk.ate.slb.com>
- To: paulb@syacus.acus.oz.au
- Subject: Code coverage metrics.
- Status: RO
-
-
- Paul,
-
- At the current site where I am working the Sun utility 'tcov' is used
- to give a simple metric for test coverage.
-
- As I understand it tcov gives a form of BFA based on statements.
-
- The powers that be here have decided that all code will be tested to
- give a minimum of 90% coverage.
-
- I don't know how the figure 90% was reached, I suspect it just sounded good.
-
- All the same, looking at the history of software engineering at this
- site, including a "test coverage" criterion in the engineering standards
- has been beneficial.
-
- The main benefit has been that engineers now think about coverage when
- designing the tests.
-
- (Yes, unit testing is done by the engineers here).
-
- The downside is that the engineers are aware of the (probably arbitrary)
- coverage figure when designing and coding, and this affects their efforts.
- In particular, engineers are encouraged to introduce a lot of executable
- code (rather than making it table/data driven) since they know that this
- will increase the amount of code which will be exercised by "easy" test
- cases, and the proportion of code which is hard to test will be relatively
- small. If they can push the hard-to-test code into the 10% which doesn't
- have to be tested, it makes their job easier/quicker.
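-
- (With made-up numbers, the arithmetic behind that incentive: a module with 80
- trivially exercised statements and 20 hard-to-reach ones needs half of the hard
- ones tested to reach 90% (90/100).  Pad the module with 100 more easily covered
- statements and the easy tests alone already give 180/200 = 90% - the hard code
- never has to be touched at all.)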
-
- Of course the brownie points are awarded for speed rather than quality.
-
- I hope this is of interest to you.
-
- I would appreciate a summary of what you learn. Maybe you could mail
- me, as I don't always have access to the news feed.
-
- (PS: I forgot to mention - this is in response to Article 6755 in
- comp.software-eng.)
-
- Glenn
- < lawrence@uk.ate.slb.co >
-
- From philip@mama.research.canon.oz Tue Dec 15 18:02:17 1992
- Date: Tue, 15 Dec 92 17:45:29 EST
- From: philip@research.canon.oz.au (Philip Craig)
- Message-Id: <9212150645.AA01582@denis.research.canon.oz.au>
- To: paulb@syacus.acus.oz.au
- Subject: Re: Value of High Code Coverage Metrics in Testing - Request for Opinion
- Newsgroups: comp.software-eng
- In-Reply-To: <1992Dec14.072812.13689@syacus.acus.oz.au>
-
- Organization: Canon Information Systems Research Australia
- Status: RO
-
- In article <1992Dec14.072812.13689@syacus.acus.oz.au> you write:
- >I have been tasked with the Quality Assurance of a Product Development. One of
- >the metrics I have been asked to measure is the 'Branch Flow Analysis' (BFA)
- >percentage achieved during unit and system testing, i.e. how many of the
- >potential paths through the code are actually exercised during these
- >testing phases.
- >
- >I believe we have a tool to measure the %BFA but of course the engineers are
- >responsible for producing the test cases to exercise the code.
- >
- >I have 3 questions:-
- >
- >1) Do people think that this is a valuable metric?
- >2) Is it a cost effective exercise to get engineers to achieve a particular
- >   %BFA as a completion criterion?
- >3) What is a realistic %BFA to aim for?
-
- Hi Paul,
-
- I'm responsible for some testing here at CISRA, so I'd be thrilled to get
- copies of any replies that you get (apart from me-too ones!).
-
- What are you using to measure BFA? GCT?
-
- It seems to me that it can be a valuable, objective, quantitative metric,
- particularly if the people writing the test suites are *not* initially
- striving for a particular BFA percentage.
-
- That is, have them write tests to test functionality, function points, or
- whatever, and then look at the BFA percentage. This can tell you whether
- a good portion of the functionality is being tested (where good depends
- very much on how much confidence you need to have in the product).
-
- Have I ever seen you at soccer? I know some ACUS people go. Are you the
- Paul who goes?
- --
- Philip Craig philip@research.canon.oz.au Phone:+61 2 8052951 Fax:+61 2 8052929
- "Now bid me run, and I will strive with things impossible -
- yea, and get the better of them."
- -- W. Shakespeare, JULIUS CAESAR
-
- From marick@hal.cs.uiuc.edu@usage.csd.unsw.oz Sat Dec 19 04:33:12 1992
- Date: Fri, 18 Dec 1992 11:30:36 -0600
- From: Brian Marick <marick@hal.cs.uiuc.edu>
- Message-Id: <199212181730.AA00537@hal.cs.uiuc.edu>
- To: paulb@syacus.acus.oz.au
- Subject: Re: Value of High Code Coverage Metrics in Testing - Request for Opinion
- Newsgroups: comp.software-eng
- References: <1992Dec14.072812.13689@syacus.acus.oz.au>
- Status: RO
-
- This may be of interest:
-
- A CASE STUDY IN COVERAGE TESTING
- Brian Marick
- Testing Foundations
-
- Abstract
-
- I used a C coverage tool to measure the quality of its own test suite.
- I wrote new tests to improve the coverage of a 2600-line segment of
- the tool. I then reused and extended those tests for the next
- release, which included a complete rewrite of that segment. The
- experience reinforced my beliefs about coverage-based testing:
-
- 1. A thorough test suite should achieve nearly 100% feasible coverage.
- 2. Adding tests for additional coverage can be cheap and effective.
- 3. To be effective, testing should not be a blind attempt to achieve
- coverage. Instead, use coverage as a signal that points to weakly-tested
- parts of the specification.
- 4. In addition to suggesting new tests, coverage also tells you when existing
- tests aren't doing what you think, a common problem.
- 5. Coverage beyond branch coverage is worthwhile.
- 6. Even with thorough testing, expect documentation, directed
- inspections, beta testing, and customers to find bugs, especially design
- and specification bugs.
-
- The Generic Coverage Tool
-
- GCT is a freeware coverage tool for C programs, based on the GNU C
- compiler. It measures these kinds of coverage:
- - branch coverage (every branch must be taken in both directions)
- - multi-condition coverage (in 'if (a && b)', both subexpressions must
- evaluate to true and false).
- - loop coverage (require loop not to be taken, to be traversed exactly once,
- and traversed more than once)
- - relational coverage (require tests for off-by-one errors)
- - routine entry and call point coverage.
- - race coverage (extension to routine coverage for multiprocessing)
- - weak mutation coverage (a research technique)
-
- (For more, see [Marick92].)
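-
- As a rough illustration of what these criteria ask for (a made-up C fragment,
- not GCT's own documentation or output):
-
-     void sketch(int a, int b, int n, int limit, int i, int max)
-     {
-         int total = 0;
-
-         if (a && b)          /* branch: must be taken both ways; multi-
-                                 condition: a and b must each be seen
-                                 true and false                          */
-             total++;
-
-         while (n < limit)    /* loop: not taken, taken exactly once,
-                                 and taken more than once                */
-             n++;
-
-         if (i <= max)        /* relational: also wants the boundary
-                                 case i == max, to catch an off-by-one
-                                 (i < max written for i <= max)          */
-             total += i;
-
-         (void) total;        /* routine/call coverage: sketch() itself
-                                 must be entered, and each call site of
-                                 it executed                             */
-     }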
-
- The tool comes with a large regression test suite, developed in
- parallel with the code, using a "design a little, test a little, code
- a little" approach, much like that described in [Rettig91]. About
- half the original development time was spent in test construction (with, I
- believe, a corresponding reduction in the amount of frantic debugging
- when problems were found by users - though of course there was some of
- that). Most of the tests are targeted to particular subsystems, but
- they are not unit tests. That is, the tests invoke GCT and deduce
- subsystem correctness by examining GCT's output. Only a few routines
- are tested in isolation using stubs - that's usually too expensive. When
- needed, test support code was built into GCT to expose its internal
- state.
-
- In early releases, I had not measured the coverage of GCT's own test
- suite. However, in planning the 1.3 release, I decided to replace the
- instrumentation module with two parallel versions. The original
- module was to be retained for researchers; commercial users would use
- a different module that wouldn't provide weak mutation coverage but
- would be superior in other ways. Before redoing the implementation, I
- wanted the test suite to be solid, because I knew a good test suite
- would save implementation time.
-
- Measuring Coverage
-
- I used branch, loop, multi-condition, and relational coverage. I'm
- not convinced weak mutation coverage is cost-effective. Here were my
- initial results for the 2617 lines of code I planned to replace.
- (The count excludes comments, blank lines, and lines including only
- braces.)
-
- BINARY BRANCH INSTRUMENTATION (402 conditions total)
- 47 (11.69%) not satisfied.
- 355 (88.31%) fully satisfied.
-
- SWITCH INSTRUMENTATION (90 conditions total)
- 14 (15.56%) not satisfied.
- 76 (84.44%) fully satisfied.
-
- LOOP INSTRUMENTATION (24 conditions total)
- 5 (20.83%) not satisfied.
- 19 (79.17%) fully satisfied.
-
- MULTIPLE CONDITION INSTRUMENTATION (390 conditions total)
- 56 (14.36%) not satisfied.
- 334 (85.64%) fully satisfied.
-
- OPERATOR INSTRUMENTATION (45 conditions total) ;; This is relational coverage
- 7 (15.56%) not satisfied.
- 38 (84.44%) fully satisfied.
-
- SUMMARY OF ALL CONDITION TYPES (951 total)
- 129 (13.56%) not satisfied.
- 822 (86.44%) fully satisfied.
-
- These coverage numbers are consistent with what I've seen using black
- box unit testing combined with judicious peeks into the code. (See
- [Marick91].) I do not target coverage in my test design; it's more
- important to concentrate on the specification, since many important
- faults will be due to omitted code [Glass81].
-
- When the uncovered conditions were examined more closely (which took
- less than an hour), it was clear that the tests were more thorough
- than appears from the above. The 129 uncovered conditions broke down
- as follows:
-
- 28 were impossible to satisfy (sanity checks, loops with fixed bounds
- can't be executed 0 times, and so on).
-
- 46 were support code for a feature that was never implemented (because
- it turned out not to be worthwhile); these were also impossible to
- exercise.
-
- 17 were from temporary code, inserted to work around excessive stack
- growth on embedded systems. It was always expected to be removed, so
- was not tested.
-
- 24 were due to a major feature, added late, that had never had
- regression tests written for it.
-
- 14 conditions corresponded to 10 untested minor features.
-
- All in all, the test suite had been pleasingly thorough.
-
- New Tests Prior to the Rewrite
-
- I spent 4 hours adding tests for the untested major feature. I was
- careful not to concentrate on merely achieving coverage, but rather on
- designing tests based on what the program was supposed to do. Coverage
- is seductive - like all metrics, it is only an approximation of what's
- important. When "making the numbers" becomes the prime focus,
- they're often achieved at the expense of what they're supposed to measure.
-
- This strategy paid off. I found a bug in handling switches within
- macros. A test designed solely to achieve coverage would likely have
- missed the bug. (That is, the uncovered conditions could have
- been satisfied by an easy - but inadequate - test.)
-
- There was another benefit. Experience writing these tests clarified
- design changes I'd planned to make anyway. Writing tests often has
- this effect. That's why it's good to design tests (and write user
- documentation) as early as possible.
-
- I spent two more hours testing the minor features. I did not write
- tests for features that were only relevant to weak mutation.
-
- Branch coverage discovered one pseudo-bug: dead code. A particular
- special case check was incorrect. It was testing a variable against
- the wrong constant. This check could never be true, so the special
- case code was never executed. However, the special case code turned
- out to have the same effect as the normal code, so it was removed.
- (This fault was most likely introduced during earlier maintenance.)
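-
- The shape of that pseudo-bug, in hypothetical C: a branch whose condition
- tests the wrong constant, which coverage duly reports as never taken.
-
-     enum kind { KIND_IF = 1, KIND_SWITCH = 2 };
-
-     /* How many probes to plant for a statement with 'arms' branches. */
-     int probe_count(enum kind k, int arms)
-     {
-         /* Intended to special-case switches, but an earlier edit left
-            it testing the wrong constant; callers only reach here with
-            KIND_SWITCH, so the condition can never be true - exactly
-            what the branch-coverage report shows.  Since the special
-            case computes the same value as the normal code anyway, the
-            whole branch can simply be deleted. */
-         if (k == KIND_IF && arms > 1)
-             return arms;
-
-         return (arms > 1) ? arms : 1;    /* normal case */
-     }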
-
- At this point, tests written because of multi-condition, loop, and relational
- coverage revealed no bugs. My intuitive feel was that the tests
- were not useless - they checked situations that might well have led to
- failures, but didn't.
-
- I reran the complete test suite overnight and rechecked coverage the
- next day. One test error was discovered; a typo caused the test to
- miss checking what it was designed to test. Rechecking took 1/2 hour.
-
- Reusing the Test Suite
-
- The rewrite of the instrumentation module was primarily a
- re-implementation of the same specification. All of the test suite
- could be reused, and there were few new features that required new
- tests. (I did go ahead and write tests for the weak mutation features
- I'd earlier ignored.) About 20% of the development time was spent on
- the test suite (including new tests, revisions to existing tests, and
- a major reorganization of the suite's directory structure and
- controlling scripts).
-
- The regression test suite found minor coding errors; they always do,
- in a major revision like this. It found no design flaws. Rewriting
- the internal documentation (code headers) did. (After I finish code,
- I go back and revise all the internal documentation. The shift in
- focus from producing code to explaining it to an imaginary audience
- invariably suggests improvements, usually simplifications. Since I'm
- a one-man company, I don't have the luxury of team code reads.)
-
- The revised test suite achieved 96.47% coverage. Of 37 unsatisfied
- conditions:
-
- 27 were impossible to satisfy.
- 2 were impossible to test portably (GNU C extensions).
- 2 were real (though minor) omissions.
- 1 was due to a test that had been misplaced in the reorganization.
- 5 were IF tests that had been made redundant by the rewrite. They were removed.
-
- It took an hour to analyse the coverage results and write the needed
- tests. They found no bugs. Measuring the coverage for the augmented
- test suite revealed that I'd neglected to add one test file to the
- test suite's controlling script.
-
- Other Tests
-
- The 1.3 release also had other features, which were duly tested. For
- one feature, relational operator coverage forced the discovery of a
- bug. A coverage condition was impossible to satisfy because the code
- was wrong. I've found that loop, multi-condition, and relational
- operator coverage are cheap to satisfy, once you've satisfied branch
- coverage. This bug was severe enough that it alone justified the time
- I spent on coverage beyond branch.
-
- Impossible conditions due to bugs happen often enough that I believe
- goals like "85% coverage" are a mistake. The problem with such goals
- is that you don't look at the remaining 15%, deciding, without
- evidence, that they're either impossible or not worth satisfying. It's
- better - and not much more expensive - to decide each case on its
- merits.
-
- What Testing Missed
-
- Three bugs were discovered during beta testing, one after release (so
- far). I'll go into some detail, because they nicely illustrate the
- types of bugs that testing tends to miss.
-
- The first bug was a high level design omission. No testing technique
- would force its discovery. ("Make sure you instrument routines with a
- variable number of arguments, compile them with the GNU C
- compiler, and do that on a machine where GCC uses its own copy of
- <varargs.h>.") This is exactly the sort of bug that beta testing is
- all about.
-
- Fixing the bug required moderately extensive changes and additions,
- always dangerous just before a release. Sure enough, the fix
- contained two bugs of its own (perhaps because I rushed to meet a
- self-imposed deadline).
-
- - The first was a minor design omission. Some helpful code was added
- to warn GCC users iff they need to worry about <varargs.h>. This code
- made an assumption that was violated in one case. Coverage would not
- force a test to detect this bug, which is of the sort that's fixed by
- changing
-
- if (A && B)
-
- to
-
- if (A && B && C)
-
- It would have been nice if GCT had told me that "condition C,
- which you should have but don't, was never false", but this is more
- than a coverage tool can reasonably be expected to do. I found the
- bug by augmenting the acceptance test suite, which consists of
- instrumenting and running several large "real" programs. (GCT's test
- suite contains mostly small programs.) Instrumenting a new real
- program did the trick.
-
- - As part of the original fix, a particular manifest constant had to
- be replaced by another in some cases. I missed one of the cases. The
- result was that a few too few bytes of memory were allocated for a
- buffer and later code could write past the end. Existing tests did
- indeed force the bytes to be written past the end; however, this
- didn't cause a failure on my development machine (because the memory
- allocator rounds up). It did cause a failure on a different machine.
- Memory allocation bugs, like this one and the next, often slip past
- testing. [Later note: tools like Cahill's dbmalloc/Sentinel and Pure's
- Purify can move some of these bugs from the realm of the untestable to
- the realm of the testable. I now use dbmalloc, and it would have
- caught this bug.]
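-
- A hypothetical C sketch of that class of bug - the buffer is sized with the
- wrong constant, the tests really do write past the end, and a rounding
- allocator hides the damage on one machine but not another:
-
-     #include <stdlib.h>
-     #include <string.h>
-
-     #define OLD_PREFIX "gct-"           /* 4 characters  */
-     #define NEW_PREFIX "gct-mapfile-"   /* 12 characters */
-
-     char *make_name(const char *base)
-     {
-         /* Bug: still sized with the old constant, so the strcpy/strcat
-            below write several bytes past the end of the buffer.  If
-            malloc rounds small requests up, the overrun may land in the
-            slack and the tests happen to pass; on a machine that doesn't
-            round, the same tests corrupt the heap.  A bounds-checking
-            allocator such as dbmalloc flags it either way. */
-         char *name = malloc(strlen(OLD_PREFIX) + strlen(base) + 1);
-
-         if (name == NULL)
-             return NULL;
-         strcpy(name, NEW_PREFIX);
-         strcat(name, base);
-         return name;
-     }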
-
- The final bug was a classic: freeing memory that was not supposed to
- be freed. None of the tests caused the memory to be reused after
- freeing, but a real program did. I can envision an implementable type
- of coverage that would force detection of this bug, but it seems as
- though a code-read checklist ought to be better. I use such a
- checklist, but I still missed the bug.
-
-
- References
-
- [Rettig91] Marc Rettig, "Practical Programmer: Testing Made
- Palatable", CACM, May, 1991.
-
- [Marick91] Brian Marick, "Experience with the Cost of Different
- Coverage Goals for Testing", Pacific Northwest Software Quality
- Conference, 1991.
-
- [Marick92] Brian Marick, "A Tutorial Introduction to GCT",
- "Generic Coverage Tool (GCT) User's Guide", "GCT Troubleshooting",
- "Using Race Coverage with GCT", and "Using Weak Mutation Coverage with GCT",
- Testing Foundations, 1992.
-
- [Glass81] Robert L. Glass, "Persistent Software Errors", IEEE Transactions
- on Software Engineering, vol. SE-7, no. 2, pp. 162-168,
- March 1981.
-
- Brian Marick, marick@cs.uiuc.edu, uiucdcs!marick
- Testing Foundations: Consulting, Training, Tools.
-
-
- From pilchuck!phred!timf@uunet.uu.net@usage.csd.unsw.oz Fri Dec 18 12:33:31 1992
- >From pilchuck!phred!timf@uunet.uu.net Wed Dec 16 09:15:43 1992
- Message-Id: <m0n2CFL-000FCGC@data-io.com>
- Date: Wed, 16 Dec 92 17:15:43 -0800
- To: pilchuck!paulb@syacus.acus.oz.au
- Subject: Re: Value of High Code Coverage Metrics in Testing - Request for Opinion
- From: pilchuck!phred!timf@uunet.uu.net (Tim Farley)
- Status: RO
-
- This is probably getting to you really late since our news
- here is really slow, but...
-
- I think code coverage is a valuable metric, as long as it's
- not your only measurement of test completeness. As long as
- you're doing other things to assure that you have adequate
- data coverage for your tests, and can show functional coverage
- as well, you should come out okay. If you haven't already, you
- might want to check through the IEEE standard for measures to
- produce reliable software (982.2-1988) for some other coverage
- measurements.
-
- Depending on the tool you're using to calculate code coverage
- it can be very cost effective. The tool that I've used (TekDB
- with CCA) prints a variety of reports to show exactly what code
- has been tested and what hasn't. For engineers, this makes it
- really easy to get the coverage up since all you do is run some
- tests, generate the report, see what you haven't hit, then add
- some more tests for these. We usually start the engineers off
- with the high level test cases developed from the requirements
- specifications and design documents, have them run through those
- and check the coverage, then add in additional tests for what
- hasn't been hit. It's not difficult, and if time has already been
- scheduled for the engineer to do testing, it doesn't really add
- any more time (assuming a reasonable estimate for testing was
- given) and can actually decrease the amount of time if they aren't
- following any method with their testing.
-
- I saw someone else post something that said that 50% or so was
- a good target. We set our target at 95%, with the expectation
- that anything less than 100% had to include a description of
- why the remaining code had not been tested. Saying you didn't have
- enough time wasn't accepted. The probability of that code being
- executed, and the severity of a defect in the functionality
- directly supported by that code, had to be included.
-
- We always got at least 95% coverage, usually higher. With our
- products, turning on the unit and running very basic functions
- got about 60-70% coverage. You could get up to 90% or so with
- the basic error and boundary conditions. The rest were usually
- very rare hardware faults detected by the software. (I should
- add that these were all test and measurement devices with
- embedded software, in case you're wondering.)
-
- I don't know what kind of devices you're working on. I'm currently
- working on medical devices which have a lot more error detection
- software and redundant systems, and I'm not sure what it would take
- to get the same amount of coverage. However, I will set 100% as
- the goal, with anything less than that requiring a complete justification
- (which would still have to be approved).
-
- What development environment are you working with? Just curious.
-
- Tim Farley
- Sr. SQA Engineer
-
-