
                 Implementation Notes for SFLOAT,
                 a software Floating Point package
                       for 80X8X processors

                        By Robert L. Smith


INTRODUCTION:

This document contains various notes, thoughts, methods, etc. which
relate to the software floating-point package which I have called SFLOAT.
My main motivation for writing such a package is to help eliminate a common
complaint about FORTH, that it isn't useful because it doesn't have floating
point. There are various possible answers to that complaint. One answer is
that floating point is available in most commercial FORTH systems. On the
other hand, most people initially try FORTH in a public domain package, such
as F-83. In that case, they will probably not find a floating point package.
(One exception is a hardware floating point package for the PC clones.
Another exception is a low precision "Zen" floating point published by Martin
Tracy.) Another answer to the complaint is that floating point is not needed.
In a pure sense, that is certainly true. However, there are many times when
the human effort required to do a problem in fixed point arithmetic becomes
unreasonable for the problem at hand. In my particular case, I once wasted a
great deal of time trying to approximate, in fixed point, a function involving
the probability integral. The problem was trivial in floating point.

The floating point routines in this package are written for the most
part in assembly language (CODE) routines for a specific processor type, the
infamous Intel 8088 family. Why did I write routines in code rather than in
high level FORTH? For the usual reason of speed. In addition, detailed
arithmetic functions are somewhat easier to write in code rather than high
level because of the absence of the carry bit in FORTH. I chose the Intel
series because of its popularity. An equivalent package for the Motorola
68XXX series should be much easier to write because of the 32 bit
architecture. I will leave that to others.

It was my intent to produce a set of floating point routines which
are both fast and accurate. If you find errors or places where I can improve
things, please let me know.


REPRESENTATION:

I was a member of the IEEE P754 committee on Floating Point. At one
point I was strongly inclined to use the IEEE floating point representation.
There were two reasons why I rejected it. The first reason was that I tried
it, and found it very difficult to work with. The second reason is that
someone else had built a floating point package around the IEEE format
(Philip Koopman, Jr., MVP FORTH). According to reports from Glenn Haydon,
the routines were painfully slow.

Every representation method has its advantages and disadvantages.
I chose a 48 bit representation. The most significant bit is the sign bit.
The next 15 bits are used to indicate the power of 2, and the remaining 32
bits are used for the significand (or mantissa, as we used to say), without
suppressing the leading bit. The binary point is at the left of the leading
significant bit. This representation gives about 9 1/2 decimal digits of
accuracy (somewhat greater than the IEEE single precision representation)
and a range of 10^-4000 to 10^4000, roughly. That is far more than anyone
needs, but greatly reduces the need to test for overflow and underflow.
I chose to consider any unnormalized number (with the most significant bit
cleared) as zero. Finally, I set the bias on the exponent or power of 2 to
3FFF (Hex representation, obviously). I am not completely satisfied with the
last two decisions (representation of zero, and exponent bias), but I don't
have the energy to re-do all of the work just to see if the decisions should
be altered.
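
The layout can be pictured concretely in C. The following is only an
illustrative sketch with made-up names (SFLOAT itself is 8088 assembly
code, and a C double cannot even span the format's exponent range, so
the converter below is for explanation only):

    #include <math.h>
    #include <stdint.h>

    /* Sketch of the 48 bit format.  Bit 15 of `se' is the sign; bits
       0-14 hold the power of 2, biased by 0x3FFF; `frac' is the 32 bit
       significand with the leading bit explicit and the binary point
       to its left, so it stands for the value frac / 2^32. */
    typedef struct {
        uint16_t se;     /* sign and biased exponent */
        uint32_t frac;   /* normalized: top bit set, else zero */
    } sfloat;

    #define SF_BIAS 0x3FFF

    /* Value: (-1)^sign * 2^(exp - SF_BIAS) * (frac / 2^32). */
    static double sfloat_to_double(sfloat a)
    {
        if ((a.frac & 0x80000000u) == 0)    /* unnormalized => zero */
            return 0.0;
        double v = ldexp((double)a.frac,
                         ((a.se & 0x7FFF) - SF_BIAS) - 32);
        return (a.se & 0x8000) ? -v : v;
    }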

The current version uses a separate floating point stack. In my
initial version, I used the standard parameter stack to hold working floating
point numbers. This is allowed in the Forth Vendors Group Standard Floating
Point Extension. As of this writing, it appears that the ANSI Standard Forth
will specify a separate Floating Point stack, so I re-wrote the routines.


ACCURACY:

In the IEEE Floating Point Standard, the basic four functions (add,
subtract, multiply, and divide) and square root are required to be exact
to within one-half of the least significant bit. It is my intention to meet
that requirement. To understand how that is done, the reader should consult
some of the original papers. See, for example, "The INTEL Standard for
Floating Point," by John F. Palmer, Proceedings Compsac77, IEEE Catalog No.
77CH1291-4C. The basic idea is that all bits below the least significant
bit are to be considered. If these bits represent a number greater than
one-half of the least significant bit, then the result is incremented by 1.
If the bits represent a number less than one-half, the result is unchanged.
If the bits represent exactly one-half, then the least significant bit is
examined. If the least significant bit is 0, the result is unchanged.
Otherwise the result is incremented by 1. This process is called "rounding
to even." The trickiest part of this process occurs when subtracting two
numbers which have the same sign (or adding two numbers of opposite sign).
I know the casual reader of these notes is going to think that going to all
of this effort is gross overkill, but the mathematical literature has
numerous cases where this is important. In any case, if you are going off on
your own to do floating point arithmetic, please do not simply truncate the
results unless you have very good reasons for doing so. Some of the very
worst examples come from computers designed by Seymour Cray, whose philosophy
seems to be that the results don't have to be computed accurately as long as
they can be computed rapidly.
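
To make the rule concrete, here is a minimal sketch in C, assuming a
32 bit result `hi' and a 32 bit word `lo' that holds every bit shifted
out below the least significant bit (the names are hypothetical, not
taken from SFLOAT):

    #include <stdint.h>

    /* Round-to-nearest-even: `hi' is the significand of the result,
       `lo' collects all of the discarded low-order bits. */
    static uint32_t round_to_even(uint32_t hi, uint32_t lo)
    {
        if (lo > 0x80000000u)           /* discarded bits > 1/2 LSB  */
            hi += 1;
        else if (lo == 0x80000000u)     /* exactly 1/2: round to even */
            hi += hi & 1;
        /* otherwise truncate: discarded bits < 1/2 LSB */
        return hi;   /* a carry out of bit 31 wraps to 0; the caller
                        must detect that and renormalize */
    }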


DIVISION:

If you are interested in details, the floating point division routine
in this package is fairly interesting. It is my second version. The initial
version used a more or less standard technique of dividing by successive
subtraction and shifting. I stumbled on an interesting variation of Knuth's
method for division in the documentation of the Forth-based language called
"Abundance" by Roedy Green. In that package, he performs a double precision
divide by first shifting the divisor left until the most significant bit is
set, obtaining an unsigned quotient and remainder, then adjusting the result
based on the initial amount of shifting. In a floating point routine, the
operands are almost always normalized, so that the shifting is unnecessary,
except possibly for an initial right shift of the numerator by one place.

The method is somewhat similar to the technique that you learned in
grammar school, except that instead of working in base ten we use base "s",
where s = 2^16. In other words, each 16 bit word is considered as a digit
in this new base. We desire a two digit quotient and a two digit remainder.
We will first guess the leading digit of the quotient, then determine its
exact value, along with an appropriate remainder. Then we will determine the
second digit and final remainder the same way. It turns out that our initial
guess for each quotient digit must be no more than 2 greater than the correct
digit. At each stage the numerator consists of 3 digits, and the denominator
consists of 2 digits. We may define the first quotient digit and remainder
as follows:

        (s^2)*A + s*B + C            r1
        ----------------- = q1 + ------
              sD + E             sD + E

The 3 terms in the numerator account for a possible right shift in the case
that the initial numerator was greater than or equal to the denominator. We
thus can guarantee that A <= D, and q1 < s. We also require that the
remainder r1 < (sD + E). With a small amount of algebraic manipulation we
can see that

        r1 = (s^2)*A + s*B + C - s*q1*D - q1*E

That doesn't appear to be of much help, but just hold on. We will make our
initial guess for q1 based on a single precision division as follows:

        s*A + B        r0
        ------- = g0 + ---
           D            D

The key to simplifying this calculation is that we do not ignore the remainder
in our trial division. Note that the remainder in this division is

        r0 = s*A + B - g0*D

Assume for the time being that q1 = g0. It may not be, but our equations
will still be exact! We can calculate a remainder r1' which is

        r1' = s*r0 + C - g0*E

by making a simple single precision multiply and a double precision
subtraction. Our initial guess g0 may have been too large; if it was,
r1' will be negative. If r1' >= 0, our initial guess is correct.

If r1' is negative, we may correct for that by simultaneously adding
the denominator (sD + E) to r1' and decrementing g0 by 1. We check again to
see if the new remainder r1'' is negative. If r1'' is also negative, make a
final correction by adding the denominator to r1'' and decrementing g0 again.
This final quotient is correct, and the remainder is also correct.

The process is then repeated for the second quotient digit, by using
the first remainder as the new numerator and then estimating the quotient and
remainder as above. The result is that we can calculate the double precision
quotient by using two single precision divides and two single precision
multiplies, and some trivial additions and subtractions.

The final part of this process is to round the quotient precisely.
This is done by comparing twice the remainder with the divisor. If 2*rem is
less than the divisor, the quotient is correct. If 2*rem is greater than the
divisor, the quotient must be incremented by 1. If 2*rem is exactly equal to
the divisor, we round to even. That is, if the least significant bit in the
quotient is 1, then add 1 to the quotient. Otherwise, do not add anything to
the quotient.
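
In C, assuming a 32 bit quotient q, its exact remainder rem, and the
divisor d (hypothetical names again), the final rounding might look
like this:

    #include <stdint.h>

    /* Round the quotient from its exact remainder; the doubled
       remainder is widened to 64 bits so it cannot overflow. */
    static uint32_t round_quotient(uint32_t q, uint32_t rem, uint32_t d)
    {
        uint64_t twice = (uint64_t)rem << 1;
        if (twice > d)           /* error exceeds 1/2 LSB          */
            q += 1;
        else if (twice == d)     /* exactly 1/2 LSB: round to even */
            q += q & 1;
        return q;                /* caller renormalizes on carry   */
    }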

There is an additional detail in the above: it is possible that
A = D, even though (sA + B) < (sD + E). If we were to just divide, we
would encounter an overflow situation. Therefore, consider the trial
quotient to be g0 = s - 1. The zero-th remainder is r0 = sA + B - (sD - D),
and the first remainder is

        r1 = (s^2)*A + s*B + C - (s^2)*D + s*(D - E) + E

           = s*(B + D - E) + (C + E)

Again, the initial guess may be too large, which may be discovered by checking
if r1 is negative. If it is, then one or two corrections may need to be made,
as in the previous discussion.


LOGARITHM:

These notes are for those few souls who are interested in the details
of implementation of the FLN function for software floating point. As noted
elsewhere, the basic format for floating point is a 16 bit sign and biased
exponent, followed by a 32 bit normalized fraction. The biased exponent
occupies bits 0 through 14 with a bias of 3FFF (hexadecimal), and the sign
bit is bit 15 of the first 16 bits. The binary point of the fraction is just
to the left of the 32 bit fractional part (sometimes called the mantissa).
If the most significant bit of the mantissa is zero, the whole number is
considered zero.

If the argument to the FLN function is negative or zero, it is an
error condition. After testing for those conditions, we then break down the
problem into two parts, depending on the magnitude of the fraction. If the
fraction is greater than or equal to 0.78125 (and, by definition, it must be
less than 1.0), then we note that

        ln(X) = ln((2^I) * 0.f) = I*ln(2) + ln(0.f)

where I is the unbiased exponent and 0.f is the fractional part. Calculation
of ln(0.f) is based on the approximation:

        ln(1-x) = -(x + (x^2)/2 + (x^3)/3 + (x^4)/4 + . . . )

It may be noted in passing that there are better approximations than the
above, but this one is selected to allow a crucial division to be made using
single precision divisors. It is necessary to reduce the range of x in the
above equation. (If we did not, it would be necessary to use about 24 terms
in the series when x = 0.25.) This is accomplished by dividing 0.f into
intervals of 2^-5 or 1/32. The procedure is as follows:

        0.f = 1 - f1 ,  where f1 = 1 - 0.f

        f1 * 2^5 = J + f2 * 2^5

        0.f = 1 - f1 = 1 - (2^-5)*J - f2

            = (1 - J*(2^-5)) * [1 - f2/(1 - J*(2^-5))]

where J is an integer. The calculation can proceed by using a table of
values for the logarithm of (1 - J*(2^-5)), and an approximation for
ln(1-x), where

        x = f2/(1 - J*(2^-5))

In this case, the maximum value for x is 0.0435. The series approximation
for 33 bits of accuracy can terminate at the term (x^7)/7. In order to
maximize the accuracy of the calculation, we proceed using y = x*2^4.
The maximum value of y is 0.696. We then proceed to calculate:

        (2^4)*(-ln(1-x)) = y + (y^2)*(2^-5) + (2/3)*(y^3)*(2^-9)
                             + (y^4)*(2^-14) + (4/5)*(y^5)*(2^-18)
                             + (4/6)*(y^6)*(2^-22) + (8/7)*(y^7)*(2^-27)

The last term is kept only for rounding purposes. The factor (8/7) may be
approximated by 1. It is necessary to use double precision in only the
first 4 terms (well, maybe only the first 3, with extreme care). The above
equation should be factored, and the calculations made from right to left so
that the smallest terms are calculated first. (Writing equations in ASCII
is tedious, and reading them is probably hellish, at best.) The calculation
is roughly as follows:

        -(2^4)*ln(1-x) = (((((y*(2^-5) + (2/3))*y*(2^-4)
                         + (4/5))*(y^2)*(2^-4) + Y)*(2^-5)
                         + (2/3))*(Y^3)*(2^-4) + (Y^2))*(2^-5) + Y

where Y indicates the double precision value and y the single precision
value.
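
As a sanity check, the factored form can be evaluated in ordinary
double arithmetic, where the distinction between y and Y disappears.
This is only a verification sketch, not the SFLOAT code; it compares
the factored form against the C library logarithm over the reduced
range (the (8/7) factor is approximated by 1, as stated above):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* y = 16*x, where x = f2/(1 - J*(2^-5)) is at most ~0.0435 */
        for (double y = 0.05; y < 0.71; y += 0.05) {
            double x = y / 16.0;
            double v = (((((y*ldexp(1.0, -5) + 2.0/3.0)*y*ldexp(1.0, -4)
                      + 4.0/5.0)*y*y*ldexp(1.0, -4) + y)*ldexp(1.0, -5)
                      + 2.0/3.0)*y*y*y*ldexp(1.0, -4) + y*y)*ldexp(1.0, -5)
                      + y;
            /* exact value of (2^4)*(-ln(1-x)) for comparison */
            printf("y=%.2f  err=%.3e\n", y, v - (-16.0 * log1p(-x)));
        }
        return 0;
    }

The printed errors should stay near the 33 bit level over the whole
reduced range.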

The case of the initial fraction 0.f being less than 0.78125 proceeds
along somewhat similar lines to the above. We first note that the desired
result may be expressed as

        ln(X) = ln(2^(I-1)*(2*0.f)) = (I-1)*ln(2) + ln(2*0.f)

where 1.0 <= (2*0.f) < 1.5625. In a manner similar to the first case we
divide the interval up into 19 subintervals of 2^-5 or 1/32 each. The
procedure is as follows:

        2*0.f = 1 + f1 ,  where f1 = (2*0.f) - 1

        f1 * (2^5) = J + f2 * (2^5)

        2*0.f = (1 + J*(2^-5)) * [1 + f2/(1 + J*(2^-5))]

where J is an integer. The calculation can proceed by using a table of
values for the logarithm of (1 + J*(2^-5)), and an approximation for
ln(1+x), where

        x = f2/(1 + J*(2^-5))

In this case, the maximum value for x is 0.03125. The approximation
proceeds much as in the first case. The approximation used is:

        ln(1+x) = x - (x^2)/2 + (x^3)/3 - (x^4)/4 + (x^5)/5 - (x^6)/6 + (x^7)/8

(The last coefficient is 1/8 rather than the exact 1/7; this matches the
first case, where the factor (8/7) was approximated by 1.)

Let x = (2^-5)*Y. The calculation proceeds along the lines indicated by:

        (2^5)*ln(1+x) = Y - (2^-6)*(Y^2 - (2^-5)*(Y^3)*(2/3
                        - (2^-6)*(y - (2^-5)*(y^2)*(4/5
                        - (2^-5)*y*(2/3 - y*(2^-6))))))

where, as before, Y indicates the double precision value and y the single
precision value.
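
The same kind of double precision sanity check works here (again, just
a verification sketch; as noted, the Y^7 coefficient effectively uses
1/8 in place of 1/7):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* x = f2/(1 + J*(2^-5)) is at most 0.03125 in this case */
        for (double x = 0.002; x < 0.0315; x += 0.002) {
            double Y = 32.0 * x;
            double v = Y - ldexp(1.0, -6)*(Y*Y - ldexp(1.0, -5)*Y*Y*Y
                     * (2.0/3.0 - ldexp(1.0, -6)*(Y - ldexp(1.0, -5)*Y*Y
                     * (4.0/5.0 - ldexp(1.0, -5)*Y
                     * (2.0/3.0 - Y*ldexp(1.0, -6))))));
            /* exact value of (2^5)*ln(1+x) for comparison */
            printf("x=%.4f  err=%.3e\n", x, v - 32.0 * log1p(x));
        }
        return 0;
    }

If the factored form has been transcribed correctly, the printed errors
again stay near the 33 bit level discussed above.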