-
- <HTML>
- <HEAD>
- <TITLE>Statistics::Descriptive - Module of basic descriptive statistical functions.</TITLE>
- <LINK REL="stylesheet" HREF="../../../Active.css" TYPE="text/css">
- <LINK REV="made" HREF="mailto:">
- </HEAD>
-
- <BODY>
- <TABLE BORDER=0 CELLPADDING=0 CELLSPACING=0 WIDTH=100%>
- <TR><TD CLASS=block VALIGN=MIDDLE WIDTH=100% BGCOLOR="#cccccc">
- <STRONG><P CLASS=block> Statistics::Descriptive - Module of basic descriptive statistical functions.</P></STRONG>
- </TD></TR>
- </TABLE>
-
- <A NAME="__index__"></A>
- <!-- INDEX BEGIN -->
-
- <UL>
-
- <LI><A HREF="#name">NAME</A></LI><LI><A HREF="#supportedplatforms">SUPPORTED PLATFORMS</A></LI>
-
- <LI><A HREF="#synopsis">SYNOPSIS</A></LI>
- <LI><A HREF="#description">DESCRIPTION</A></LI>
- <LI><A HREF="#methods">METHODS</A></LI>
- <UL>
-
- <LI><A HREF="#sparse methods">Sparse Methods</A></LI>
- <LI><A HREF="#full methods">Full Methods</A></LI>
- </UL>
-
- <LI><A HREF="#reporting errors">REPORTING ERRORS</A></LI>
- <LI><A HREF="#references">REFERENCES</A></LI>
- <LI><A HREF="#copyright">COPYRIGHT</A></LI>
- <LI><A HREF="#revision history">REVISION HISTORY</A></LI>
- </UL>
- <!-- INDEX END -->
-
- <HR>
- <P>
- <H1><A NAME="name">NAME</A></H1>
- <P>Statistics::Descriptive - Module of basic descriptive statistical functions.</P>
- <P>
- <HR>
- <H1><A NAME="supportedplatforms">SUPPORTED PLATFORMS</A></H1>
- <UL>
- <LI>Linux</LI>
- <LI>Solaris</LI>
- <LI>Windows</LI>
- </UL>
- <HR>
- <H1><A NAME="synopsis">SYNOPSIS</A></H1>
- <PRE>
- use Statistics::Descriptive;
- $stat = Statistics::Descriptive::Full->new();
- $stat->add_data(1,2,3,4);
- $mean = $stat->mean();
- $var = $stat->variance();
- $tm = $stat->trimmed_mean(.25);
- $Statistics::Descriptive::Tolerance = 1e-10;</PRE>
- <P>
- <HR>
- <H1><A NAME="description">DESCRIPTION</A></H1>
- <P>This module provides basic functions used in descriptive statistics.
- It has an object oriented design and supports two different types of
- data storage and calculation objects: sparse and full. With the sparse
- method, none of the data is stored and only a few statistical measures
- are available. Using the full method, the entire data set is retained
- and additional functions are available.</P>
- <P>Whenever a division by zero may occur, the denominator is checked to be
- greater than the value <CODE>$Statistics::Descriptive::Tolerance</CODE>, which
- defaults to 0.0. You may want to change this value to some small
- positive value such as 1e-24 in order to obtain error messages in case
- of very small denominators.</P>
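- <P>The check described above can be sketched in a few lines. This is an
- illustration, not the module's own code: the helper name
- <CODE>checked_ratio</CODE> and the variable <CODE>$tolerance</CODE> are hypothetical, and the
- sketch assumes that the magnitude of the denominator is compared and
- that a failed check is signalled by returning undef.</P>

```perl
use strict;
use warnings;

# Hypothetical stand-in for $Statistics::Descriptive::Tolerance.
my $tolerance = 1e-24;

# Hypothetical helper: divide only when the denominator is safely
# away from zero.
sub checked_ratio {
    my ($numerator, $denominator) = @_;
    # Assumption: the magnitude of the denominator is what gets compared.
    return undef unless abs($denominator) > $tolerance;
    return $numerator / $denominator;
}

my $ratio = checked_ratio(1, 1e-30);
print defined $ratio ? "ratio = $ratio\n" : "denominator too small\n";
```

- <P>With a tolerance of 0.0 only an exact zero is rejected; a small positive
- tolerance also rejects denominators that are merely close to zero.</P>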
- <P>
- <HR>
- <H1><A NAME="methods">METHODS</A></H1>
- <P>
- <H2><A NAME="sparse methods">Sparse Methods</A></H2>
- <DL>
- <DT><STRONG><A NAME="item_new">$stat = Statistics::Descriptive::Sparse->new();</A></STRONG><BR>
- <DD>
- Create a new sparse statistics object.
- <P></P>
- <DT><STRONG><A NAME="item_add_data">$stat->add_data(1,2,3);</A></STRONG><BR>
- <DD>
- Adds data to the statistics variable. The cached statistical values are
- updated automatically.
- <P></P>
- <DT><STRONG><A NAME="item_count">$stat->count();</A></STRONG><BR>
- <DD>
- Returns the number of data items.
- <P></P>
- <DT><STRONG><A NAME="item_mean">$stat->mean();</A></STRONG><BR>
- <DD>
- Returns the mean of the data.
- <P></P>
- <DT><STRONG><A NAME="item_sum">$stat->sum();</A></STRONG><BR>
- <DD>
- Returns the sum of the data.
- <P></P>
- <DT><STRONG><A NAME="item_variance">$stat->variance();</A></STRONG><BR>
- <DD>
- Returns the variance of the data. Division by n-1 is used.
- <P></P>
- <DT><STRONG><A NAME="item_standard_deviation">$stat->standard_deviation();</A></STRONG><BR>
- <DD>
- Returns the standard deviation of the data. Division by n-1 is used.
- <P></P>
- <DT><STRONG><A NAME="item_min">$stat->min();</A></STRONG><BR>
- <DD>
- Returns the minimum value of the data set.
- <P></P>
- <DT><STRONG><A NAME="item_mindex">$stat->mindex();</A></STRONG><BR>
- <DD>
- Returns the index of the minimum value of the data set.
- <P></P>
- <DT><STRONG><A NAME="item_max">$stat->max();</A></STRONG><BR>
- <DD>
- Returns the maximum value of the data set.
- <P></P>
- <DT><STRONG><A NAME="item_maxdex">$stat->maxdex();</A></STRONG><BR>
- <DD>
- Returns the index of the maximum value of the data set.
- <P></P>
- <DT><STRONG><A NAME="item_sample_range">$stat->sample_range();</A></STRONG><BR>
- <DD>
- Returns the sample range (max - min) of the data set.
- <P></P></DL>
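- <P>To show why the sparse measures do not need the data to be stored, here
- is a minimal sketch that keeps only running totals. It is not the
- module's source: the function names are hypothetical, and the variance
- uses the textbook sum-of-squares formula with division by n-1, which is
- simple but can lose precision on data with a large mean.</P>

```perl
use strict;
use warnings;

# Hypothetical helper: a fresh set of running totals.
sub new_running_stats {
    return { count => 0, sum => 0, sumsq => 0,
             min => undef, max => undef, mindex => undef, maxdex => undef };
}

# Fold new values into the totals; the values themselves are not kept.
sub add_values {
    my ($s, @values) = @_;
    for my $x (@values) {
        if (!defined $s->{min} || $x < $s->{min}) {
            $s->{min}    = $x;
            $s->{mindex} = $s->{count};    # zero-based position of the minimum
        }
        if (!defined $s->{max} || $x > $s->{max}) {
            $s->{max}    = $x;
            $s->{maxdex} = $s->{count};    # zero-based position of the maximum
        }
        $s->{count}++;
        $s->{sum}   += $x;
        $s->{sumsq} += $x * $x;
    }
    return $s;
}

sub running_mean {
    my ($s) = @_;
    return $s->{count} ? $s->{sum} / $s->{count} : undef;
}

# Sample variance (division by n-1); undefined for fewer than two items.
sub running_variance {
    my ($s) = @_;
    my $n = $s->{count};
    return undef if $n < 2;
    return ($s->{sumsq} - $n * running_mean($s)**2) / ($n - 1);
}

my $s = add_values(new_running_stats(), 1, 2, 3, 4);
printf "count=%d mean=%s variance=%.4f range=%s\n",
    $s->{count}, running_mean($s), running_variance($s), $s->{max} - $s->{min};
```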
- <P>
- <H2><A NAME="full methods">Full Methods</A></H2>
- <DL>
- <DT><STRONG>$stat = Statistics::Descriptive::Full->new();</STRONG><BR>
- <DD>
- Create a new statistics object that inherits from
- Statistics::Descriptive::Sparse so that it contains all the methods
- described above.
- <P></P>
- <DT><STRONG>$stat->add_data(1,2,4,5);</STRONG><BR>
- <DD>
- Adds data to the statistics variable. All of the sparse statistical
- values are updated and cached. Cached values from Full methods are
- deleted since they are no longer valid.
- <P><EM>Note: Calling add_data with an empty array will delete all of your
- Full method cached values!</EM></P>
- <P></P>
- <DT><STRONG><A NAME="item_get_data">$stat->get_data();</A></STRONG><BR>
- <DD>
- Returns a copy of the data array.
- <P></P>
- <DT><STRONG><A NAME="item_sort_data">$stat->sort_data();</A></STRONG><BR>
- <DD>
- Sorts the stored data and updates the values returned by mindex and
- maxdex. This method uses Perl's built-in <CODE>sort</CODE>.
- <P></P>
- <DT><STRONG><A NAME="item_presorted">$stat->presorted(1);</A></STRONG><BR>
- <DD>
- <DT><STRONG>$stat->presorted();</STRONG><BR>
- <DD>
- If called with a non-zero argument, this method sets a flag that says
- the data is already sorted and need not be sorted again. Since some of
- the methods in this class require sorted data, this saves some time.
- If you supply sorted data to the object, call this method to prevent
- the data from being sorted again. The flag is cleared whenever add_data
- is called. Calling the method without an argument returns the value of
- the flag.
- <P></P>
- <DT><STRONG><A NAME="item_percentile">$x = $stat->percentile(25);</A></STRONG><BR>
- <DD>
- <DT><STRONG>($x, $index) = $stat->percentile(25);</STRONG><BR>
- <DD>
- Sorts the data and returns the value that corresponds to the
- percentile as defined in RFC2330:
- <PRE>
- For example, given the 6 measurements:</PRE>
- <PRE>
- -2, 7, 7, 4, 18, -5</PRE>
- <PRE>
- Then F(-8) = 0, F(-5) = 1/6, F(-5.0001) = 0, F(-4.999) = 1/6, F(7) =
- 5/6, F(18) = 1, F(239) = 1.</PRE>
- <PRE>
- Note that we can recover the different measured values and how many
- times each occurred from F(x) -- no information regarding the range
- in values is lost. Summarizing measurements using histograms, on the
- other hand, in general loses information about the different values
- observed, so the EDF is preferred.</PRE>
- <PRE>
- Using either the EDF or a histogram, however, we do lose information
- regarding the order in which the values were observed. Whether this
- loss is potentially significant will depend on the metric being
- measured.</PRE>
- <PRE>
- We will use the term "percentile" to refer to the smallest value of x
- for which F(x) >= a given percentage. So the 50th percentile of the
- example above is 4, since F(4) = 3/6 = 50%; the 25th percentile is
- -2, since F(-5) = 1/6 < 25%, and F(-2) = 2/6 >= 25%; the 100th
- percentile is 18; and the 0th percentile is -infinity, as is the 15th
- percentile.</PRE>
- <PRE>
- Care must be taken when using percentiles to summarize a sample,
- because they can lend an unwarranted appearance of more precision
- than is really available. Any such summary must include the sample
- size N, because any percentile difference finer than 1/N is below the
- resolution of the sample.</PRE>
- <P>Taken from
- RFC2330, Framework for IP Performance Metrics,
- Section 11.3, Defining Statistical Distributions.</P>
- <P>RFC2330 is available from:
- <A HREF="http://www.cis.ohio-state.edu/htbin/rfc/rfc2330.html">http://www.cis.ohio-state.edu/htbin/rfc/rfc2330.html</A></P>
- <P>If the percentile method is called in a list context then it will
- also return the index of the percentile.</P>
- <P></P>
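- <P>The definition quoted above ("the smallest value of x for which F(x) >= a
- given percentage") can be sketched as follows, using the six
- measurements from the RFC. The function name <CODE>rfc2330_percentile</CODE> is
- hypothetical and the code is an illustration of the definition, not the
- module's implementation. Returning undef for the 0th percentile, in place
- of -infinity, is an assumption of this sketch.</P>

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# Hypothetical helper: percentile by the RFC2330 definition.
# Returns the value in scalar context and (value, index) in list context.
sub rfc2330_percentile {
    my ($p, @data) = @_;
    my @sorted = sort { $a <=> $b } @data;
    my $n      = @sorted;
    # The smallest sorted element whose cumulative share k/n reaches p%
    # is element number ceil(p*n/100), counting from 1.
    my $index = ceil($p * $n / 100) - 1;
    return if $index < 0;                  # nothing qualifies (0th percentile)
    $index = $n - 1 if $index > $n - 1;    # clamp percentages above 100
    return wantarray ? ($sorted[$index], $index) : $sorted[$index];
}

my @measurements = (-2, 7, 7, 4, 18, -5);
for my $p (25, 50, 100) {
    my ($value, $index) = rfc2330_percentile($p, @measurements);
    print "percentile $p: value $value at sorted index $index\n";
}
```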
- <DT><STRONG><A NAME="item_median">$stat->median();</A></STRONG><BR>
- <DD>
- Sorts the data and returns the median value of the data.
- <P></P>
- <DT><STRONG><A NAME="item_harmonic_mean">$stat->harmonic_mean();</A></STRONG><BR>
- <DD>
- Returns the harmonic mean of the data. Since the mean is undefined
- if any of the data are zero or if the sum of the reciprocals is zero,
- it will return undef for both of those cases.
- <P></P>
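- <P>The two undefined cases mentioned above follow directly from the formula
- n / (1/x1 + ... + 1/xn). A minimal sketch, with the hypothetical name
- <CODE>harmonic_mean_sketch</CODE> and no claim to match the module's code line for
- line:</P>

```perl
use strict;
use warnings;

# Hypothetical helper: harmonic mean, or undef where it is undefined.
sub harmonic_mean_sketch {
    my @data = @_;
    return undef unless @data;
    my $sum_of_reciprocals = 0;
    for my $x (@data) {
        return undef if $x == 0;              # 1/0 is undefined
        $sum_of_reciprocals += 1 / $x;
    }
    return undef if $sum_of_reciprocals == 0; # e.g. the data 1 and -1
    return scalar(@data) / $sum_of_reciprocals;
}

my $hm = harmonic_mean_sketch(1, 2, 4);
printf "harmonic mean = %.4f\n", $hm;
```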
- <DT><STRONG><A NAME="item_geometric_mean">$stat->geometric_mean();</A></STRONG><BR>
- <DD>
- Returns the geometric mean of the data.
- <P></P>
- <DT><STRONG><A NAME="item_mode">$stat->mode();</A></STRONG><BR>
- <DD>
- Returns the mode of the data.
- <P></P>
- <DT><STRONG><A NAME="item_trimmed_mean">$stat->trimmed_mean(ltrim[,utrim]);</A></STRONG><BR>
- <DD>
- <A HREF="#item_trimmed_mean"><CODE>trimmed_mean(ltrim)</CODE></A> returns the mean with a fraction <CODE>ltrim</CODE>
- of entries at each end dropped. <A HREF="#item_trimmed_mean"><CODE>trimmed_mean(ltrim,utrim)</CODE></A>
- returns the mean after a fraction <CODE>ltrim</CODE> has been removed from the
- lower end of the data and a fraction <CODE>utrim</CODE> has been removed from the
- upper end of the data. This method sorts the data before beginning
- to analyze it.
- <P></P>
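- <P>The trimming described above can be sketched as below. The name
- <CODE>trimmed_mean_sketch</CODE> is hypothetical, and the sketch assumes that each
- fraction is turned into a whole number of entries by rounding down; the
- module may round differently.</P>

```perl
use strict;
use warnings;

# Hypothetical helper: mean of the data after dropping a fraction of
# entries from the lower end and from the upper end.
sub trimmed_mean_sketch {
    my ($data, $ltrim, $utrim) = @_;
    $utrim = $ltrim unless defined $utrim;    # one argument trims both ends
    my @sorted = sort { $a <=> $b } @$data;
    my $n      = @sorted;
    # Assumption: the number of dropped entries is rounded down.
    my $first = int($ltrim * $n);             # first index kept
    my $last  = $n - int($utrim * $n) - 1;    # last index kept
    return undef if $first > $last;           # everything was trimmed away
    my $sum = 0;
    $sum += $_ for @sorted[$first .. $last];
    return $sum / ($last - $first + 1);
}

print trimmed_mean_sketch([1, 2, 3, 4], .25), "\n";
print trimmed_mean_sketch([1, 2, 3, 4, 100], 0, .2), "\n";
```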
- <DT><STRONG><A NAME="item_frequency_distribution">$stat->frequency_distribution();</A></STRONG><BR>
- <DD>
- <A HREF="#item_frequency_distribution"><CODE>frequency_distribution(partitions)</CODE></A> slices the data into <CODE>partitions</CODE>
- sets (where <CODE>partitions</CODE> is greater than 1) and counts the number of items
- that fall into each partition. It returns an associative array where
- the keys are the numerical upper bounds of the partitions used. The minimum
- value of the data set is not a key and the maximum value of the data
- set is always a key. The number of entries for a particular partition
- key is the number of items which are greater than the previous
- partition key and less than or equal to the current partition key. As
- an example,
- <PRE>
- $stat->add_data(1,1.5,2,2.5,3,3.5,4);
- %f = $stat->frequency_distribution(2);
- for (sort {$a <=> $b} keys %f) {
- print "key = $_, count = $f{$_}\n";
- }</PRE>
- <P>prints</P>
- <PRE>
- key = 2.5, count = 4
- key = 4, count = 3</PRE>
- <P>since there are four items less than or equal to 2.5, and three items
- greater than 2.5 and less than or equal to 4.</P>
- <P></P>
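- <P>The partitioning rule above can be sketched in plain Perl. The name
- <CODE>frequency_distribution_sketch</CODE> is hypothetical, and equal-width partitions
- between the minimum and the maximum are an assumption that is consistent
- with the example output shown above.</P>

```perl
use strict;
use warnings;

# Hypothetical helper: count how many items fall at or below each
# partition key and above the previous key.
sub frequency_distribution_sketch {
    my ($partitions, @data) = @_;
    return unless $partitions > 1 && @data;
    my @sorted = sort { $a <=> $b } @data;
    my ($min, $max) = @sorted[0, -1];
    return unless $max > $min;               # no spread, nothing to partition
    # Assumption: partitions have equal width between min and max.
    my $width = ($max - $min) / $partitions;
    my @keys  = map { $min + $width * $_ } 1 .. $partitions - 1;
    push @keys, $max;                        # the maximum is always a key
    my %count = map { $_ => 0 } @keys;
    for my $x (@sorted) {
        for my $key (@keys) {
            if ($x <= $key) { $count{$key}++; last; }
        }
    }
    return %count;
}

my %f = frequency_distribution_sketch(2, 1, 1.5, 2, 2.5, 3, 3.5, 4);
for (sort { $a <=> $b } keys %f) {
    print "key = $_, count = $f{$_}\n";
}
```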
- <DT><STRONG><A NAME="item_least_squares_fit">$stat->least_squares_fit();</A></STRONG><BR>
- <DD>
- <DT><STRONG>$stat->least_squares_fit(@x);</STRONG><BR>
- <DD>
- <A HREF="#item_least_squares_fit"><CODE>least_squares_fit()</CODE></A> performs a least squares fit on the data,
- assuming a domain of <CODE>@x</CODE> or a default of 1..$stat->count(). It
- returns an array of four elements <CODE>($q, $m, $r, $rms)</CODE> where
- <DL>
- <DT><STRONG><A NAME="item_%24q_and_%24m"><CODE>$q and $m</CODE></A></STRONG><BR>
- <DD>
- satisfy the equation <CODE>$y = $m*$x + $q</CODE>.
- <P></P>
- <DT><STRONG><A NAME="item_%24r"><CODE>$r</CODE></A></STRONG><BR>
- <DD>
- is the Pearson linear correlation coefficient.
- <P></P>
- <DT><STRONG><A NAME="item_%24rms"><CODE>$rms</CODE></A></STRONG><BR>
- <DD>
- is the root-mean-square error.
- <P></P></DL>
- <P>In case of error or division by zero, the empty list is returned.</P>
- <P>The array that is returned can be ``coerced'' into a hash structure
- by doing the following:</P>
- <PRE>
- my %hash = ();
- @hash{'q', 'm', 'r', 'err'} = $stat->least_squares_fit();</PRE>
- </DL>
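- <P>For readers who want to see what the four values returned by
- <CODE>least_squares_fit()</CODE> mean, here is a minimal sketch of an ordinary
- least squares fit. The name <CODE>least_squares_fit_sketch</CODE> is hypothetical, and
- taking the root-mean-square error as the square root of the mean squared
- residual (division by n) is an assumption. The sketch returns the empty
- list when a denominator would be zero.</P>

```perl
use strict;
use warnings;

# Hypothetical helper: fit y = m*x + q and return ($q, $m, $r, $rms).
# $y and $x are array references; $x defaults to 1..n.
sub least_squares_fit_sketch {
    my ($y, $x) = @_;
    my $n = @$y;
    $x = [1 .. $n] unless defined $x;
    return () if $n < 2 || @$x != $n;

    my ($sx, $sy, $sxx, $syy, $sxy) = (0) x 5;
    for my $i (0 .. $n - 1) {
        $sx  += $x->[$i];
        $sy  += $y->[$i];
        $sxx += $x->[$i]**2;
        $syy += $y->[$i]**2;
        $sxy += $x->[$i] * $y->[$i];
    }
    my $dx = $n * $sxx - $sx**2;    # n^2 times the variance of x
    my $dy = $n * $syy - $sy**2;    # n^2 times the variance of y
    return () if $dx <= 0 || $dy <= 0;

    my $m = ($n * $sxy - $sx * $sy) / $dx;               # slope
    my $q = ($sy - $m * $sx) / $n;                       # intercept
    my $r = ($n * $sxy - $sx * $sy) / sqrt($dx * $dy);   # Pearson r

    my $err2 = 0;
    $err2 += ($y->[$_] - ($m * $x->[$_] + $q))**2 for 0 .. $n - 1;
    my $rms = sqrt($err2 / $n);     # assumption: division by n

    return ($q, $m, $r, $rms);
}

my ($q, $m, $r, $rms) = least_squares_fit_sketch([1, 2, 4, 5]);
printf "y = %.2f*x + %.2f, r = %.4f, rms = %.4f\n", $m, $q, $r, $rms;
```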
- <P>
- <HR>
- <H1><A NAME="reporting errors">REPORTING ERRORS</A></H1>
- <P>I read 4 of the 5 perl newsgroups
- comp.lang.perl.{misc,moderated,modules,announce} and check my email at
- work frequently, so please feel free to post errors to either or both
- of those places. However, realize that if you post to the newsgroup it
- has the benefit of alerting other users of the problem. When reporting
- errors, please include the following to help me out:</P>
- <UL>
- <LI>
- Your version of perl. This can be obtained by typing <CODE>perl -v</CODE> at
- the command line.
- <P></P>
- <LI>
- Which version of Statistics::Descriptive you're using. As you can
- see below, I do make mistakes. Unfortunately for me, right now
- there are thousands of CDs with the version of this module with
- the bugs in it. Fortunately for you, I'm a very patient module
- maintainer.
- <P></P>
- <LI>
- Details about what the error is. Try to narrow down the scope
- of the problem and send me code that I can run to verify and
- track it down.
- <P></P></UL>
- <P>My email address can be found at www.perl.com under Who's Who.</P>
- <P>
- <HR>
- <H1><A NAME="references">REFERENCES</A></H1>
- <P>RFC2330, Framework for IP Performance Metrics</P>
- <P>The Art of Computer Programming, Volume 2, Donald Knuth.</P>
- <P>Handbook of Mathematical Functions, Milton Abramowitz and Irene Stegun.</P>
- <P>Probability and Statistics for Engineering and the Sciences, Jay Devore.</P>
- <P>
- <HR>
- <H1><A NAME="copyright">COPYRIGHT</A></H1>
- <P>Copyright (c) 1997,1998 Colin Kuskie. All rights reserved. This
- program is free software; you can redistribute it and/or modify it
- under the same terms as Perl itself.</P>
- <P>Copyright (c) 1998 Andrea Spinelli. All rights reserved. This program
- is free software; you can redistribute it and/or modify it under the
- same terms as Perl itself.</P>
- <P>Copyright (c) 1994,1995 Jason Kastner. All rights
- reserved. This program is free software; you can redistribute it
- and/or modify it under the same terms as Perl itself.</P>
- <P>
- <HR>
- <H1><A NAME="revision history">REVISION HISTORY</A></H1>
- <DL>
- <DT><STRONG><A NAME="item_v2%2E3">v2.3</A></STRONG><BR>
- <DD>
- Rolled into November 1998
- <P>Code provided by Andrea Spinelli to prevent division by zero and to
- make consistent return values for undefined behavior. Andrea also
- provided a test bench for the module.</P>
- <P>A bug fix for the calculation of frequency distributions. Thanks to Nick
- Tolli for alerting me to this.</P>
- <P>Added 4 lines of code to Makefile.PL to make it easier for the ActiveState
- installation tool to use. Changes work fine in perl5.004_04, haven't
- tested them under perl5.005xx yet.</P>
- <P></P>
- <DT><STRONG><A NAME="item_v2%2E2">v2.2</A></STRONG><BR>
- <DD>
- Rolled into March 1998.
- <P>Fixed problem with sending 0's and -1's as data. The old 0 ? true : false
- thing. Use defined to fix.</P>
- <P>Provided a fix for AUTOLOAD/DESTROY/Carp bug. Very strange.</P>
- <P></P>
- <DT><STRONG><A NAME="item_v2%2E1">v2.1</A></STRONG><BR>
- <DD>
- August 1997
- <P>Fixed errors in statistics algorithms caused by changing the
- interface.</P>
- <P></P>
- <DT><STRONG><A NAME="item_v2%2E0">v2.0</A></STRONG><BR>
- <DD>
- August 1997
- <P>Fixed errors in removing cached values (they weren't being removed!)
- and added sort_data and presorted methods.</P>
- <P>June 1997</P>
- <P>Transferred ownership of the module from Jason to Colin.</P>
- <P>Rewrote OO interface, modified function distribution, added mindex,
- maxdex.</P>
- <P></P>
- <DT><STRONG><A NAME="item_v1%2E1">v1.1</A></STRONG><BR>
- <DD>
- April 1995
- <P>Added LeastSquaresFit and FrequencyDistribution.</P>
- <P></P>
- <DT><STRONG><A NAME="item_v1%2E0">v1.0</A></STRONG><BR>
- <DD>
- March 1995
- <P>Released to comp.lang.perl and placed on archive sites.</P>
- <P></P>
- <DT><STRONG><A NAME="item_v%2E20">v.20</A></STRONG><BR>
- <DD>
- December 1994
- <P>Complete rewrite after extensive and invaluable e-mail
- correspondence with Anno Siegel.</P>
- <P></P>
- <DT><STRONG><A NAME="item_v%2E10">v.10</A></STRONG><BR>
- <DD>
- December 1994
- <P>Initial concept, released to perl5-porters list.</P>
- </DL>
-
- </BODY>
-
- </HTML>
-