ubic.basecode.math.DescriptiveWithMissing

public class DescriptiveWithMissing extends Object

Mathematical functions for statistics that allow missing values without scotching the calculations.

Some functions that come with DoubleArrayLists will not work in an entirely compatible way with missing values. For example, size() reports the total number of elements, including missing values. To get a count of non-missing values, use this.sizeWithoutMissingValues(). The right one to use may vary.

Not all methods need to be overridden. However, all methods that take a "size" parameter should be passed the results of sizeWithoutMissingValues(data), instead of data.size().
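The difference between the total size and the non-missing size can be illustrated with a minimal sketch. It assumes that missing values are encoded as Double.NaN (as the method details below state) and uses a plain double[] in place of DoubleArrayList; the class and method names are hypothetical stand-ins, not the library's code.

```java
// Sketch only: a plain double[] stands in for DoubleArrayList.
final class NonMissingCountSketch {

    /** Counts the entries that are not NaN (hypothetical stand-in for sizeWithoutMissingValues). */
    static int countNonMissing(double[] data) {
        int n = 0;
        for (double v : data) {
            if (!Double.isNaN(v)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        double[] data = {1.0, Double.NaN, 3.0, Double.NaN};
        // The raw length counts missing values; the helper does not.
        System.out.println("total size  = " + data.length);
        System.out.println("non-missing = " + countNonMissing(data));
    }
}
```

Any statistic that divides by a count should use the second number, which is the point of the advice above about "size" parameters.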

Author:

Paul Pavlidis

Method Summary

Modifier and Type

Method

Description

static double

correlation(double[] x, double[] y, double[] selfSquaredX, double[] selfSquaredY, boolean[] nanStatusX, boolean[] nanStatusY)

Highly optimized version of the correlation computation, where as much information is precomputed as possible.

static double

correlation(DoubleArrayList x, DoubleArrayList y)

Calculate the Pearson correlation of two arrays.

static double

covariance(DoubleArrayList data1, DoubleArrayList data2)

Returns the SAMPLE covariance of two data sequences.

static double

durbinWatson(DoubleArrayList data)

Durbin-Watson computation.

static double

geometricMean(DoubleArrayList data)

Returns the geometric mean of a data sequence.

static double

kurtosis(DoubleArrayList data, double mean, double standardDeviation)

Returns the kurtosis (aka excess) of a data sequence, which is -3 + moment(data,4,mean) / standardDeviation⁴.

static double

mad(DoubleArrayList data)

Returns the median absolute deviation from the median.

static double

max(DoubleArrayList input)

static double

mean(DoubleArrayList data)

static double

mean(DoubleArrayList x, int effectiveSize)

Special mean calculation where we use the effective size as an input.

static double

meanAboveQuantile(DoubleArrayList data, double quantile)

Calculate the mean of the values above a particular quantile of an array.

static double

median(DoubleArrayList data)

Returns the median.

static double

min(DoubleArrayList input)

static double

moment(DoubleArrayList data, int k, double c)

Returns the moment of k-th order with constant c of a data sequence, which is Sum( (data[i]-c)^k ) / data.size().

static double

product(DoubleArrayList data)

Returns the product of a data sequence, which is Prod( data[i] ).

static double

quantile(DoubleArrayList data, double phi)

Returns the phi-quantile; that is, an element elem such that a fraction phi of the data elements is less than elem.

static double

quantileInverse(DoubleArrayList data, double element)

Returns how many percent of the elements contained in the receiver are <= element.

static DoubleArrayList

quantiles(DoubleArrayList data, DoubleArrayList percentages)

Returns the quantiles of the specified percentages.

static double

rankInterpolated(DoubleArrayList data, double element)

Returns the linearly interpolated number of elements in a list less than or equal to a given element.

static double

sampleKurtosis(DoubleArrayList data, double mean, double sampleVariance)

Returns the sample kurtosis (aka excess) of a data sequence.

static double

sampleSkew(DoubleArrayList data, double mean, double sampleVariance)

Returns the sample skew of a data sequence.

static double

sampleVariance(DoubleArrayList data, double mean)

Returns the sample variance of a data sequence.

static double

skew(DoubleArrayList data, double mean, double standardDeviation)

Returns the skew of a data sequence, which is moment(data,3,mean) / standardDeviation³.

static void

standardize(DoubleArrayList data)

Standardize.

static void

standardize(DoubleArrayList data, double mean, double standardDeviation)

Modifies a data sequence to be standardized.

static double

sum(DoubleArrayList data)

Returns the sum of a data sequence.

static double

sumOfInversions(DoubleArrayList data, int from, int to)

Returns the sum of inversions of a data sequence, which is Sum( 1.0 / data[i]).

static double

sumOfLogarithms(DoubleArrayList data, int from, int to)

Returns the sum of logarithms of a data sequence, which is Sum( Log(data[i]) ).

static double

sumOfPowerDeviations(DoubleArrayList data, int k, double c)

Returns Sum( (data[i]-c)^k ); optimized for common parameters like c == 0.0 and/or k == -2 .. 4.

static double

sumOfPowerDeviations(DoubleArrayList data, int k, double c, int from, int to)

Returns Sum( (data[i]-c)^k ) for all i = from .. to.

static double

sumOfPowers(DoubleArrayList data, int k)

Returns the sum of powers of a data sequence, which is Sum ( data[i]^k ).

static double

sumOfSquaredDeviations(DoubleArrayList data)

Compute the sum of the squared deviations from the mean of a data sequence.

static double

sumOfSquares(DoubleArrayList data)

Returns the sum of squares of a data sequence.

static double

trimmedMean(DoubleArrayList data, int left, int right)

Returns the trimmed mean of a data sequence.

static double

variance(DoubleArrayList data)

Provided for convenience!

static double

weightedMean(DoubleArrayList data, DoubleArrayList weights)

Returns the weighted mean of a data sequence.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Method Details
- correlation
  
  public static double correlation(double[] x, double[] y, double[] selfSquaredX, double[] selfSquaredY, boolean[] nanStatusX, boolean[] nanStatusY)
  
  Highly optimized version of the correlation computation, where as much information is precomputed as possible. Use of this method only makes sense if many comparisons with the inputs x and y are being performed.
  Implementation note: In correlation(DoubleArrayList x, DoubleArrayList y), profiling shows that calls to Double.isNaN consume half the CPU time. The precomputation of the element-by-element squared values is another obvious optimization. There is also no checking for matching lengths of the arrays.
  
  Parameters:
  
  x - double array containing values of x_i for each X.
  
  y - double array containing values of y_i for each Y.
  
  selfSquaredX - double array containing values of x_i^2 for each X.
  
  selfSquaredY - double array containing values of y_i^2 for each Y
  
  nanStatusX - boolean array containing value of Double.isNaN(double) for each X.
  
  nanStatusY - boolean array containing value of Double.isNaN(double) for each Y.
  
  Returns:
- correlation
  
  public static double correlation(DoubleArrayList x, DoubleArrayList y)
  
  Calculate the Pearson correlation of two arrays. Missing values (NaNs) are ignored.
  
  Parameters:
  
  x - DoubleArrayList
  
  y - DoubleArrayList
  
  Returns:
  
  double
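A minimal sketch of a Pearson correlation that ignores missing values is shown here. It assumes pairwise deletion (a pair is skipped when either value is NaN) and that all sums are taken over the retained pairs only; those choices, the double[] inputs, and the class name are assumptions for illustration, not a description of the library's internals.

```java
// Sketch only: pairwise-complete Pearson correlation on plain arrays.
final class PairwisePearsonSketch {

    static double correlation(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        int n = 0;
        double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
        for (int i = 0; i < x.length; i++) {
            // Skip the pair if either side is missing.
            if (Double.isNaN(x[i]) || Double.isNaN(y[i])) {
                continue;
            }
            n++;
            sx += x[i];
            sy += y[i];
            sxx += x[i] * x[i];
            syy += y[i] * y[i];
            sxy += x[i] * y[i];
        }
        if (n < 2) {
            return Double.NaN; // assumption: undefined with fewer than two pairs
        }
        double numerator = sxy - sx * sy / n;
        double denominator = Math.sqrt((sxx - sx * sx / n) * (syy - sy * sy / n));
        return numerator / denominator;
    }
}
```

The squared terms and the NaN checks inside the loop are the quantities that the precomputed overload accepts as selfSquaredX, selfSquaredY, nanStatusX and nanStatusY.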
- covariance
  
  public static double covariance(DoubleArrayList data1, DoubleArrayList data2)
  
  Returns the SAMPLE covariance of two data sequences. Pairs of values are only considered if both are not NaN. If there are no non-missing pairs, the covariance is zero.
  
  Parameters:
  
  data1 - the first vector
  
  data2 - the second vector
  
  Returns:
  
  double
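The pairing rule described above can be sketched as follows. The sketch assumes that the means are computed over the retained pairs and that the divisor is the number of retained pairs minus one; the behaviour for exactly one pair is not stated in this page, so returning NaN there is an assumption. Names and the double[] inputs are hypothetical.

```java
// Sketch only: sample covariance over pairs where both values are present.
final class PairwiseCovarianceSketch {

    static double sampleCovariance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        int n = 0;
        double sumA = 0, sumB = 0;
        for (int i = 0; i < a.length; i++) {
            if (Double.isNaN(a[i]) || Double.isNaN(b[i])) {
                continue;
            }
            n++;
            sumA += a[i];
            sumB += b[i];
        }
        if (n == 0) {
            return 0.0; // documented behaviour: no non-missing pairs gives zero
        }
        if (n == 1) {
            return Double.NaN; // assumption: undefined for a single pair
        }
        double meanA = sumA / n;
        double meanB = sumB / n;
        double acc = 0;
        for (int i = 0; i < a.length; i++) {
            if (Double.isNaN(a[i]) || Double.isNaN(b[i])) {
                continue;
            }
            acc += (a[i] - meanA) * (b[i] - meanB);
        }
        return acc / (n - 1);
    }
}
```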
- durbinWatson
  
  public static double durbinWatson(DoubleArrayList data)
  
  Durbin-Watson computation. This measures the serial correlation in a data series.
  
  Parameters:
  
  data - DoubleArrayList
  
  Returns:
  
  double
- geometricMean
  
  public static double geometricMean(DoubleArrayList data)
  
  Returns the geometric mean of a data sequence. Missing values are ignored. Note that for a geometric mean to be meaningful, the minimum of the data sequence must not be less than or equal to zero.
  The geometric mean is given by pow( Product( data[i] ), 1/data.size()). This method tries to avoid overflows at the expense of an equivalent but somewhat slower definition: geo = Math.exp( Sum( Log(data[i]) ) / data.size()).
  
  Parameters:
  
  data - DoubleArrayList
  
  Returns:
  
  double
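The log-sum definition above can be sketched as follows, assuming that the divisor is the number of non-missing values rather than the raw size. The class name and the double[] input are hypothetical.

```java
// Sketch only: geometric mean via exp(mean(log(x))), skipping NaN.
final class GeometricMeanSketch {

    static double geometricMean(double[] data) {
        int n = 0;
        double sumOfLogs = 0;
        for (double v : data) {
            if (Double.isNaN(v)) {
                continue; // missing values are ignored
            }
            sumOfLogs += Math.log(v);
            n++;
        }
        // With no usable values this is exp(0.0 / 0), which is NaN.
        return Math.exp(sumOfLogs / n);
    }
}
```

Summing logarithms avoids the overflow that multiplying many large values directly could cause.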
- kurtosis
  
  public static double kurtosis(DoubleArrayList data, double mean, double standardDeviation)
  
  Returns the kurtosis (aka excess) of a data sequence, which is -3 + moment(data,4,mean) / standardDeviation⁴.
  
  Parameters:
  
  data - DoubleArrayList
  
  mean - double
  
  standardDeviation - double
  
  Returns:
  
  double
- mad
  
  public static double mad(DoubleArrayList data)
  
  Returns the median absolute deviation from the median.
  
  Parameters:
  
  data - the data, does not have to be sorted
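As an illustration, the median absolute deviation can be sketched as the median of |x - median(x)|. The sketch assumes that NaN values are dropped first and that no scaling constant is applied; neither is stated for this method, and the names are hypothetical.

```java
import java.util.Arrays;

// Sketch only: unscaled median absolute deviation, skipping NaN.
final class MadSketch {

    /** Median of the non-missing values; NaN if there are none. */
    static double median(double[] data) {
        double[] kept = Arrays.stream(data).filter(v -> !Double.isNaN(v)).sorted().toArray();
        int n = kept.length;
        if (n == 0) {
            return Double.NaN;
        }
        return (n % 2 == 1) ? kept[n / 2] : 0.5 * (kept[n / 2 - 1] + kept[n / 2]);
    }

    /** Median of the absolute deviations from the median. */
    static double mad(double[] data) {
        double med = median(data);
        double[] deviations = Arrays.stream(data)
                .filter(v -> !Double.isNaN(v))
                .map(v -> Math.abs(v - med))
                .toArray();
        return median(deviations);
    }
}
```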
- max
  
  public static double max(DoubleArrayList input)
- mean
  
  public static double mean(DoubleArrayList data)
  
  Parameters:
  
  data - Values to be analyzed.
  
  Returns:
  
  Mean of the values in data. Missing values are ignored in the analysis.
- mean
  
  public static double mean(DoubleArrayList x, int effectiveSize)
  
  Special mean calculation where we use the effective size as an input.
  
  Parameters:
  
  x - The data
  
  effectiveSize - The effective size used for the mean calculation.
  
  Returns:
  
  double
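The relationship between the two mean overloads can be sketched as follows. It assumes that the effective-size variant divides the sum of the non-missing values by the supplied size, and that the one-argument variant uses the non-missing count as that size. The names and double[] inputs are hypothetical.

```java
// Sketch only: mean that skips NaN, with an optional caller-supplied divisor.
final class MeanSketch {

    static double sumNonMissing(double[] x) {
        double sum = 0;
        for (double v : x) {
            if (!Double.isNaN(v)) {
                sum += v;
            }
        }
        return sum;
    }

    static int countNonMissing(double[] x) {
        int n = 0;
        for (double v : x) {
            if (!Double.isNaN(v)) {
                n++;
            }
        }
        return n;
    }

    /** Mean over the non-missing values. */
    static double mean(double[] x) {
        return mean(x, countNonMissing(x));
    }

    /** Mean using a divisor chosen by the caller (assumed semantics of effectiveSize). */
    static double mean(double[] x, int effectiveSize) {
        return sumNonMissing(x) / effectiveSize;
    }
}
```

Passing the raw length as the effective size would bias the result toward zero whenever values are missing.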
- meanAboveQuantile
  
  public static double meanAboveQuantile(DoubleArrayList data, double quantile)
  
  Calculate the mean of the values above a particular quantile of an array.
  
  Parameters:
  
  data - Array for which we want to get the quantile.
  
  quantile - A value from 0 to 1
  
  Returns:
  
  double
- median
  
  public static double median(DoubleArrayList data)
  
  Returns the median. Missing values are ignored entirely.
  
  Parameters:
  
  data - the data sequence, does not have to be sorted.
  
  Returns:
  
  double
- min
  
  public static double min(DoubleArrayList input)
- moment
  
  public static double moment(DoubleArrayList data, int k, double c)
  
  Returns the moment of k-th order with constant c of a data sequence, which is Sum( (data[i]-c)^k ) / data.size().
  
  Parameters:
  
  data - DoubleArrayList
  
  k - int
  
  c - double
  
  Returns:
  
  double
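The moment formula can be sketched as follows. The sketch assumes that NaN entries are skipped and that the divisor is the non-missing count, in line with the class-level advice about "size" parameters; the formula above writes data.size(), so this is an assumption. Names are hypothetical.

```java
// Sketch only: k-th moment about c, skipping NaN.
final class MomentSketch {

    static double moment(double[] data, int k, double c) {
        int n = 0;
        double sum = 0;
        for (double v : data) {
            if (Double.isNaN(v)) {
                continue;
            }
            sum += Math.pow(v - c, k);
            n++;
        }
        return sum / n; // NaN when no values are present
    }

    /** Skew as defined in this page: third moment about the mean over sd cubed. */
    static double skew(double[] data, double mean, double standardDeviation) {
        return moment(data, 3, mean) / Math.pow(standardDeviation, 3);
    }
}
```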
- product
  
  public static double product(DoubleArrayList data)
  
  Returns the product of a data sequence, which is Prod( data[i] ). Missing values are ignored. In other words: data[0]*data[1]*...*data[data.size()-1]. Note that you may easily get numeric overflows.
  
  Parameters:
  
  data - DoubleArrayList
  
  Returns:
  
  double
- quantile
  
  public static double quantile(DoubleArrayList data, double phi)
  
  Returns the phi-quantile; that is, an element elem such that a fraction phi of the data elements is less than elem. Missing values are ignored. The quantile need not be contained in the data sequence; it can be a linear interpolation.
  
  Parameters:
  
  data - the data sequence, does not have to be sorted.
  
  phi - the percentage; must satisfy 0 <= phi <= 1.
  
  Returns:
  
  double
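One common way to obtain an interpolated quantile is sketched here. It assumes that the position in the sorted non-missing values is phi * (n - 1) and that values are linearly interpolated between the two neighbouring elements; the exact interpolation rule used by this class is not given in this page, so treat the scheme and the names as assumptions.

```java
import java.util.Arrays;

// Sketch only: interpolated quantile over the non-missing values.
final class QuantileSketch {

    static double quantile(double[] data, double phi) {
        if (!(phi >= 0.0 && phi <= 1.0)) {
            throw new IllegalArgumentException("phi must satisfy 0 <= phi <= 1");
        }
        // Drop missing values, then sort a copy so the input is left untouched.
        double[] sorted = Arrays.stream(data).filter(v -> !Double.isNaN(v)).sorted().toArray();
        int n = sorted.length;
        if (n == 0) {
            return Double.NaN;
        }
        double position = phi * (n - 1);
        int lower = (int) Math.floor(position);
        if (lower == n - 1) {
            return sorted[lower]; // phi == 1 lands on the largest element
        }
        double fraction = position - lower;
        return sorted[lower] + fraction * (sorted[lower + 1] - sorted[lower]);
    }
}
```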
- quantileInverse
  
  public static double quantileInverse(DoubleArrayList data, double element)
  
  Returns how many percent of the elements contained in the receiver are <= element. Does linear interpolation if the element is not contained but lies in between two contained elements. Missing values are ignored.
  
  Parameters:
  
  data - the list to be searched
  
  element - the element to search for.
  
  Returns:
  
  the percentage phi of elements <= element(0.0 <= phi <= 1.0).
- quantiles
  
  public static DoubleArrayList quantiles(DoubleArrayList data, DoubleArrayList percentages)
  
  Returns the quantiles of the specified percentages. The quantiles need not be contained in the data sequence; they can be linear interpolations.
  
  Parameters:
  
  data - the data sequence; does not have to be sorted
  
  percentages - the percentages for which quantiles are to be computed. Each percentage must be in the interval [0.0,1.0].
  
  Returns:
  
  the quantiles.
- rankInterpolated
  
  public static double rankInterpolated(DoubleArrayList data, double element)
  
  Returns the linearly interpolated number of elements in a list less than or equal to a given element. Missing values are ignored. The rank is the number of elements <= element. Ranks are of the form {0, 1, 2,..., sortedList.size()}. If no element is <= element, then the rank is zero. If the element lies in between two contained elements, then linear interpolation is used and a non-integer value is returned.
  
  Parameters:
  
  data - the list to be searched, does not have to be sorted
  
  element - the element to search for.
  
  Returns:
  
  the rank of the element.
- sampleKurtosis
  
  public static double sampleKurtosis(DoubleArrayList data, double mean, double sampleVariance)
  
  Returns the sample kurtosis (aka excess) of a data sequence.
  
  Parameters:
  
  data - DoubleArrayList
  
  mean - double
  
  sampleVariance - double
  
  Returns:
  
  double
- sampleSkew
  
  public static double sampleSkew(DoubleArrayList data, double mean, double sampleVariance)
  
  Returns the sample skew of a data sequence.
  
  Parameters:
  
  data - DoubleArrayList
  
  mean - double
  
  sampleVariance - double
  
  Returns:
  
  double
- sampleVariance
  
  public static double sampleVariance(DoubleArrayList data, double mean)
  
  Returns the sample variance of a data sequence. That is Sum ( (data[i]-mean)^2 ) / (data.size()-1).
  
  Parameters:
  
  data - DoubleArrayList
  
  mean - double
  
  Returns:
  
  double
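The sample variance formula can be sketched as follows, assuming that NaN entries are skipped and that the divisor is the non-missing count minus one rather than data.size() - 1. The names and the double[] input are hypothetical.

```java
// Sketch only: sample variance about a supplied mean, skipping NaN.
final class SampleVarianceSketch {

    static double sampleVariance(double[] data, double mean) {
        int n = 0;
        double sumOfSquaredDeviations = 0;
        for (double v : data) {
            if (Double.isNaN(v)) {
                continue;
            }
            double deviation = v - mean;
            sumOfSquaredDeviations += deviation * deviation;
            n++;
        }
        if (n < 2) {
            return Double.NaN; // assumption: undefined for fewer than two values
        }
        return sumOfSquaredDeviations / (n - 1);
    }
}
```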
- skew
  
  public static double skew(DoubleArrayList data, double mean, double standardDeviation)
  
  Returns the skew of a data sequence, which is moment(data,3,mean) / standardDeviation³.
  
  Parameters:
  
  data - DoubleArrayList
  
  mean - double
  
  standardDeviation - double
  
  Returns:
  
  double
- standardize
  
  public static void standardize(DoubleArrayList data)
  
  Standardize. Note that this does something slightly different than standardize in the superclass, because our sampleStandardDeviation does not use the correction of the superclass (which isn't really standard).
  
  Parameters:
  
  data - DoubleArrayList
- standardize
  
  public static void standardize(DoubleArrayList data, double mean, double standardDeviation)
  
  Modifies a data sequence to be standardized. Missing values are ignored. Changes each element data[i] as follows: data[i] = (data[i]-mean)/standardDeviation, unless the standard deviation is zero or very close to zero (indicating the data are constant), in which case the result is a vector of zeros (in effect just mean subtraction).
  
  Parameters:
  
  data - DoubleArrayList
  
  mean - mean of data
  
  standardDeviation - stdev of data. |stdev| < Constants.TINY is treated as zero.
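The in-place update described above can be sketched as follows. The threshold TINY below is a hypothetical placeholder; the actual value of Constants.TINY is not given in this page. NaN entries are left as NaN, which is an assumption about what "ignored" means here.

```java
// Sketch only: in-place standardization with a guard for near-zero spread.
final class StandardizeSketch {

    // Hypothetical threshold; the real Constants.TINY value is not stated here.
    static final double TINY = 1e-12;

    static void standardize(double[] data, double mean, double standardDeviation) {
        boolean constantData = Math.abs(standardDeviation) < TINY;
        for (int i = 0; i < data.length; i++) {
            if (Double.isNaN(data[i])) {
                continue; // missing values stay missing
            }
            double centered = data[i] - mean;
            // For constant data only the mean is subtracted, giving zeros.
            data[i] = constantData ? centered : centered / standardDeviation;
        }
    }
}
```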
- sum
  
  public static double sum(DoubleArrayList data)
  
  Returns the sum of a data sequence. That is Sum( data[i] ).
  
  Parameters:
  
  data - DoubleArrayList
  
  Returns:
  
  double
- sumOfInversions
  
  public static double sumOfInversions(DoubleArrayList data, int from, int to)
  
  Returns the sum of inversions of a data sequence, which is Sum( 1.0 / data[i]).
  
  Parameters:
  
  data - the data sequence.
  
  from - the index of the first data element (inclusive).
  
  to - the index of the last data element (inclusive).
  
  Returns:
  
  double
- sumOfLogarithms
  
  public static double sumOfLogarithms(DoubleArrayList data, int from, int to)
  
  Returns the sum of logarithms of a data sequence, which is Sum( Log(data[i]) ). Missing values are ignored.
  
  Parameters:
  
  data - the data sequence.
  
  from - the index of the first data element (inclusive).
  
  to - the index of the last data element (inclusive).
  
  Returns:
  
  double
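The inclusive index range can be illustrated with a short sketch, assuming natural logarithms and that NaN entries within the range are skipped. The class name and the double[] input are hypothetical.

```java
// Sketch only: sum of natural logs over data[from..to], both ends inclusive.
final class SumOfLogarithmsSketch {

    static double sumOfLogarithms(double[] data, int from, int to) {
        double sum = 0;
        for (int i = from; i <= to; i++) { // note: 'to' is included
            if (Double.isNaN(data[i])) {
                continue;
            }
            sum += Math.log(data[i]);
        }
        return sum;
    }
}
```

To cover a whole array, pass 0 and data.length - 1.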
- sumOfPowerDeviations
  
  public static double sumOfPowerDeviations(DoubleArrayList data, int k, double c)
  
  Returns Sum( (data[i]-c)^k ); optimized for common parameters like c == 0.0 and/or k == -2 .. 4.
  
  Parameters:
  
  data - DoubleArrayList
  
  k - int
  
  c - double
  
  Returns:
  
  double
- sumOfPowerDeviations
  
  public static double sumOfPowerDeviations(DoubleArrayList data, int k, double c, int from, int to)
  
  Returns Sum( (data[i]-c)^k ) for all i = from .. to; optimized for common parameters like c == 0.0 and/or k == -2 .. 5. Missing values are ignored.
  
  Parameters:
  
  data - DoubleArrayList
  
  k - int
  
  c - double
  
  from - int
  
  to - int
  
  Returns:
  
  double
- sumOfPowers
  
  public static double sumOfPowers(DoubleArrayList data, int k)
  
  Returns the sum of powers of a data sequence, which is Sum ( data[i]^k ).
  
  Parameters:
  
  data - DoubleArrayList
  
  k - int
  
  Returns:
  
  double
- sumOfSquaredDeviations
  
  public static double sumOfSquaredDeviations(DoubleArrayList data)
  
  Compute the sum of the squared deviations from the mean of a data sequence. Missing values are ignored.
  
  Parameters:
  
  data - DoubleArrayList
  
  Returns:
  
  double
- sumOfSquares
  
  public static double sumOfSquares(DoubleArrayList data)
  
  Returns the sum of squares of a data sequence. Skips missing values.
  
  Parameters:
  
  data - DoubleArrayList
  
  Returns:
  
  double
- trimmedMean
  
  public static double trimmedMean(DoubleArrayList data, int left, int right)
  
  Returns the trimmed mean of a data sequence. Missing values are completely ignored.
  
  Parameters:
  
  data - the data sequence
  
  left - int the number of leading elements to trim.
  
  right - int number of trailing elements to trim.
  
  Returns:
  
  double
- variance
  
  public static double variance(DoubleArrayList data)
  
  Provided for convenience!
  
  Parameters:
  
  data - DoubleArrayList
  
  Returns:
  
  double
- weightedMean
  
  public static double weightedMean(DoubleArrayList data, DoubleArrayList weights)
  
  Returns the weighted mean of a data sequence. That is Sum (data[i] * weights[i]) / Sum ( weights[i] ).
  
  Parameters:
  
  data - DoubleArrayList
  
  weights - DoubleArrayList
  
  Returns:
  
  double
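The weighted mean formula can be sketched as follows. How this method treats missing values is not stated in this page; the sketch assumes that an index is skipped when either the value or its weight is NaN. Names and double[] inputs are hypothetical.

```java
// Sketch only: weighted mean, skipping indices with a missing value or weight.
final class WeightedMeanSketch {

    static double weightedMean(double[] data, double[] weights) {
        if (data.length != weights.length) {
            throw new IllegalArgumentException("data and weights must have the same length");
        }
        double weightedSum = 0;
        double weightSum = 0;
        for (int i = 0; i < data.length; i++) {
            if (Double.isNaN(data[i]) || Double.isNaN(weights[i])) {
                continue; // assumption: drop the whole pair
            }
            weightedSum += data[i] * weights[i];
            weightSum += weights[i];
        }
        return weightedSum / weightSum; // NaN when nothing was usable
    }
}
```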
