We have implemented a highly efficient Wilcoxon-Mann-Whitney rank sum test for high-throughput expression profiling data. For datasets with more than 100 features (genes), the function can be more than 1,000 times faster than its R implementations (wilcox.test in stats, or rankSumTestWithCorrelation in limma).

wmwTest(
  x,
  indexList,
  col = "GeneSymbol",
  valType = c("p.greater", "p.less", "p.two.sided", "U", "abs.log10p.greater",
    "log10p.less", "abs.log10p.two.sided", "Q"),
  simplify = TRUE
)

# S4 method for matrix,IndexList
wmwTest(x, indexList, valType = "p.greater", simplify = TRUE)

# S4 method for numeric,IndexList
wmwTest(x, indexList, valType = "p.greater", simplify = TRUE)

# S4 method for matrix,GmtList
wmwTest(x, indexList, valType = "p.greater", simplify = TRUE)

# S4 method for eSet,GmtList
wmwTest(
  x,
  indexList,
  col = "GeneSymbol",
  valType = "p.greater",
  simplify = TRUE
)

# S4 method for eSet,numeric
wmwTest(
  x,
  indexList,
  col = "GeneSymbol",
  valType = "p.greater",
  simplify = TRUE
)

# S4 method for eSet,logical
wmwTest(
  x,
  indexList,
  col = "GeneSymbol",
  valType = "p.greater",
  simplify = TRUE
)

# S4 method for eSet,list
wmwTest(
  x,
  indexList,
  col = "GeneSymbol",
  valType = "p.greater",
  simplify = TRUE
)

# S4 method for ANY,numeric
wmwTest(x, indexList, valType = "p.greater", simplify = TRUE)

# S4 method for ANY,logical
wmwTest(x, indexList, valType = "p.greater", simplify = TRUE)

# S4 method for ANY,list
wmwTest(x, indexList, valType = "p.greater", simplify = TRUE)

# S4 method for matrix,SignedIndexList
wmwTest(x, indexList, valType, simplify = TRUE)

# S4 method for matrix,SignedGenesets
wmwTest(x, indexList, valType, simplify = TRUE)

# S4 method for numeric,SignedIndexList
wmwTest(x, indexList, valType, simplify = TRUE)

# S4 method for eSet,SignedIndexList
wmwTest(x, indexList, valType, simplify = TRUE)

# S4 method for eSet,SignedGenesets
wmwTest(
  x,
  indexList,
  col = "GeneSymbol",
  valType = c("p.greater", "p.less", "p.two.sided", "U", "abs.log10p.greater",
    "log10p.less", "abs.log10p.two.sided", "Q"),
  simplify = TRUE
)

Arguments

x

A numeric matrix. All other data types (e.g. numeric vectors or ExpressionSet objects) are coerced into matrix.

indexList

A list of integer indices (starting from 1) indicating signature genes. Can be of length zero. Other data types (e.g. a list of numeric or logical vectors, or a numeric or logical vector) are coerced into such a list. See details below for a special case using GMT files.

col

a string sometimes used with a eSet

valType

The value type to be returned, allowed values include p.greater, p.less, abs.log10p.greater and abs.log10p.less (one-sided tests),p.two.sided, and U statistic, and their log10 transformation variants. See details below.

simplify

Logical. If not, the returning value is in matrix format; if set to TRUE, the results are simplified into vectors when possible (default).

Value

A numeric matrix or vector containing the statistic.

Details

The basic application of the function is to test the enrichment of gene sets in expression profiling data or differentially expressed data (the matrix with feature/gene in rows and samples in columns).

A special case is when x is an eSet object (e.g. ExpressionSet), and indexList is a list returned from readGmt function. In this case, the only requirement is that one column named GeneSymbol in the featureData contain gene symbols used in the GMT file. The same applies to signed Gmt files. See the example below.

Besides the conventional value types such as ‘p.greater’, ‘p.less’, ‘p.two.sided’ , and ‘U’ (the U-statistic), wmwTest (from version 0.99-1) provides further value types: abs.log10p.greater and log10p.less perform log10 transformation on respective p-values and give the transformed value a proper sign (positive for greater than, and negative for less than); abs.log10p.two.sided transforms two-sided p-values to non-negative values; and Q score reports absolute log10-transformation of p-value of the two-side variant, and gives a proper sign to it, depending on whether it is rather greater than (positive) or less than (negative).

From version 1.19.1, the rank-biserial correlation coefficient (‘r’) and the common language effect size (‘f’) are supported value types.

Methods (by class)

  • x = matrix,indexList = IndexList: x is a matrix and indexList is a IndexList

  • x = numeric,indexList = IndexList: x is a numeric and indexList is a IndexList

  • x = matrix,indexList = GmtList: x is a matrix and indexList is a GmtList

  • x = eSet,indexList = GmtList: x is a eSet and indexList is a GmtList

  • x = eSet,indexList = numeric: x is a eSet and indexList is a numeric

  • x = eSet,indexList = logical: x is a eSet and indexList is a logical

  • x = eSet,indexList = list: x is a eSet and indexList is a list

  • x = ANY,indexList = numeric: x is ANY and indexList is a numeric

  • x = ANY,indexList = logical: x is ANY and indexList is a logical

  • x = ANY,indexList = list: x is ANY and indexList is a list

  • x = matrix,indexList = SignedIndexList: x is a matrix and indexList is a SignedIndexList

  • x = matrix,indexList = SignedGenesets: x is a eSet and indexList is a SignedIndexList

  • x = numeric,indexList = SignedIndexList: x is a numeric and indexList is a SignedIndexList

  • x = eSet,indexList = SignedIndexList: x is a eSet and indexList is a SignedIndexList

  • x = eSet,indexList = SignedGenesets: x is a eSet and indexList is a SignedIndexList

Note

The function has been optimized for expression profiling data. It avoids repetitive ranking of data as done by native R implementations and uses efficient C code to increase the performance and control memory use. Simulation studies using expression profiles of 22000 genes in 2000 samples and 200 gene sets suggested that the C implementation can be >1000 times faster than the R implementation. And it is possible to further accelerate by parallel calling the function with mclapply in the multicore package.

References

Barry, W.T., Nobel, A.B., and Wright, F.A. (2008). A statistical framework for testing functional namespaces in microarray data. _Annals of Applied Statistics_ 2, 286-315.

Wu, D, and Smyth, GK (2012). Camera: a competitive gene set test accounting for inter-gene correlation. _Nucleic Acids Research_ 40(17):e133

Zar, JH (1999). _Biostatistical Analysis 4th Edition_. Prentice-Hall International, Upper Saddle River, New Jersey.

See also

codewilcox.test in the stats package, and rankSumTestWithCorrelation in the limma package.

Author

Jitao David Zhang <jitao_david.zhang@roche.com>

Examples

## R-native data structures set.seed(1887) rd <- rnorm(1000) rl <- sample(c(TRUE, FALSE), 1000, replace=TRUE) wmwTest(rd, rl, valType="p.two.sided")
#> [1] 0.8535404
wmwTest(rd, which(rl), valType="p.two.sided")
#> [1] 0.8535404
rd1 <- rd + ifelse(rl, 0.5, 0) wmwTest(rd1, rl, valType="p.greater")
#> [1] 1.949087e-14
wmwTest(rd1, rl, valType="U")
#> [1] 90452
rd2 <- rd - ifelse(rl, 0.2, 0) wmwTest(rd2, rl, valType="p.greater")
#> [1] 0.9976222
wmwTest(rd2, rl, valType="p.two.sided")
#> [1] 0.004758899
wmwTest(rd2, rl, valType="p.less")
#> [1] 0.002379449
wmwTest(rd2, rl, valType="r")
#> [1] 0.1031357
wmwTest(rd2, rl, valType="f")
#> [1] 0.5515679
## matrix forms rmat <- matrix(c(rd, rd1, rd2), ncol=3, byrow=FALSE) wmwTest(rmat, rl, valType="p.two.sided")
#> [1] 8.535404e-01 3.898175e-14 4.758899e-03
wmwTest(rmat, rl, valType="p.greater")
#> [1] 4.267702e-01 1.949087e-14 9.976222e-01
wmwTest(rmat, which(rl), valType="p.two.sided")
#> [1] 8.535404e-01 3.898175e-14 4.758899e-03
wmwTest(rmat, which(rl), valType="p.greater")
#> [1] 4.267702e-01 1.949087e-14 9.976222e-01
## other valTypes wmwTest(rmat, which(rl), valType="U")
#> [1] 124152 90452 137887
wmwTest(rmat, which(rl), valType="abs.log10p.greater")
#> [1] 0.369805934 13.710168669 0.001033906
wmwTest(rmat, which(rl), valType="log10p.less")
#> [1] -2.416062e-01 -8.437865e-15 -2.623524e+00
wmwTest(rmat, which(rl), valType="abs.log10p.two.sided")
#> [1] 0.06877594 13.40913867 2.32249352
wmwTest(rmat, which(rl), valType="Q")
#> [1] 0.06877594 13.40913867 -2.32249352
wmwTest(rmat, which(rl), valType="r")
#> [1] -0.006748243 -0.276357949 0.103135713
wmwTest(rmat, which(rl), valType="f")
#> [1] 0.4966259 0.3618210 0.5515679
## using ExpressionSet data(sample.ExpressionSet) testSet <- sample.ExpressionSet fData(testSet)$GeneSymbol <- paste("GENE_",1:nrow(testSet), sep="") mySig1 <- sample(c(TRUE, FALSE), nrow(testSet), prob=c(0.25, 0.75), replace=TRUE) wmwTest(testSet, which(mySig1), valType="p.greater")
#> A B C D E F G H #> 0.3968848 0.4750041 0.4090897 0.5238279 0.5757461 0.4790929 0.4343313 0.6157803 #> I J K L M N O P #> 0.6872272 0.5165239 0.4499611 0.5820556 0.4755881 0.4934181 0.4666891 0.4785086 #> Q R S T U V W X #> 0.6280457 0.2976157 0.3232340 0.6152199 0.6149396 0.3576255 0.3540736 0.4102298 #> Y Z #> 0.5086293 0.4288550
## using integer exprs(testSet)[,1L] <- exprs(testSet)[,1L] + ifelse(mySig1, 50, 0) wmwTest(testSet, which(mySig1), valType="p.greater")
#> A B C D E F #> 4.427725e-06 4.750041e-01 4.090897e-01 5.238279e-01 5.757461e-01 4.790929e-01 #> G H I J K L #> 4.343313e-01 6.157803e-01 6.872272e-01 5.165239e-01 4.499611e-01 5.820556e-01 #> M N O P Q R #> 4.755881e-01 4.934181e-01 4.666891e-01 4.785086e-01 6.280457e-01 2.976157e-01 #> S T U V W X #> 3.232340e-01 6.152199e-01 6.149396e-01 3.576255e-01 3.540736e-01 4.102298e-01 #> Y Z #> 5.086293e-01 4.288550e-01
## using lists mySig2 <- sample(c(TRUE, FALSE), nrow(testSet), prob=c(0.6, 0.4), replace=TRUE) wmwTest(testSet, list(first=mySig1, second=mySig2))
#> A B C D E F G #> first 4.427725e-06 0.4750041 0.4090897 0.5238279 0.5757461 0.4790929 0.4343313 #> second 3.322462e-01 0.5830970 0.4508703 0.4503638 0.4788279 0.2795935 0.5079108 #> H I J K L M N #> first 0.6157803 0.6872272 0.5165239 0.4499611 0.5820556 0.4755881 0.4934181 #> second 0.5554548 0.6009788 0.5637822 0.4024829 0.4129130 0.5557076 0.4844365 #> O P Q R S T U #> first 0.4666891 0.4785086 0.6280457 0.2976157 0.3232340 0.6152199 0.6149396 #> second 0.5477364 0.5394920 0.4508703 0.5673067 0.4577155 0.4478326 0.4821416 #> V W X Y Z #> first 0.3576255 0.3540736 0.4102298 0.5086293 0.4288550 #> second 0.3385415 0.6612245 0.5985069 0.6346372 0.3837895
## using GMT file gmt_file <- system.file("extdata/exp.tissuemark.affy.roche.symbols.gmt", package="BioQC") gmt_list <- readGmt(gmt_file) gss <- sample(unlist(sapply(gmt_list, function(x) x$genes)), 1000) eset<-new("ExpressionSet", exprs=matrix(rnorm(10000), nrow=1000L), phenoData=new("AnnotatedDataFrame", data.frame(Sample=LETTERS[1:10])), featureData=new("AnnotatedDataFrame",data.frame(GeneSymbol=gss))) esetWmwRes <- wmwTest(eset ,gmt_list, valType="p.greater") summary(esetWmwRes)
#> 1 2 3 4 #> Min. :0.02198 Min. :0.01316 Min. :0.005433 Min. :0.007333 #> 1st Qu.:0.28609 1st Qu.:0.24582 1st Qu.:0.194880 1st Qu.:0.233636 #> Median :0.55866 Median :0.51048 Median :0.460976 Median :0.598480 #> Mean :0.55896 Mean :0.50834 Mean :0.470506 Mean :0.547411 #> 3rd Qu.:0.79758 3rd Qu.:0.72584 3rd Qu.:0.706622 3rd Qu.:0.848243 #> Max. :1.00000 Max. :1.00000 Max. :1.000000 Max. :1.000000 #> 5 6 7 8 #> Min. :0.002327 Min. :0.01225 Min. :0.00879 Min. :0.0002731 #> 1st Qu.:0.262446 1st Qu.:0.27261 1st Qu.:0.23966 1st Qu.:0.2535412 #> Median :0.531550 Median :0.58203 Median :0.52827 Median :0.5691982 #> Mean :0.523777 Mean :0.55246 Mean :0.52780 Mean :0.5251984 #> 3rd Qu.:0.773279 3rd Qu.:0.79958 3rd Qu.:0.80507 3rd Qu.:0.7977362 #> Max. :1.000000 Max. :1.00000 Max. :1.00000 Max. :1.0000000 #> 9 10 #> Min. :0.001639 Min. :0.009709 #> 1st Qu.:0.265712 1st Qu.:0.273843 #> Median :0.503677 Median :0.517334 #> Mean :0.513945 Mean :0.531123 #> 3rd Qu.:0.738335 3rd Qu.:0.817494 #> Max. :1.000000 Max. :1.000000
## using signed GMT file signed_gmt_file <- system.file("extdata/test.gmt", package="BioQC") signed_gmt <- readSignedGmt(signed_gmt_file) esetSignedWmwRes <- wmwTest(eset, signed_gmt, valType="p.greater") esetMat <- exprs(eset); rownames(esetMat) <- fData(eset)$GeneSymbol esetSignedWmwRes2 <- wmwTest(esetMat, signed_gmt, valType="p.greater")