4 Testing for tissue heterogeneity

4.1 Tissue signatures

In section 2, we identified a set of 9 reference signatures (table 4.1) which unambiguously identify their corresponding tissue across platforms and species. In addition to that, we use 120 tissue signatures from the BioQC publication, which we refer to as query signatures.

Table 4.1: reference signatures
Reference Signature Tissue NA
GTEX_Blood blood/immune blood
GTEX_Blood blood/immune white blood cells
GTEX_Blood blood/immune pbmc
GTEX_Brain brain brain
GTEX_Brain brain cerebellum
GTEX_Brain brain cortex
GTEX_Brain brain frontal cortex
GTEX_Brain brain prefrontal cortex
GTEX_Brain brain hippocampus
GTEX_Brain brain hypothalamus
GTEX_Heart heart heart
GTEX_Kidney kidney kidney
GTEX_Liver liver liver
GTEX_Liver liver hepatocyte
GTEX_Muscle_Skeletal skeletal muscle skeletal muscle
GTEX_Pancreas pancreas pancreas
GTEX_Skin skin skin
GTEX_Testis testis testis

4.2 Testing samples for heterogeneity

We tested for enrichment of 120 selected signatures provided by BioQC (query signatures) and the 9 reference signatures generated by us on all 76576 selected samples resulting in a list of 9878304 (sample, signature, pvalue) pairs.

Our intention is to identify samples that show tissue heterogeneity, i.e. unintentional profiling of cells of other origin than the target tissue of profiling. We classify samples into heterogeneous and not heterogeneous. We call a classification true-positive if the given sample is classified as heterogeneous and the sample indeed contains cells different from the annotated tissues. Analogous, we call a classification false-positive if the given sample is classified as heterogeneous but in reality only contains cells from the annotated tissue.

Naively, we could label a sample as heterogeneous, if a signature unrelated to the annotated tissue exceeds a certain score. The problem with this approach is, that some signatures overlap; the resulting scores are therefore correlated and will lead to false-positives. One cannot simply solve this problem by excluding genes that are members of multiple signatures, as it is easily possible to build two (in fact many) distinct, non-overlapping signatures matching the same tissue, due to gene-gene correlation.

In section 2 we have created and validated reference signatures for 9 tissues. Even though we have demonstrated that each signature unambiguously identifies its corresponding tissue (i.e. scores highest), the signatures could still be correlated. Some of them in fact are, e.g. cardiac muscle and skeletal muscle (see figure 4.1). Moreover, we lack sufficient data to perform an independent-sample validation on the signatures provided by BioQC.
Therefore, to avoid false-positives, for each tissue, we exclude all signature that are positively correlated with the reference signature. This approach is more formally described in the following:

A given sample \(s\) annotated as tissue \(t\) is tested for enrichment with signature \(k_{\text{query}}\) resulting in a p-value \(p_{\text{query}}\). Let \(k_{\text{ref}}\) be the reference signature associated with tissue \(t\) and \(p_{\text{ref}}\) the p-value of testing \(s\) for enrichment of \(k_{\text{ref}}\). Let \(\tau\) be a certain false discovery rate (FDR)-threshold (0.01 in this study).

  1. If the Benjamini-Hochberg (BH)-adjusted \(p_{\text{query}} \ge \tau\), we assume that \(s\) is not heterogeneous; else continue.
  2. We fit a robust linear model using rlm from the R MASS package of \(|log10(p_{\text{query}})|\) against \(|log10(p_{\text{ref}})|\) for all samples annotated as \(t\).
  3. If the slope of the linear model is \(\ge 0.01\), we exclude the pair of signatures from the results. If the slope is \(< 0.01\) and the FDR-adjusted \(p_{\text{query}} < \tau\), we consider the sample as heterogeneous. Tissue pairs for which signatures are excluded are marked as such in the results.
  4. We define heterogeneity as severe, if additionally \(p_{\text{ref}} \ge\) 0.05.
Examples of signature correlation. Panels A-B: scatterplot of the signature scores (y-axis) against the scores of a reference signature (x-axis). The black line indicates the model fitted to the data. Points are colored according to the called heterogeneity status. (A) Skeletal muscle scores of kidney samples against scores of the kidney signature. The samples are not correlated, however some outliers are detected which are samples likely containing muscle cells. (B) Skeletal muscle scores of cardiac muscle samples against skeletal muscle scores. The scores are highly correlated. While most of the points exceed the FDR threshold, they will not be classified as heterogeneous since the signatures are correlated. Panels C and D show the boxplots of the scores of various signatures on kidney and heart samples, respectively.

Figure 4.1: Examples of signature correlation. Panels A-B: scatterplot of the signature scores (y-axis) against the scores of a reference signature (x-axis). The black line indicates the model fitted to the data. Points are colored according to the called heterogeneity status. (A) Skeletal muscle scores of kidney samples against scores of the kidney signature. The samples are not correlated, however some outliers are detected which are samples likely containing muscle cells. (B) Skeletal muscle scores of cardiac muscle samples against skeletal muscle scores. The scores are highly correlated. While most of the points exceed the FDR threshold, they will not be classified as heterogeneous since the signatures are correlated. Panels C and D show the boxplots of the scores of various signatures on kidney and heart samples, respectively.