3 Sample data and metadata

In this section, we describe how we obtained and curated gene expression data and sample metadata.

3.1 Standardize Tissue Names

The annotation of tissues is inconsistent within GEO and ARCHS4. A “liver” sample can be labeled, for example, “liver,” “liver biopsy,” or “primary liver.” We therefore need a way to standardize tissue names. To this end, we manually mapped the most abundant tissues to a controlled vocabulary.

Next, to find out which samples are subject to tissue heterogeneity, we first need to define which signatures we expect in a given tissue. For example, we map the signatures Intestine_Colon_cecum_NR_0.7_3 and Intestine_Colon_NR_0.7_3 to colon. Since the reference signatures are not as specific as the tissue annotation, we created tissue sets that group related tissues. For instance, it is hard to distinguish jejunum from colon, but easy to distinguish the two from other tissues. We therefore created a tissue set intestine, which contains both jejunum and colon and references all signatures associated with the two tissues. All of the described mappings are available from Supplementary Table 1.⁹
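The two-level mapping described above can be sketched as follows. This is a minimal illustration, not the mapping we used: the synonym table is a hypothetical excerpt, the jejunum signature name is a placeholder, and the function names are invented for this example. The actual mappings are those in Supplementary Table 1.

```python
# Hypothetical excerpt: free-text annotation -> controlled vocabulary.
TISSUE_SYNONYMS = {
    "liver": "liver",
    "liver biopsy": "liver",
    "primary liver": "liver",
    "colon": "colon",
    "jejunum": "jejunum",
}

# Controlled tissue name -> reference signatures expected in that tissue.
# The colon entries are taken from the text; the jejunum entry is a placeholder.
TISSUE_SIGNATURES = {
    "colon": {"Intestine_Colon_cecum_NR_0.7_3", "Intestine_Colon_NR_0.7_3"},
    "jejunum": {"PLACEHOLDER_jejunum_signature"},
}

# Tissue set -> member tissues that are hard to tell apart.
TISSUE_SETS = {"intestine": {"jejunum", "colon"}}


def standardize_tissue(raw_name):
    """Return the controlled tissue name, or None if the annotation is unmapped."""
    key = " ".join(raw_name.lower().split())  # lower-case, collapse whitespace
    return TISSUE_SYNONYMS.get(key)


def expected_signatures(tissue_set):
    """Collect all signatures referenced by the tissues of a tissue set."""
    signatures = set()
    for tissue in TISSUE_SETS[tissue_set]:
        signatures |= TISSUE_SIGNATURES.get(tissue, set())
    return signatures
```

With these tables, standardize_tissue("Liver biopsy") yields "liver", and expected_signatures("intestine") contains the signatures of both colon and jejunum.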

3.2 GEO

3.2.1 Downloading GEO data

We retrieved sample metadata for GEO using the GEOmetadb package (Zhu et al. 2008). We downloaded the studies with GEOquery (Davis and Meltzer 2007) and stored them as R ExpressionSet objects (Huber et al. 2015) using the R script geo_to_eset.R.¹⁰ We used the AnnotGPL=TRUE option of GEOquery’s getGEO function to obtain gene symbols for the studies, where available. Since the tissue signatures use human gene symbols, we added human orthologs for all mouse and rat samples.
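The ortholog step can be illustrated with a small sketch. It assumes a lookup table from mouse to human gene symbols (the three-entry table below is a hypothetical excerpt) and makes two choices that the text does not specify: genes without a known ortholog are dropped, and if several genes map to the same human symbol the highest value is kept.

```python
# Hypothetical excerpt of a mouse -> human ortholog table.
MOUSE_TO_HUMAN = {"Alb": "ALB", "Actb": "ACTB", "Gapdh": "GAPDH"}


def add_human_orthologs(expression, ortholog_table):
    """Re-key an expression profile by human gene symbol.

    expression: dict mapping a species-specific gene symbol to a value.
    Genes without an ortholog are dropped (assumption); on collisions
    the maximum value is kept (assumption).
    """
    human_profile = {}
    for gene, value in expression.items():
        human_gene = ortholog_table.get(gene)
        if human_gene is None:
            continue
        previous = human_profile.get(human_gene)
        if previous is None or value > previous:
            human_profile[human_gene] = value
    return human_profile
```

After this step, a mouse or rat profile can be matched against signatures defined on human gene symbols.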

3.2.2 Filtering GEO data

We filtered GEO samples by the following criteria (figure 3.1):

the tissue of origin is annotated,
gene symbols are annotated,
the readout was performed on a single-channel microarray, and
the tissue could be mapped to the controlled vocabulary (section 3.1).

We retained only samples from the three major organisms: human, mouse, and rat.
We removed studies that had been normalized per gene and in which ubiquitous housekeeping genes were not expressed.
Finally, we retained only samples originating from tissues for which a reference signature is available.

Figure 3.1: Summary of filtering steps on GEO samples

3.3 ARCHS4

In addition to GEO, we used data from ARCHS4 (Lachmann et al. 2018), a publicly available collection of annotated, consistently processed gene expression datasets based on RNA-sequencing. We downloaded gene expression and metadata as RData objects from the ARCHS4 website (version 8.0).¹¹

We filtered samples by the following criteria (figure 3.2):

The library is a transcriptomic cDNA library, the library strategy is RNA-seq, and either polyA or total RNA was extracted.
The sample is not a single-cell RNA-seq sample (none of the annotation fields may contain the keywords “single-cell,” “single cell,” or “smartseq”).
At least 500,000 reads could be mapped to genes.
The tissue could be mapped to the controlled vocabulary (section 3.1).
Finally, we only retained samples originating from tissues for which a reference signature is available.
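
The criteria above can be expressed as a single predicate per sample. The sketch below is illustrative only: the field names (library_source, library_selection, library_strategy, molecule, mapped_reads, tissue) and their expected values are assumptions about how the metadata might be represented, and the case-insensitive keyword match is our choice for this example.

```python
SINGLE_CELL_KEYWORDS = ("single-cell", "single cell", "smartseq")
MIN_MAPPED_READS = 500_000


def is_single_cell(sample):
    """True if any annotation field contains one of the single-cell keywords."""
    return any(
        keyword in str(value).lower()
        for value in sample.values()
        for keyword in SINGLE_CELL_KEYWORDS
    )


def passes_filters(sample, mappable_tissues, tissues_with_signature):
    """Apply the ARCHS4 criteria to one sample (a dict of metadata fields).

    mappable_tissues: lower-cased raw annotation -> controlled tissue name.
    tissues_with_signature: controlled names that have a reference signature.
    """
    tissue = mappable_tissues.get(str(sample.get("tissue", "")).lower())
    return (
        sample.get("library_source") == "transcriptomic"
        and sample.get("library_selection") == "cDNA"
        and sample.get("library_strategy") == "RNA-Seq"
        and sample.get("molecule") in {"polyA RNA", "total RNA"}
        and not is_single_cell(sample)
        and sample.get("mapped_reads", 0) >= MIN_MAPPED_READS
        and tissue is not None
        and tissue in tissues_with_signature
    )
```

Counting the samples that remain after each condition would give a filtering summary of the kind shown in figure 3.2.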

We normalized gene counts to TPM values before analysing them with BioQC.
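
For reference, the standard TPM (transcripts per million) calculation divides each gene count by the gene length and rescales the resulting rates so that they sum to one million. The sketch below shows this generic definition. Which gene lengths were used is not stated here, so the lengths_kb argument and the function name are assumptions made for this example.

```python
def counts_to_tpm(counts, lengths_kb):
    """Convert raw gene counts to TPM.

    counts: dict mapping gene -> read count.
    lengths_kb: dict mapping gene -> gene length in kilobases (assumed input).
    """
    # Length-normalized rate per gene (reads per kilobase).
    rates = {gene: counts[gene] / lengths_kb[gene] for gene in counts}
    total = sum(rates.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    # Rescale so that all values of a sample sum to one million.
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}
```

Because every sample sums to the same total, TPM values are comparable across samples with different sequencing depths.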

Figure 3.2: Summary of filtering steps on ARCHS4 samples

References

Davis, S., and P. S. Meltzer. 2007. “GEOquery: A Bridge Between the Gene Expression Omnibus (GEO) and BioConductor.” Bioinformatics 23 (14): 1846–47. https://doi.org/10.1093/bioinformatics/btm254.

Huber, Wolfgang, Vincent J Carey, Robert Gentleman, Simon Anders, Marc Carlson, Benilton S Carvalho, Hector Corrada Bravo, et al. 2015. “Orchestrating High-Throughput Genomic Analysis with Bioconductor.” Nature Methods 12 (2): 115–21. https://doi.org/10.1038/nmeth.3252.

Lachmann, Alexander, Denis Torre, Alexandra B. Keenan, Kathleen M. Jagodnik, Hoyjin J. Lee, Lily Wang, Moshe C. Silverstein, and Avi Ma’ayan. 2018. “Massive Mining of Publicly Available RNA-Seq Data from Human and Mouse.” Nature Communications 9 (1). https://doi.org/10.1038/s41467-018-03751-6.

Zhu, Yuelin, Sean Davis, Robert Stephens, Paul S. Meltzer, and Yidong Chen. 2008. “GEOmetadb: Powerful Alternative Search Engine for the Gene Expression Omnibus.” Bioinformatics 24 (23): 2798–2800. https://doi.org/10.1093/bioinformatics/btn520.