We downloaded publicly available gene-expression datasets generated on either of two microarray platforms: Affymetrix GeneChip Human Genome U133 Plus 2.0 (Affy HG-U133 Plus 2.0) and Affymetrix GeneChip Human Genome U133A 2.0 (Affy HG-U133A 2.0).
We used the ‘GEOparse’ Python library (https://github.com/guma44/GEOparse) to download the datasets.
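As an illustration of this download step, the following minimal sketch fetches one GEO series with GEOparse and keeps only the samples run on the two platforms. It is not the exact script used here: the helper names, the `geo_cache` directory, and the filtering on the `platform_id` metadata field are assumptions made for the example. GPL570 and GPL571 are the GEO platform accessions for Affy HG-U133 Plus 2.0 and Affy HG-U133A 2.0, respectively.

```python
import os

# GEO platform accessions for the two arrays named in the text.
TARGET_PLATFORMS = {
    "GPL570": "Affy HG-U133 Plus 2.0",
    "GPL571": "Affy HG-U133A 2.0",
}


def download_series(accession, destdir="geo_cache"):
    """Download one GEO series (GSE) with GEOparse and return the parsed object.

    `accession` is a series ID supplied by the caller; `destdir` is a
    hypothetical local cache directory chosen for this sketch.
    """
    import GEOparse  # imported lazily so the filter below works without it

    os.makedirs(destdir, exist_ok=True)
    return GEOparse.get_GEO(geo=accession, destdir=destdir, silent=True)


def samples_on_target_platforms(gse):
    """Return the samples of a parsed series that were run on a target array.

    Assumes each sample exposes a `metadata` dict whose `platform_id`
    entry is a list of GPL accessions, as GEOparse sample objects do.
    """
    kept = {}
    for name, gsm in gse.gsms.items():
        platform_ids = gsm.metadata.get("platform_id", [])
        if any(pid in TARGET_PLATFORMS for pid in platform_ids):
            kept[name] = gsm
    return kept
```

A caller would pass a series accession of interest to `download_series` and then apply `samples_on_target_platforms` to the result; series that mix platforms are thereby reduced to the relevant samples.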
Specifically, we selected 1,000 components for cancer types with more than 1,000 samples, 500 components for those with 500 to 1,000 samples, and 250 components for those with fewer than 500 samples. Our criteria ensure that the selected components account for ~80% of the variance in almost all cancer types and 90% for most (Supplementary Table 2).
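The sample-size rule above can be written as a small function. This is a sketch for illustration only; the function name is hypothetical, and treating the boundaries of the "500 to 1,000" range as inclusive is an assumption, since the text does not state how exactly 500 or exactly 1,000 samples are handled.

```python
def n_components_for(n_samples):
    """Map a cancer type's sample count to the number of components kept.

    Thresholds follow the text: more than 1,000 samples -> 1,000 components,
    500 to 1,000 samples -> 500, fewer than 500 -> 250. The inclusive
    handling of exactly 500 and exactly 1,000 is an assumption.
    """
    if n_samples < 0:
        raise ValueError("n_samples must be non-negative")
    if n_samples > 1000:
        return 1000
    if n_samples >= 500:
        return 500
    return 250


# Example with made-up sample counts (not taken from the study).
for n in (120, 500, 1000, 2400):
    print(n, n_components_for(n))
```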