Class DatasetCombiner
- java.lang.Object
  - ubic.gemma.core.loader.expression.geo.DatasetCombiner
public class DatasetCombiner extends Object
Class to handle cases where there are multiple GEO datasets for a single actual experiment. This can occur in at least two ways:
- There is a single GSE (e.g., GSE674) but two datasets (GDS472, GDS473). This can happen when two different microarrays are used, such as the "A" and "B" HG-U133 Affymetrix arrays. (Each GDS can only refer to a single platform.)
- Rarely, there can be two series, as well as two data sets, for the situation described above. These are 'pathological' (due to incorrect data entry by a user, back in the day) and GEO folks should be removing them eventually.
One major problem is figuring out which samples (GSMs) correspond across the datasets. In the example of GSE674, there are samples like C6-U133A (in GDS472) and C6-133B (in GDS473), which apparently, but not "officially", correspond to the same biological RNA. The difficulty is that there is no foolproof way to determine which samples match up. We do the best we can by using the edit distance between the sample names. Ties can be a problem, but for now the samples are sorted and the first best match is the one kept, on the assumption that corresponding samples will have lower numbers. (That is, sample 12929 will match with 12945, not 12955, if the edit distance among the choices is the same.)
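The matching idea described above can be illustrated with a minimal sketch. This is not the actual DatasetCombiner implementation: the class name `SampleMatchSketch` and its methods are hypothetical, it uses plain Levenshtein distance, and it sorts sample names lexically as a stand-in for the sorting the class performs. Each name from the first dataset is paired with the unused name from the second dataset that has the smallest edit distance, and ties go to the candidate that sorts first.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical illustration only; not part of the Gemma API.
class SampleMatchSketch {

    /** Levenshtein edit distance between two strings (two-row dynamic programming). */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /**
     * Greedy matching. Names are processed in sorted order; each one takes the
     * closest still-unused candidate. Assumption: lexical sorting of names.
     */
    static Map<String, String> match(List<String> first, List<String> second) {
        List<String> candidates = new ArrayList<>(new TreeSet<>(second));
        Map<String, String> result = new LinkedHashMap<>();
        for (String name : new TreeSet<>(first)) {
            String best = null;
            int bestDistance = Integer.MAX_VALUE;
            for (String candidate : candidates) {
                int d = editDistance(name, candidate);
                if (d < bestDistance) { // strict '<' keeps the first of any tied candidates
                    bestDistance = d;
                    best = candidate;
                }
            }
            if (best != null) {
                result.put(name, best);
                candidates.remove(best); // each candidate is used at most once
            }
        }
        return result;
    }
}
```

With the GSE674 names from the text, `SampleMatchSketch.match` would pair C6-U133A with C6-133B, since deleting the "U" and changing the trailing "A" to "B" gives an edit distance of 2. A greedy pass like this is simple but order-dependent, which is one reason ties remain a weak point.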
Another problem is that there is no way to go from GDS-->GSE-->other GDS without scraping the GEO web site.
Author:
- pavlidis
Constructor Summary
Constructors:
- DatasetCombiner()
- DatasetCombiner(boolean doSampleMatching)
-
Method Summary
Methods (modifier and type, method, description):
- Collection<String> findGDSforGDS(String datasetAccession)
  Given a GEO dataset id, find all GDS ids that are associated with it.
- static Collection<String> findGDSforGSE(String seriesAccession)
- static Collection<String> findGDSforGSE(Collection<String> seriesAccessions)
  Given GEO series ids, find all associated data sets.
- GeoSampleCorrespondence findGSECorrespondence(Collection<GeoDataset> dataSets)
  Try to line up samples across datasets.
- GeoSampleCorrespondence findGSECorrespondence(GeoSeries series)
  Try to line up samples across datasets contained in a series.
- static Collection<String> findGSEforGDS(String datasetAccession)
  Given a GDS, find the corresponding GSEs (there can be more than one in rare cases).
- static Map<GeoPlatform,List<GeoSample>> getPlatformSampleMap(GeoSeries geoSeries)
-
-
-
Method Detail
-
findGDSforGSE
public static Collection<String> findGDSforGSE(Collection<String> seriesAccessions)
Given GEO series ids, find all associated data sets.
Parameters:
- seriesAccessions - accessions
Returns:
- a collection of associated GDS accessions. If no GDS is found, the collection will be empty.
-
findGDSforGSE
public static Collection<String> findGDSforGSE(String seriesAccession)
Parameters:
- seriesAccession - series accession
Returns:
- GDSs that correspond to the given series. It will be empty if there is no matching GDS.
-
findGSEforGDS
public static Collection<String> findGSEforGDS(String datasetAccession)
Given a GDS, find the corresponding GSEs (there can be more than one in rare cases).
Parameters:
- datasetAccession - dataset accession
Returns:
- Collection of series this data set is derived from (this is almost always just a single item).
-
getPlatformSampleMap
public static Map<GeoPlatform,List<GeoSample>> getPlatformSampleMap(GeoSeries geoSeries)
-
findGDSforGDS
public Collection<String> findGDSforGDS(String datasetAccession)
Given a GEO dataset id, find all GDS ids that are associated with it.
Parameters:
- datasetAccession - the geo accession
Returns:
- all GDS associated with the given accession
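Conceptually, this lookup follows the GDS-->GSE-->other GDS path mentioned in the class description. The sketch below is only an illustration under stated assumptions: `GdsLookupSketch` and `relatedDatasets` are hypothetical names, the two maps are in-memory stand-ins for what GEO reports (no web access is attempted), and whether the real method includes the queried accession in its result is not stated in this documentation.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; not part of the Gemma API.
class GdsLookupSketch {

    // Stand-in data based on the GSE674 example from the class description.
    static final Map<String, List<String>> GSE_TO_GDS = new HashMap<>();
    static final Map<String, List<String>> GDS_TO_GSE = new HashMap<>();

    static {
        GSE_TO_GDS.put("GSE674", Arrays.asList("GDS472", "GDS473"));
        GDS_TO_GSE.put("GDS472", Collections.singletonList("GSE674"));
        GDS_TO_GSE.put("GDS473", Collections.singletonList("GSE674"));
    }

    /** Walk from a GDS to its series, then collect every GDS of those series. */
    static Collection<String> relatedDatasets(String datasetAccession) {
        Set<String> result = new TreeSet<>();
        for (String gse : GDS_TO_GSE.getOrDefault(datasetAccession, Collections.emptyList())) {
            result.addAll(GSE_TO_GDS.getOrDefault(gse, Collections.emptyList()));
        }
        return result; // in this sketch the queried GDS itself is included
    }
}
```

In this sketch an unknown accession simply yields an empty collection, mirroring the empty-result behavior documented for findGDSforGSE.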
-
findGSECorrespondence
public GeoSampleCorrespondence findGSECorrespondence(Collection<GeoDataset> dataSets)
Try to line up samples across datasets.
Parameters:
- dataSets - datasets
Returns:
- sample correspondence
-
findGSECorrespondence
public GeoSampleCorrespondence findGSECorrespondence(GeoSeries series)
Try to line up samples across datasets contained in a series.
Parameters:
- series - geo series
Returns:
- geo sample correspondence
-
-