Reuse of public genome-wide gene expression data
Key Points Over the past decade, high-throughput gene expression experiments have generated data from millions of assays. Data sets linked to publications are stored in functional genomics data archives: ArrayExpress at the European Bioinformatics Institute, Gene Expression Omnibus at the US Nationa...
Saved in:
| Published in: | Nature reviews. Genetics Vol. 14; no. 2; pp. 89 - 99 |
|---|---|
| Main Authors: | , |
| Format: | Journal Article |
| Language: | English |
| Published: |
London
Nature Publishing Group UK
01.02.2013
Nature Publishing Group |
| Subjects: | |
| ISSN: | 1471-0056, 1471-0064, 1471-0064 |
| Online Access: | Get full text |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
| Summary: | Key Points
Over the past decade, high-throughput gene expression experiments have generated data from millions of assays. Data sets linked to publications are stored in functional genomics data archives: ArrayExpress at the European Bioinformatics Institute, Gene Expression Omnibus at the US National Center for Biotechnology Information and at the DNA Databank of Japan Omics Archive.
Secondary added-value and topical databases process data from the primary archives, adding analysis and annotation to make these data accessible to every biologist by allowing queries such as 'in which tissue is a particular gene expressed?' or 'which genes are differentially expressed between a particular disease and normal samples?'
Public gene expression data are commonly reused to study biological questions, both by reanalysis of primary data and by queries to secondary resources. Approximately half of the studies that use public gene expression data rely solely on existing data without adding newly generated data, and half of them use the public data in combination with new data.
The reproducibility of published microarray-based studies is limited, mostly owing to insufficient experiment annotation and sometimes to unavailability of the raw or processed data. A stricter enforcement of Minimum Information About a Microarray Experiment (MIAME) requirements and also development of easy-to-use experiment annotation tools are needed to achieve a better reproducibility.
Although most of the public gene expression data still are based on microarray experiments, the contribution of high-throughput-sequencing-based expression studies, known as RNA sequencing (RNA-seq), are growing rapidly.
Reuse of RNA-seq data can potentially be even more valuable than reuse of microarray data, partly owing to the costs of experiments and data storage but even more importantly because of a more quantitative nature of sequencing-based expression data. Community standards such as Minimum Information about Sequencing Experiments (MINSEQE) should be adopted to make RNA-seq data maximally reusable.
The bioinformatics resources that store and manage public data are sensitive to short-term funding changes, complicating the maintenance of important databases. The development of long-term infrastructure in bioinformatics, such as the ELIXIR project in Europe, is needed to ensure the long term availability of public data.
A wealth of microarray gene expression data and a growing volume of RNA sequencing data are now available in public databases. The authors look at how these data are being used and discuss considerations for how such data should be analysed and deposited and how data reuse could be improved.
Our understanding of gene expression has changed dramatically over the past decade, largely catalysed by technological developments. High-throughput experiments — microarrays and next-generation sequencing — have generated large amounts of genome-wide gene expression data that are collected in public archives. Added-value databases process, analyse and annotate these data further to make them accessible to every biologist. In this Review, we discuss the utility of the gene expression data that are in the public domain and how researchers are making use of these data. Reuse of public data can be very powerful, but there are many obstacles in data preparation and analysis and in the interpretation of the results. We will discuss these challenges and provide recommendations that we believe can improve the utility of such data. |
|---|---|
| Bibliography: | ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 14 ObjectType-Review-3 content type line 23 |
| ISSN: | 1471-0056 1471-0064 1471-0064 |
| DOI: | 10.1038/nrg3394 |