Submitted by: CodeVib
Category: Science and Technology
Date Submitted: 05/03/2016 07:51 PM
Interactive pipeline for retrieving, processing and
visualising vast public gene expression datasets
__________________________________________________________________________
Name: Vibhor Sharma
Education: Software Engineering Undergraduate, Wuhan University, China
Nationality: Indian
Email: vibhormagotra@gmail.com
Mobile: 008613247115103
Mentor: Mr. Jüri Reimand
Project Description
Summary:
This project aims to create a software package in R that retrieves datasets stored in public
databases such as ArrayExpress and GEO, where many of them lie unused. These datasets contain
differentially expressed genes which could provide valuable insights into cancer genomics and
cancer prediction.
Project Aims:
The project aims at:
● Retrieving large collections of gene expression data efficiently.
● Carrying out preprocessing and quality control on the datasets, which is of utmost
importance.
● Visualising the datasets through a web interface so that they can be browsed
interactively.
How the project will fulfill the aims:
● Datasets are available for different platforms and organisms, so each of these datasets
needs to be processed separately. To retrieve the datasets we could use packages such as
“GEOquery” to query the databases online and download the necessary data. We can use the
GEO DataSets ID (GDSxxx, where xxx represents some number) to download a dataset, or we
could download the simpler tab-delimited text files, such as GSE series matrix files, as an
ExpressionSet by using the respective GSExxxx IDs. The user could therefore be asked to
input the GSE IDs or GDS IDs to download the respective datasets. After getting the GSE
series matrix files we can then check whether the expression values are already on a log2
scale. If not then we must ...
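The two checks mentioned above, recognising whether a user-supplied ID is a GSE or GDS accession and deciding whether an expression matrix already looks log2-scaled, could be sketched roughly as follows. This is only an illustration in Python, not the proposed R package. The function names, the regular expression and the cutoff values (100 and 50) are assumptions chosen for the example, loosely modelled on the kind of quantile heuristic often used for GEO data, and no download step is shown.

```python
import math
import re

# Assumed accession shape: "GSE" or "GDS" followed by digits (illustrative only).
ACCESSION_RE = re.compile(r"^(GSE|GDS)(\d+)$")


def classify_accession(user_input):
    """Return "GSE" or "GDS" for a well-formed accession ID, else raise ValueError."""
    match = ACCESSION_RE.match(user_input.strip().upper())
    if match is None:
        raise ValueError(f"not a GSE or GDS accession: {user_input!r}")
    return match.group(1)


def quantile(sorted_vals, q):
    """Linear-interpolation quantile of an already sorted list (0 <= q <= 1)."""
    pos = q * (len(sorted_vals) - 1)
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def looks_log2_transformed(matrix, high_cutoff=100.0, range_cutoff=50.0):
    """Guess whether expression values are already on a log2 scale.

    matrix is a list of rows (probes) of numeric values (samples).
    The cutoffs are hypothetical defaults, not values from the proposal.
    """
    # Flatten the matrix and drop missing values before sorting.
    values = sorted(
        v for row in matrix for v in row
        if v is not None and not math.isnan(v)
    )
    if not values:
        raise ValueError("no numeric expression values")
    q25 = quantile(values, 0.25)
    q99 = quantile(values, 0.99)
    value_range = values[-1] - values[0]
    # Raw intensities tend to have very large upper quantiles or a wide range.
    raw_scale = q99 > high_cutoff or (value_range > range_cutoff and q25 > 0)
    return not raw_scale


if __name__ == "__main__":
    print(classify_accession("gse1234"))
    print(looks_log2_transformed([[120.0, 3500.0], [15000.0, 220.0]]))
    print(looks_log2_transformed([[6.9, 11.8], [13.9, 7.8]]))
```

In the R package itself, the same logic would apply to the expression matrix extracted from the downloaded ExpressionSet. The thresholds would need tuning per platform, since a heuristic like this can misclassify unusual datasets.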