Input Data Tutorial

This tutorial introduces the data files used in MAGNET, describes how they are formatted, and explains how to prepare them.
Extensive documentation (help pages), including sample outputs and the steps taken to analyze these datasets, is available as a PDF here.

RNA Sequencing Count

RNA Sequencing Count (RNAseq) is a rapidly adopted technique to measure the relative or absolute mRNA expression in a cell. This expression data can be used to identify gene-gene correlations, regulation and systems-level expression profiles.

In MAGNET, RNAseq data can be supplied in two ways: by uploading your own data matrix, or by selecting TCGA data to work with. User-uploaded data matrices must follow a specific format to be processed correctly. A simplified data matrix is shown below:

Samples_characteristic 1 Char 1 Char 2 Char 3 ...
Samples_characteristic 2 Char 1 Char 2 Char 3 ...
ID_REF Sample 1 Sample 2 Sample 3 ...
Gene 1 2.937436519 3.117006273 2.948637709 ...
...
Gene ... 2.864035491 2.95697186 3.011919207 ...


Each row corresponds to a different gene or sample characteristic, and each column corresponds to a specific tissue sample. The "ID_REF" row, which holds the sample names, separates the characteristic rows above it from the gene expression rows below it.
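The layout described above can be read with a short script. The following is a minimal sketch, not MAGNET's own parser: it assumes the matrix is tab-delimited (the delimiter is not stated for this file type), and the function name `parse_matrix` and the example values are hypothetical.

```python
import csv
import io

def parse_matrix(text, delimiter="\t"):
    """Split a matrix into characteristics, sample names, and expression rows.

    Assumption: fields are tab-delimited. Rows above "ID_REF" are treated as
    sample characteristics, rows below it as gene expression values.
    """
    characteristics = {}  # characteristic name -> one value per sample
    samples = None        # sample identifiers taken from the ID_REF row
    expression = {}       # gene name -> one float per sample
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        if not row or not row[0].strip():
            continue  # skip blank lines
        label, values = row[0], row[1:]
        if samples is None:
            if label == "ID_REF":
                samples = values
            else:
                characteristics[label] = values
        else:
            expression[label] = [float(v) for v in values]
    if samples is None:
        raise ValueError('no "ID_REF" row found')
    return characteristics, samples, expression

# Illustrative data only; names and values are made up for this example.
example = (
    "Tissue\ttumor\tnormal\ttumor\n"
    "ID_REF\tS1\tS2\tS3\n"
    "GeneA\t2.9\t3.1\t2.9\n"
    "GeneB\t2.8\t2.9\t3.0\n"
)
characteristics, samples, expression = parse_matrix(example)
```

If the matrix has no characteristic rows, "ID_REF" is simply the first row and `characteristics` comes back empty.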

Scenario 1: MAGNET Job from TCGA Data

A user can access RNAseq data straight from the MAGNET website:

  1. From the homepage, select the type of job you want to run.
  2. Using the dropdown box, select the type of cancer you want MAGNET to analyze.

MAGNET can then perform a variety of analyses on this data. Please see documentation for more information regarding MAGNET's services.

Scenario 2: MAGNET Job from User Provided Data Matrix

If you would like to analyze your own RNAseq data, the data must follow the format defined above. The steps below describe the requirements the file must meet:

MAGNET uses the file to obtain the actual expression levels of each gene, as well as characteristics from each sample. By modeling your data like the table above, you can ensure that MAGNET is able to collect all data accurately.

  1. Ensure that characteristics of the samples come first. If there are no characteristics, or if the undesired samples have already been removed, you can skip this section.
  2. The row beginning with "ID_REF", followed by the sample identifiers, separates the characteristics from the expression data. If there are no characteristics, this should be the first row in the uploaded matrix.
  3. Finally, all the expression data should follow in the matrix. Each row should start with the gene name, followed by the expression data for each sample.
  4. Save your file. It is now ready to be uploaded to MAGNET.
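The steps above can also be carried out programmatically. This is a hedged sketch that assumes a tab-delimited file; `write_matrix` is a hypothetical helper, not part of MAGNET.

```python
import csv
import io

def write_matrix(characteristics, samples, expression, delimiter="\t"):
    """Return matrix text: characteristic rows, the ID_REF row, then gene rows.

    Assumption: tab-delimited output. Every row must have one value per sample.
    """
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    # Step 1: characteristics come first (this dict may be empty).
    for name, values in characteristics.items():
        if len(values) != len(samples):
            raise ValueError(f"characteristic {name!r} does not match sample count")
        writer.writerow([name, *values])
    # Step 2: the ID_REF row separates characteristics from expression data.
    writer.writerow(["ID_REF", *samples])
    # Step 3: one row per gene, starting with the gene name.
    for gene, values in expression.items():
        if len(values) != len(samples):
            raise ValueError(f"gene {gene!r} does not match sample count")
        writer.writerow([gene, *values])
    return out.getvalue()

# Illustrative data only.
text = write_matrix(
    {"Tissue": ["tumor", "normal"]},
    ["S1", "S2"],
    {"GeneA": [2.9, 3.1]},
)
```

Writing `text` to a file (step 4) gives a matrix in the row order the tutorial describes.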

Microarray Gene Expression

Microarray Gene Expression is a high-throughput technique that researchers use to measure mRNA expression levels in given samples. It can measure the expression of tens of thousands of genes in a single experiment. This expression data can be used to identify gene-gene correlations, regulation, and systems-level expression profiles.

The most used repository for expression data is the Gene Expression Omnibus (GEO). Researchers are required to submit their microarray data to repositories such as GEO prior to publication, which gives the rest of the research community access to expression data generated in various labs. GEO stores submitted data in three files:

  1. Sample file
  2. Platform (GPL) file
  3. Series (GSE) file

The latter two files, the GSE and GPL files, are required for MAGNET job requests. This brief tutorial introduces those files and explains how to generate them if your data is not on GEO.

    Platform (GPL) file

    Researchers obtain expression levels using "chips," on which the experiment is performed. These chips are designed so that sequences of mRNA attach to different "spots" on the chip. Each of these spots is referred to as a probe. The probes are named arbitrarily, and the names carry very little meaning for researchers. The GPL file describes the correspondence between the probe names and the gene names, along with a great deal of additional information.

    A simplified platform file is shown below:

    Probe ID ... Gene Symbol
    1415671_at ... ATP6V0D1
    1420955_at ... APC
    ...
    AFFX-TRPNX-5_AT ... TRPNX-5


    Each row corresponds to a different probe, and each column corresponds to an attribute of that probe, such as Gene Symbol.
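As an illustration of how such a file maps probes to genes, here is a minimal sketch. It assumes a tab-delimited table whose header contains "ID" and "Gene Symbol" columns, and it skips lines starting with "#" or "!" on the assumption that they are annotations; the function name and example rows are hypothetical.

```python
import csv

def read_probe_to_gene(text, id_column="ID", symbol_column="Gene Symbol"):
    """Build a probe ID -> gene symbol mapping from platform table text.

    Assumptions: tab-delimited, one header row, and annotation lines that
    begin with "#" or "!". Probes without a gene symbol are left out.
    """
    lines = [
        line for line in text.splitlines()
        if line.strip() and not line.startswith(("#", "!"))
    ]
    reader = csv.DictReader(lines, delimiter="\t")
    fields = reader.fieldnames or []
    for column in (id_column, symbol_column):
        if column not in fields:
            raise ValueError(f"column {column!r} not found in header")
    mapping = {}
    for row in reader:
        symbol = (row.get(symbol_column) or "").strip()
        if symbol:
            mapping[row[id_column]] = symbol
    return mapping

# Illustrative table; the "Note" column stands in for the extra attributes.
example_gpl = (
    "#ID = probe identifier\n"
    "ID\tNote\tGene Symbol\n"
    "1415671_at\tn/a\tATP6V0D1\n"
    "1420955_at\tn/a\tAPC\n"
    "no_symbol_probe\tn/a\t\n"
)
probe_to_gene = read_probe_to_gene(example_gpl)
```

The column names are parameters because the header label for the probe column can differ between files.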

    Series (GSE) file

    The actual expression data is stored in the GSE file, or the Series file. An example of a simplified series file is shown below:

    Probe ID Sample 1 Sample 3 Sample 4 ...
    1415671_at 8.873223959 8.881001388 8.549749775 ...
    1420955_at 2.937436519 3.117006273 2.948637709 ...
    ...
    AFFX-TRPNX-5_AT 2.864035491 2.95697186 3.011919207 ...


    Each row corresponds to a different probe, and each column corresponds to the expression of that probe in a specific tissue sample. In this example, the GPL file tells us that probe ID 1420955_at corresponds to APC. With this information, we can use the GSE file to find that APC has an expression of 2.937 in sample 1. In MAGNET, if a gene has multiple probes, its expression for a given sample is taken to be the average of those probes' expression values for that sample.
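The averaging rule can be sketched as follows. This is an illustration under stated assumptions rather than MAGNET's implementation: it uses a plain arithmetic mean and drops probes that have no gene mapping. The function name, the second APC probe ID, and the numbers are hypothetical.

```python
from collections import defaultdict

def average_by_gene(probe_expression, probe_to_gene):
    """Collapse probe-level rows into gene-level rows by per-sample averaging.

    Assumptions: arithmetic mean, and probes missing from probe_to_gene
    are ignored.
    """
    grouped = defaultdict(list)  # gene -> list of probe rows
    for probe, values in probe_expression.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:
            grouped[gene].append(values)
    # zip(*rows) walks the rows column by column, i.e. sample by sample.
    return {
        gene: [sum(column) / len(column) for column in zip(*rows)]
        for gene, rows in grouped.items()
    }

# Two probes map to APC, so their values are averaged sample by sample.
probe_expression = {
    "1420955_at": [2.0, 4.0],
    "second_apc_probe": [4.0, 6.0],  # hypothetical probe ID
    "1415671_at": [8.0, 9.0],
}
probe_to_gene = {
    "1420955_at": "APC",
    "second_apc_probe": "APC",
    "1415671_at": "ATP6V0D1",
}
gene_expression = average_by_gene(probe_expression, probe_to_gene)
```

Here APC ends up with (2.0 + 4.0) / 2 in the first sample and (4.0 + 6.0) / 2 in the second, while ATP6V0D1 keeps the values of its single probe.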

    Reference:
    Barrett T, Troup DB, Wilhite SE, Ledoux P, Rudnev D, Evangelista C, Kim IF, Soboleva A, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, Muertter RN, Edgar R. Nucleic Acids Research 2008. Full text.

    Scenario 1: MAGNET Job from GEO Data

    A user can access the abundance of publicly available expression data on GEO, download the relevant files, and use MAGNET to analyze this data. The data can be obtained from GEO as follows:

    1. Visit the GEO website: http://www.ncbi.nlm.nih.gov/geo/.
    2. Click the "SEARCH" button in the upper left corner.
    3. From the "Search" drop-down list, select "GEO DataSets" and type in a keyword associated with your desired expression data. For example, to look for experiments on colorectal cancer, type "colorectal cancer" into the text box and hit "Go."
    4. Find an acceptable GSE file (you can filter results). Click on the hypertext, "GSExxxxx record."
    5. Download the Series file by scrolling to the bottom of the page and clicking on the hypertext, "Series Matrix File(s)," and downloading the associated zipped file. This can be unzipped using decompression software available for your operating system.
    6. Go back to the GSE page, and click on the hypertext to access the Platform file. The hypertext is labelled: "GPLyyyyy."
    7. Download the GPL file by clicking "View Table" or "Download Table." Then save the page as an (ANSI) txt file.
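Once the Series Matrix file is downloaded, it can be inspected without unpacking it by hand. The sketch below assumes the download is a gzip-compressed, tab-delimited text file in which lines starting with "!" are metadata and the data table begins with an "ID_REF" header row; `read_series_matrix` is a hypothetical helper, and values are kept as strings because missing-value conventions are not covered here.

```python
import csv
import gzip

def read_series_matrix(path):
    """Read sample IDs and probe rows from a (possibly gzipped) series matrix.

    Assumptions: lines beginning with "!" are metadata, and the first
    remaining row is the header starting with "ID_REF".
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8", errors="replace", newline="") as handle:
        # Drop metadata and blank lines before handing the rest to csv.
        lines = [
            line for line in handle
            if line.strip() and not line.startswith("!")
        ]
    rows = list(csv.reader(lines, delimiter="\t"))
    if not rows or rows[0][0] != "ID_REF":
        raise ValueError('expected a header row starting with "ID_REF"')
    samples = rows[0][1:]
    expression = {row[0]: row[1:] for row in rows[1:]}  # values stay as strings
    return samples, expression
```

A call such as `read_series_matrix("GSExxxxx_series_matrix.txt.gz")`, with the placeholder replaced by the name of the file you downloaded, returns the sample identifiers and a dictionary of probe rows.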

    MAGNET can then perform a variety of analyses on this data. Please see documentation for more information regarding MAGNET's services. Please also see Example Files for examples of GPL data.

    Scenario 2: MAGNET Job from User Provided Expression Data

    If you would like to analyze your own expression data, the data must be formatted in the GEO-defined format. The following tutorial describes the aspects of the GEO format that a file must follow:

    MAGNET uses the Series file to obtain the actual expression levels of each probe. A template is provided that contains the necessary data--the user must simply copy their data into the appropriate template, and submit it, as described below:

    Series (GSE) file

    1. Download the GSE template here, and open it in Excel or a similar program that edits tab-delimited matrices.
    2. Duplicate columns until the number of columns is one more than the number of samples in your expression data.
    3. Copy your expression data into the spreadsheet, under the row beginning with "ID_REF." Each row should start with an ID, and then the expression data for each sample.
    4. Copy sample titles into the "!Sample_title" row.
    5. Copy sample identifiers into the "ID_REF" row. These may be the same as the sample titles.
    6. Save your file. It is now ready to be uploaded to MAGNET.
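If you prefer to generate the rows from a script instead of a spreadsheet, the rows named in the steps above can be produced as follows. This is only a sketch: it writes the "!Sample_title" row, the "ID_REF" row, and the data rows, and it does not reproduce any other content the downloadable template may contain. `build_series_text` is a hypothetical helper.

```python
import csv
import io

def build_series_text(sample_titles, sample_ids, expression):
    """Return tab-delimited text with the rows named in the steps above.

    Assumption: the real template may hold additional rows that are not
    generated here; paste this output into the template as needed.
    """
    if len(sample_titles) != len(sample_ids):
        raise ValueError("sample titles and sample identifiers differ in length")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["!Sample_title", *sample_titles])  # step 4
    writer.writerow(["ID_REF", *sample_ids])            # step 5
    for probe_id, values in expression.items():         # step 3
        if len(values) != len(sample_ids):
            raise ValueError(f"row {probe_id!r} does not match sample count")
        writer.writerow([probe_id, *values])
    return out.getvalue()

# Illustrative data only.
series_text = build_series_text(
    ["Tumor 1", "Normal 1"],
    ["S1", "S2"],
    {"1420955_at": [2.9, 3.1]},
)
```

Each output row has one more column than the number of samples, which matches step 2.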

    Platform (GPL) file

    MAGNET uses the platform data to map probes to genes. As such, the only two columns required in a GPL file are the "ID_REF" column, and the "Gene Symbol" column. A template is provided that contains the necessary data--the user must simply copy their data into the appropriate template, and submit it, as described below:

    Please note that this is often unnecessary, as popular chips already have their GPL files uploaded to GEO. If this is the case, simply search GEO for your platform, download it, and submit it to MAGNET.

    1. Download the GPL template here, and open it in Excel or a similar program that edits tab-delimited matrices.
    2. Copy the probe IDs into the "ID" column and the gene symbols into the "Gene Symbol" column.
    3. Save your file. It is now ready to be uploaded to MAGNET.
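A two-column platform table can likewise be generated and checked against your series data before upload. The sketch below assumes a tab-delimited file with the "ID" and "Gene Symbol" headers named in step 2; both helper functions are hypothetical, and the check for unmapped probes is a convenience, not a documented MAGNET requirement.

```python
import csv
import io

def build_gpl_text(probe_to_gene):
    """Return tab-delimited platform text with "ID" and "Gene Symbol" columns."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["ID", "Gene Symbol"])
    for probe_id, symbol in probe_to_gene.items():
        writer.writerow([probe_id, symbol])
    return out.getvalue()

def unmapped_probes(series_probe_ids, probe_to_gene):
    """List probe IDs used in the series data that have no gene symbol."""
    return sorted(set(series_probe_ids) - set(probe_to_gene))

# Illustrative data only; "extra_probe" is a made-up ID.
mapping = {"1415671_at": "ATP6V0D1", "1420955_at": "APC"}
gpl_text = build_gpl_text(mapping)
missing = unmapped_probes(["1415671_at", "1420955_at", "extra_probe"], mapping)
```

A non-empty `missing` list points at probes whose expression rows could not be assigned to a gene through this platform file.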

    Popular GPL files cached on the MAGNET server can be found here.

Copyright 2011-2015 Case Western Reserve University Center for Proteomics and Bioinformatics. All Rights Reserved.
216-368-1490 | proteomics@case.edu