2.2. MIDAS format
Section author: Thomas Cokelaer, 2013
MIDAS files are comma-separated value (CSV) files. The content is defined by a single header line, the first line of the file. In the header, columns can take two forms:
    XX:Specy
    XX:userword:Specy
where XX is a two-letter prefix that describes the column content (see the table below for valid codes) and Specy is the name of the column. The userword is optional (see later).
MIDAS files include the concept of cues, signals, and responses (Gaudet et al., 2005):
- cues are biological perturbations to a system (such as the addition of extracellular ligands),
- signals represent the activities of proteins or other biomolecules involved in transducing biological information (activation of an intracellular kinase, for example),
- responses, also called readouts, represent phenotypic changes such as proliferation, cell death or cytokine release.
The column headers in a MIDAS file may contain a second (userword) level of identification (e.g., headers for columns describing various cytokine treatments might begin with “TR:Cytokine”). When present, these secondary identifiers allow software (e.g., DataRail’s importer) to identify the dimensions of a new compendium automatically.
| Code | Description | Handled in CellNOptR |
Values are separated by commas; spaces or tabs may surround the commas. A complete file could look as follows:
    TR:mock:CellLine, TR:EGF, TR:TNFa, TR:PI3Ki, DA:Akt, DA:Hsp27, DV:Akt, DV:Hsp27
    1,1,0,0, 0,0, 0,0
    1,0,1,0, 0,0, 0,0
    1,1,0,0, 10,10, 0.82,0.7
    1,0,1,0, 10,10, 0.91,0.7
Let us explain the header:
- The first row is the header describing the content of each column.
- Commas separate all fields.
- Each field must start with one of the valid codes followed by a colon (e.g., TR: or DA:).
- Subfields are possible: TR:whatever:EGF
- Special fields such as CellLine, NOCYTO and NOINHIB are ignored.
- The number of DA and DV must be equal except if you use the special name DA:ALL (see later).
- Inhibitors are coded by adding the letter i after the name (e.g., TR:PI3Ki)
Open question: are there special cases of names that legitimately end with the letter i?
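The header rules listed above can be sketched in a few lines of Python. This is only an illustrative sketch, not the parser used by CellNOptR or DataRail: the names `Column`, `parse_header`, `is_inhibitor` and `da_dv_consistent` are hypothetical, and the inhibitor test is the naive "name ends with i" rule, which would misclassify any species whose real name ends with that letter.

```python
from typing import List, NamedTuple, Optional

# Special names the text says are ignored (assumption: matched exactly).
IGNORED_NAMES = {"CellLine", "NOCYTO", "NOINHIB"}


class Column(NamedTuple):
    code: str                # two-letter prefix, e.g. "TR", "DA", "DV"
    userword: Optional[str]  # optional middle level, e.g. "mock"
    name: str                # column (species) name, e.g. "EGF"


def parse_header(line: str) -> List[Column]:
    """Split a MIDAS header line into (code, userword, name) triples."""
    columns = []
    for field in line.split(","):
        parts = field.strip().split(":")
        if len(parts) == 2:
            code, name = parts
            userword = None
        elif len(parts) == 3:
            code, userword, name = parts
        else:
            raise ValueError(f"malformed header field: {field!r}")
        columns.append(Column(code, userword, name))
    return columns


def is_inhibitor(col: Column) -> bool:
    """Naive rule from the text: a treatment whose name ends with 'i'."""
    return (col.code == "TR"
            and col.name not in IGNORED_NAMES
            and col.name.endswith("i"))


def da_dv_consistent(cols: List[Column]) -> bool:
    """DA and DV counts must match unless the special DA:ALL is used."""
    da = [c for c in cols if c.code == "DA"]
    dv = [c for c in cols if c.code == "DV"]
    if any(c.name == "ALL" for c in da):
        return True
    return len(da) == len(dv)


header = ("TR:mock:CellLine, TR:EGF, TR:TNFa, TR:PI3Ki, "
          "DA:Akt, DA:Hsp27, DV:Akt, DV:Hsp27")
cols = parse_header(header)
treatments = [c.name for c in cols
              if c.code == "TR" and c.name not in IGNORED_NAMES]
inhibitors = [c.name for c in cols if is_inhibitor(c)]
```

With the example header, `treatments` holds the three TR names other than CellLine, and `inhibitors` holds only PI3Ki.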
Each data row has as many fields as the header. Fields may be empty, although none are in this example. Software should replace an empty field with a missing-value marker (e.g., NA in R) and handle it accordingly.
Each row represents a given treatment at a given time. Times are coded in the DA columns and measured values in the DV columns. In the first two rows, the time is 0; in the next two rows, the time is 10. The same treatments (the TR columns) appear at each time point.
In a MIDAS file, rows should be ordered by time, although some software can cope with unordered rows.
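As an illustration of the row rules, the example file can be read with Python's standard csv module. This is a minimal sketch under a few assumptions: empty fields become NaN (playing the role of NA in R), every value is numeric, and rows are sorted on the first DA column only. The helper name `to_float` is hypothetical.

```python
import csv
import io
import math

midas_text = """\
TR:mock:CellLine, TR:EGF, TR:TNFa, TR:PI3Ki, DA:Akt, DA:Hsp27, DV:Akt, DV:Hsp27
1,1,0,0, 0,0, 0,0
1,0,1,0, 0,0, 0,0
1,1,0,0, 10,10, 0.82,0.7
1,0,1,0, 10,10, 0.91,0.7
"""


def to_float(cell: str) -> float:
    """Convert a field to float; an empty field becomes NaN."""
    cell = cell.strip()  # spaces or tabs may surround the commas
    return float(cell) if cell else math.nan


reader = csv.reader(io.StringIO(midas_text))
header = [h.strip() for h in next(reader)]

rows = []
for raw in reader:
    if not raw:  # skip blank lines
        continue
    if len(raw) != len(header):
        raise ValueError(f"expected {len(header)} fields, got {len(raw)}")
    rows.append(dict(zip(header, (to_float(c) for c in raw))))

# Order rows by time using the first DA column (stable sort keeps the
# original order of treatments within one time point).
time_column = next(h for h in header if h.startswith("DA:"))
rows.sort(key=lambda r: r[time_column])
times = sorted({r[time_column] for r in rows})
```

Sorting would need extra care if a time field were empty, since NaN does not compare consistently with other numbers.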
2.2.2. Filename issue
A MIDAS file has a unique identifier (UID) composed of the following fields: (i) a two-letter data/file-type code (e.g., PD for Primary Data, MD for multiplex data), (ii) a three-letter creator code (typically initials), (iii) an identification number of arbitrary length that is unique across the entire system, and (iv) a free-text suffix that serves as a mnemonic to improve human readability. For example, the primary data discussed in the text might be tagged MD-LGA-11111-CytoInh17phFI-BLK.
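The four UID fields can be pulled apart with a regular expression. The sketch below is hypothetical (the names `Uid`, `UID_PATTERN` and `parse_uid` are not part of any MIDAS tool) and assumes that the fields are joined by dashes, that the identification number consists of digits only, and that the free-text suffix may itself contain dashes.

```python
import re
from typing import NamedTuple, Optional


class Uid(NamedTuple):
    file_type: str  # two-letter data/file-type code, e.g. "MD"
    creator: str    # three-letter creator code, e.g. "LGA"
    number: str     # identification number, kept as text
    suffix: str     # free-text mnemonic, may contain dashes


# Assumed layout: TYPE-CREATOR-NUMBER-SUFFIX
UID_PATTERN = re.compile(
    r"^(?P<file_type>[A-Z]{2})"
    r"-(?P<creator>[A-Za-z]{3})"
    r"-(?P<number>\d+)"
    r"-(?P<suffix>.+)$"
)


def parse_uid(uid: str) -> Optional[Uid]:
    """Return the UID fields, or None if the string does not match."""
    match = UID_PATTERN.match(uid)
    if match is None:
        return None
    return Uid(**match.groupdict())


example = parse_uid("MD-LGA-11111-CytoInh17phFI-BLK")
```

Here `example.suffix` keeps the trailing `-BLK`, because everything after the third dash is treated as free text.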
In practice, only a few files are named that way. One reason is that the UID tag is hardly used. Another inconsistency is that dashes are either omitted or replaced by underscores (_). In addition, many filenames contain the word Data. Finally, the creator code (e.g., LGA above) is not good practice, because public files should give the feeling that they belong to everybody. One point of consistency, however, is the .csv extension.
2.2.3. Proposal for filename convention
- Do not use the word DATA/Data. Instead, start all filenames with MD- and use the extension .csv.
- Separate the names that describe your data with dashes.
- Underscores can be used within a name to refine it.
- MD must be capitalised; other names can follow any convention, but we recommend capitalizing words.
MD indicates that this is a MIDAS file, so there is no need to include Data in the filename anymore. Filenames thus take the form MD-Tag1.csv or MD-Tag1-Tag2.csv, where Tag1 is a general description tag (possibly containing _) and Tag2 is a variant of Tag1. For instance, Tag1 could be Toy and Tag2 a name that distinguishes different Toy data sets.
    MD-Toy.csv
    MD-Toy-variant1.csv
    MD-LiverDream.csv
    MD-LiverDREAM.csv
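The proposed convention is simple enough to check automatically. The following is a rough sketch, not an official validator: `check_filename` and `NAME_PATTERN` are hypothetical names, tags are assumed to contain only letters, digits and underscores, and the test for the word Data is a plain case-insensitive substring search, so a tag such as Metadata would also be rejected.

```python
import re
from typing import List

# Assumed shape: "MD", then one or more dash-separated tags, then ".csv".
NAME_PATTERN = re.compile(r"^MD(-[A-Za-z0-9_]+)+\.csv$")


def check_filename(name: str) -> List[str]:
    """Return a list of problems; an empty list means the name conforms."""
    problems = []
    if not name.startswith("MD-"):
        problems.append("must start with 'MD-'")
    if not name.endswith(".csv"):
        problems.append("must use the .csv extension")
    if "data" in name.lower():
        problems.append("must not contain the word Data")
    # Only check the overall shape once the basic rules are satisfied.
    if not problems and not NAME_PATTERN.match(name):
        problems.append("tags must be dash-separated letters, digits or _")
    return problems


for candidate in ["MD-Toy.csv", "MD-Toy-variant1.csv", "ToyData.csv"]:
    issues = check_filename(candidate)
    print(candidate, "OK" if not issues else "; ".join(issues))
```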