Data Curation
EMAGE curation staff take incoming data from many sources, check it for errors and correct them, assess it for consistency, and convert it into a standard format whose added structure allows subsequent data interrogation and exchange. For an in-depth report about Data Curation at EMAGE, please refer to the EMAGE Case Study commissioned by the Digital Curation Centre SCARP project.
The information aspects we check and standardise for every EMAGE entry (based on both the supplied information and extra information we source) are:
- The assayed Gene/Protein name/symbol
- The stage of each specimen (assigned based on morphological criteria)
- Defining the strain and any mutations/alleles of each specimen
- Defining as completely as possible the probe/antibody (including full sequences where possible)
- Defining the experimental methods used
- Transferring the sites of expression as described by the data submitter to matching terms in the EMAP anatomy ontology
- Performing a spatial annotation onto an EMAP embryo model
- Checking that all references, links, submitter information, etc. are correct
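The checklist above can be pictured as a record with one slot per item, plus a function that reports which items are still outstanding. The following is a minimal sketch only: the class name, field names, and the rule that alleles may be left empty are assumptions made for illustration and do not describe EMAGE's actual data model or software.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class CurationRecord:
    """Hypothetical record; field names are illustrative, not EMAGE's schema."""
    gene_or_protein: Optional[str] = None       # assayed gene/protein name or symbol
    stage: Optional[str] = None                 # stage assigned from morphological criteria
    strain: Optional[str] = None                # strain of the specimen
    alleles: List[str] = field(default_factory=list)  # mutations/alleles, if any
    probe_or_antibody: Optional[str] = None     # probe/antibody definition
    methods: Optional[str] = None               # experimental methods used
    anatomy_terms: List[str] = field(default_factory=list)  # matched ontology terms
    spatial_annotation_done: bool = False       # annotated onto an embryo model
    references_checked: bool = False            # references, links, submitter info


# Assumption: a specimen may carry no mutations, so an empty allele list is allowed.
OPTIONAL_FIELDS = {"alleles"}


def missing_items(record: CurationRecord) -> List[str]:
    """Return the names of checklist items that have not been filled in yet."""
    missing = []
    for f in fields(record):
        if f.name in OPTIONAL_FIELDS:
            continue
        value = getattr(record, f.name)
        if value is None or value is False or value == "" or value == []:
            missing.append(f.name)
    return missing


# Usage: a partly curated entry (placeholder values, not real data).
entry = CurationRecord(gene_or_protein="GeneX", stage="stage-placeholder")
outstanding = missing_items(entry)
```

In this sketch `outstanding` lists the remaining items in checklist order, so a curator-facing tool could display them as a to-do list.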
During the curation process we refer to several external sources that are the accepted authorities regarding each aspect of the information:
Information aspect | Source and Comments |
Gene or Protein Symbol and Name | MGI gene/protein symbols and names. Mouse gene/protein name and symbol information is assigned according to the guidelines of the Mouse Gene Nomenclature Committee and maintained by staff at MGI. The data includes Gene Name, Gene Symbol and a unique identifier. At EMAGE we assign the correct gene/protein ID to incoming data (e.g. MGI:99604). |
Mouse Strains | MGI Mouse strain information. Mouse strain information is maintained by staff at MGI. At EMAGE we assign the strain name in MGI-format to incoming data (e.g. "129S2/SvPas * C57BL/6 * CD-1"). |
Mouse alleles | Mouse allele information. Mouse allele information is maintained by staff at MGI. At EMAGE we assign the correct allele ID to incoming data (e.g. MGI:3702935). |
Nucleic Acid Sequences | INSDC sequence database. We use versioned INSDC sequence identifiers when referring to nucleic acid sequences (e.g. NM_021459.4). |
Amino Acid Sequences | NCBI protein sequence database. We use versioned NCBI sequence identifiers when referring to amino acid sequences (e.g. NP_067434.3). |
Probes or antisera | MGI database. If a probe or antibody has been previously described by MGI curators, we use the MGI ID (e.g. MGI:1334951). Otherwise we assign a new ID in house. These are displayed in EMAGE as "GeneNameprobeA", "ProteinNameAntibodyB" etc. |
Mouse Embryo Anatomy Descriptions | EMAP Mouse Anatomy Ontology. We describe all text-based descriptions of sites of expression using the EMAP mouse anatomy ontology. |
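The identifier styles shown in the table lend themselves to simple format checks before a value is looked up at the authoritative source. The sketch below is illustrative only: the regular expressions are inferred from the examples in the table (such as MGI:99604 and NM_021459.4), the function names are hypothetical, and treating " * " as a separator between strain names is an assumption. A format check does not confirm that an identifier actually exists in MGI, INSDC, or NCBI.

```python
import re
from typing import List

# Assumed pattern: the literal prefix "MGI:" followed by digits, as in MGI:99604.
MGI_ID = re.compile(r"MGI:\d+")

# Assumed pattern: letters, an optional underscore, digits, then ".version",
# as in NM_021459.4 or NP_067434.3.
VERSIONED_ACCESSION = re.compile(r"[A-Z]+_?\d+\.\d+")


def is_mgi_id(text: str) -> bool:
    """True if the whole string looks like an MGI identifier."""
    return MGI_ID.fullmatch(text) is not None


def is_versioned_accession(text: str) -> bool:
    """True if the whole string looks like an accession with a version suffix."""
    return VERSIONED_ACCESSION.fullmatch(text) is not None


def split_strain_names(text: str) -> List[str]:
    """Split a strain string on '*' and trim whitespace (assumed separator)."""
    return [part.strip() for part in text.split("*") if part.strip()]


# Usage with the examples given in the table.
checks = {
    "MGI:99604": is_mgi_id("MGI:99604"),
    "NM_021459.4": is_versioned_accession("NM_021459.4"),
    "NM_021459": is_versioned_accession("NM_021459"),  # no version suffix
}
strains = split_strain_names("129S2/SvPas * C57BL/6 * CD-1")
```

Rejecting an unversioned accession, as in the third check, reflects the table's statement that versioned identifiers are used for sequences.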
During our spatial annotation procedure, we also comment on the clarity of the expression pattern seen in the image and the morphological match between the data embryo and the EMAP embryo template which houses the spatial annotation.