Data Curation

EMAGE curation staff take incoming data from many sources, check it for errors and correct them, assess it for consistency, and convert it into a standard format whose added structure allows subsequent data interrogation and exchange. For an in-depth report about data curation at EMAGE, please refer to the EMAGE Case Study commissioned by the Digital Curation Centre SCARP project.

The information aspects we check and standardise for every EMAGE entry (based on both the supplied information and extra information we source) are:

The assayed Gene/Protein name/symbol
The stage of each specimen (assigned based on morphological criteria)
The strain and any mutations/alleles of each specimen
The probe/antibody, defined as completely as possible (including full sequences where possible)
The experimental methods used
The sites of expression described by the data submitter, transferred to matching terms in the EMAP anatomy ontology
The spatial annotation, performed onto an EMAP embryo model
The correctness of all references, links, submitter information, etc.
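
The checklist above can be pictured as a simple record with one field per aspect. The following is a minimal sketch for illustration only: the class name `CuratedEntry`, its field names, and the helper `outstanding_aspects` are hypothetical and do not describe the actual EMAGE database schema or curation tools.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class CuratedEntry:
    """Hypothetical record with one field per curated aspect (not the EMAGE schema)."""
    gene_or_protein: Optional[str] = None     # assayed gene/protein name or symbol
    stage: Optional[str] = None               # stage assigned on morphological criteria
    strain_and_alleles: Optional[str] = None  # strain plus any mutations/alleles
    probe_or_antibody: Optional[str] = None   # probe/antibody definition
    methods: Optional[str] = None             # experimental methods used
    anatomy_terms: Optional[str] = None       # matching EMAP anatomy ontology terms
    spatial_annotation: Optional[str] = None  # annotation on an EMAP embryo model
    references_checked: bool = False          # references, links, submitter details


def outstanding_aspects(entry: CuratedEntry) -> List[str]:
    """Return the names of aspects that are still empty or unchecked."""
    return [f.name for f in fields(entry) if not getattr(entry, f.name)]
```

For example, `outstanding_aspects(CuratedEntry(gene_or_protein="MGI:99604"))` lists the seven aspects that have not yet been filled in.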

During the curation process we refer to several external sources that are the accepted authorities regarding each aspect of the information:

Information aspect	Source and Comments

Gene or Protein Symbol and Name	MGI gene/protein symbols and names. Mouse gene/protein name and symbol information is assigned according to the guidelines of the Mouse Gene Nomenclature Committee and maintained by staff at MGI. The data includes Gene Name, Gene Symbol and a unique identifier. At EMAGE we assign the correct gene/protein ID to incoming data (e.g. MGI:99604).

Mouse Strains	MGI Mouse strain information. Mouse strain information is maintained by staff at MGI. At EMAGE we assign the strain name in MGI-format to incoming data (e.g. "129S2/SvPas * C57BL/6 * CD-1").

Mouse Alleles	MGI mouse allele information. Mouse allele information is maintained by staff at MGI. At EMAGE we assign the correct allele ID to incoming data (e.g. MGI:3702935).

Nucleic Acid Sequences	INSDC sequence database. We use versioned INSDC sequence identifiers when referring to nucleic acid sequences (e.g. NM_021459.4).

Amino Acid Sequences	NCBI protein sequence database. We use versioned NCBI sequence identifiers when referring to amino acid sequences (e.g. NP_067434.3).

Probes or Antisera	MGI database. If a probe or antibody has been previously described by MGI curators, we use the MGI ID (e.g. MGI:1334951). Otherwise we assign a new ID in-house. These are displayed in EMAGE as "GeneNameprobeA", "ProteinNameAntibodyB" etc.

Mouse Embryo Anatomy Descriptions	EMAP Mouse Anatomy Ontology. We express all text-based descriptions of sites of expression using terms from the EMAP mouse anatomy ontology.
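
The identifier styles quoted in the table lend themselves to simple format checks. The sketch below is illustrative only and is not EMAGE's curation code. It assumes that an MGI ID is the prefix `MGI:` followed by digits, that a versioned sequence identifier is an accession followed by a dot and an integer version, and that `*` separates the component strains in a strain name. The patterns are inferred from the examples above rather than from any official specification, and the function names are hypothetical.

```python
import re
from typing import List, Optional, Tuple

# Assumed shapes, inferred from the examples in the table above.
MGI_ID = re.compile(r"MGI:[0-9]+")
VERSIONED_ID = re.compile(r"(?P<accession>[A-Z]+_?[0-9]+)\.(?P<version>[0-9]+)")


def is_mgi_id(text: str) -> bool:
    """True if the whole string looks like an MGI identifier, e.g. MGI:99604."""
    return MGI_ID.fullmatch(text) is not None


def split_versioned_id(text: str) -> Optional[Tuple[str, int]]:
    """Split 'NM_021459.4' into ('NM_021459', 4); return None if unversioned."""
    match = VERSIONED_ID.fullmatch(text)
    if match is None:
        return None
    return match.group("accession"), int(match.group("version"))


def split_strain_name(text: str) -> List[str]:
    """Split a strain name on '*' and trim surrounding whitespace."""
    return [part.strip() for part in text.split("*") if part.strip()]
```

Checks like these only confirm that an identifier is well formed. Whether it refers to the correct gene, allele or sequence still has to be confirmed against the source database.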

During our spatial annotation procedure, we also comment on the clarity of the expression pattern seen in the image and the morphological match between the data embryo and the EMAP embryo template which houses the spatial annotation.
