PdumBase

The Platynereis dumerilii database

Dr. Schneider's Lab


PdumBase, the Schneider lab's online Platynereis dumerilii database, provides a comprehensive and versatile tool for investigating stage-specific transcriptional inputs during embryogenesis and across the life cycle of the annelid Platynereis dumerilii and other selected species (e.g. Danio rerio, Xenopus tropicalis, Nematostella vectensis, Strongylocentrotus purpuratus).

This document provides a brief description of the database content and a detailed guide to browsing its data through example searches. The tutorial is intended as a hands-on introduction to the features PdumBase offers as an online resource for integrating and visualizing our data and findings.

Download a PDF version of this manual here


Section 1: Database Content

This section explains the content and structure of the database. It first describes the raw RNA-Seq data sets, then their corresponding expression data and associated annotation profiles. It then introduces the gene expression profiling features, the Platynereis-specific coexpression networks, and the comparative transcriptome data.

1. RNA-Seq data sets

Early stages data set:


RNA-seq data generated by the Schneider lab. Description: This data set is the first comprehensive transcriptome draft of early development in Platynereis dumerilii, built with a de novo assembly strategy. We performed mRNA deep sequencing of distinct stages using the Illumina HiSeq sequencing system with read lengths of 75 bp to 100 bp (H.-C. Chou et al., 2016).

Time points: 2, 4, 6, 8, 10, 12, 14 hours post fertilization (hpf). Each stage has two biological replicates. The depth of these libraries ranges from 40 to 120 million paired-end reads (see Table 1).

Table 1. Early Stages Data Set

Time points from Early Stages - data set

| Time (hpf) | Description | Time (hpf) | Description |
|---|---|---|---|
| 2 | Zygote | 10 | ~140-cell |
| 4 | ~8-cell | 12 | ~220-cell |
| 6 | ~30-cell | 14 | ~330-cell |

Assembly: All the biological replicates, which together contain about 1.5 billion reads, were assembled into 357,961 transcripts in a genome-independent manner. Because alternative splicing produces several transcripts per gene, these transcripts correspond to 193,310 genes.

Later stages data set:

RNA-seq data generated by the Jékely lab, MPI for Developmental Biology, Tübingen, Germany (M. Conzelmann et al., 2013).

Description: In comparison with the earlier stages data, the later stages data set is of lower quality and lower sequencing depth.

Time points: This data set consists of 10 time points from 24 hours post fertilization to 3-month-old adults. It also includes female and male RNA-seq samples. There are no biological replicates (Table 2).


Table 2. Later stages data set. Time points are shown in hours post fertilization (hpf), days (d) and months (M).

Time points from Later Stages - data set

| Time point | Description |
|---|---|
| 24 hpf | Early trochophore larvae |
| 36 hpf | Mid trochophore larvae |
| 48 hpf | Early metatrochophore larvae |
| 72 hpf | Early nectochaete larvae |
| 4 d | Mid nectochaete larvae |
| 10 d | Errant juvenile |
| 15 d | 3-segmented errant juvenile |
| 1 Mpre | |
| 1 Mpost | |
| 3 M | Adult |
| Male | Sexually mature adult |
| Female | Sexually mature adult |



2. Expression data

The PdumBase web interface displays the mean FPKM (fragments per kilobase of transcript per million mapped reads) as the default measurement of gene expression. The FPKM for each replicate was obtained by normalizing the number of reads mapped to a transcript by the transcript length and by the total number of mappable reads in the sample. A transcript or gene is considered expressed if its FPKM is > 1. Furthermore, the FPKM for each stage was obtained by combining the replicates into a single set.
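The FPKM normalization and the expression threshold described above can be illustrated with a minimal Python sketch. The read counts, transcript length and library sizes are invented for illustration, and averaging the two replicate values is an assumption about how a mean FPKM could be formed; it is not necessarily the exact procedure used to build PdumBase.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    kilobases = transcript_length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragments / (kilobases * millions)


# Hypothetical values for one transcript in the two replicates of one stage.
replicates = [
    {"fragments": 500, "length_bp": 2_000, "total_mapped": 50_000_000},
    {"fragments": 900, "length_bp": 2_000, "total_mapped": 60_000_000},
]

values = [
    fpkm(r["fragments"], r["length_bp"], r["total_mapped"]) for r in replicates
]

# Assumption: the stage value is the simple average of the replicate FPKMs.
mean_fpkm = sum(values) / len(values)

# A transcript is considered expressed if its FPKM is > 1.
is_expressed = mean_fpkm > 1
```

With these invented numbers the two replicates give FPKM values of 5.0 and 7.5, so the transcript would count as expressed.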


Figure 1: PdumBase Search results interface displays mean FPKM as the measurement of absolute expression.

The search results page displays the mean FPKM values as the default measurement of gene expression (see Figure 1). However, FPKM values from individual samples, as well as the raw counts of each transcript, can also be retrieved by clicking on the "Expression data" tab after selecting a particular transcript of interest (Figure 2). For more information, we refer the reader to the Tutorial Examples section.


Figure 2: PdumBase Expression data tab interface: (A) The upper frame displays mean FPKM and raw count data from the pooled samples. (B) The lower frame displays expression data from individual replicates.

3. Annotation

This section describes the different annotations and how they were sourced from external databases to support convenient browsing and data exploration specific to Platynereis dumerilii.

UniProt Annotation

The PdumBase search results interface retrieves the UniProt annotation data, displaying the UniProt accession number, gene name, protein name, the species of annotation origin, and the E-value (see Figure 3).

The annotation was performed with BLASTP by aligning the predicted open reading frames (ORFs) of the transcripts against the non-redundant Swiss-Prot database. A total of 31,806 transcripts (17,213 genes) retrieved at least one hit at an E-value cutoff of 10^-10. Among the annotated transcripts, 26% aligned to human and 19% to mouse proteins.
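To make the E-value cutoff concrete, the following Python sketch keeps, for each query transcript, the best hit that passes a cutoff of 1e-10. It assumes BLAST tabular output with the default twelve columns (the `-outfmt 6` layout); whether the PdumBase annotation pipeline used this output format is an assumption, and the transcript and protein identifiers are invented.

```python
import csv
import io

EVALUE_CUTOFF = 1e-10

# Default column order of BLAST tabular output (-outfmt 6).
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(tabular_text, cutoff=EVALUE_CUTOFF):
    """Return {query: (subject, evalue)} for the lowest-E-value hit per query."""
    best = {}
    reader = csv.DictReader(
        io.StringIO(tabular_text), fieldnames=COLUMNS, delimiter="\t"
    )
    for row in reader:
        evalue = float(row["evalue"])
        if evalue > cutoff:
            continue  # hit does not pass the E-value cutoff
        current = best.get(row["qseqid"])
        if current is None or evalue < current[1]:
            best[row["qseqid"]] = (row["sseqid"], evalue)
    return best


# Hypothetical hits: tx1 has two hits below the cutoff, tx2 has none.
example = "\n".join([
    "tx1\tprotA\t62.1\t300\t110\t2\t1\t300\t5\t304\t1e-120\t410",
    "tx1\tprotB\t40.0\t250\t150\t3\t10\t260\t1\t250\t1e-30\t120",
    "tx2\tprotC\t30.5\t90\t60\t1\t1\t90\t1\t90\t1e-5\t45",
])

annotated = best_hits(example)
```

In this example only `tx1` is annotated, with `protA` as its best hit; `tx2` is dropped because its only hit has an E-value above the cutoff.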

Pfam Annotation

We also annotated for potential protein domains by aligning all transcripts against the Pfam database. The Pfam annotation can be accessed in the database web interface by selecting the option "Show detailed annotation" on the search results page, or by clicking on the tab "Annotation" after having selected a particular transcript from the result interface (see Figure 3).

Annotation was performed using HMMER. We were able to assign Pfam domains to 32,464 transcripts (18,146 genes), identifying a total of 431,701 Pfam domains. Of the transcripts with domain annotations, 28,326 transcripts (15,690 genes) were also present in the UniProt BLASTP annotation.


Figure 3: PdumBase Search results interface displays Uniprot annotation data on the rightmost panel. Annotation data includes accession number, gene name, protein name, species and E-value. Clicking on the accession number will redirect to the UniProt page for that particular protein.

KEGG Pathways Annotation

Identifying the biological pathways that are active in early stages is crucial for deciphering the mechanisms involved in the diversification of embryonic cells. The Kyoto Encyclopedia of Genes and Genomes (KEGG) provides well-annotated pathway databases covering metabolism, genetic information processing, and cellular processes.

Our assembled transcripts were mapped to KEGG pathways. In total, 18,532 transcripts (10,132 genes) are associated with the known KEGG pathways.

In our database, the KEGG annotation is accessible by selecting the option "Show detailed annotation" as seen in Figure 4.


Figure 4: PdumBase Search results interface: (A) The search results page lets the user customize the information displayed by checking one or more options in the upper left corner. (B) Selecting the option "Show detailed annotation" will show detailed Gene Ontology, KEGG pathway and protein domain annotation.

Gene Ontology Annotation

The assembled transcripts were also annotated with Gene Ontology (GO) terms of homologous genes. A total of 30,287 transcripts (16,498 genes) could be associated with at least one annotated GO term. The GO annotation is strongly enriched for terms associated with transcription and regulatory activity in the biological process and molecular function categories. GO terms related to cell differentiation, such as “cell transduction”, “cell adhesion”, “cell division” and “cell cycle”, are also enriched.

All annotation information for a given transcript is summarized and displayed in the annotation tab interface (see Figure 5).

An important feature of our database is that the search interface accepts searches under Blast Info, Pfam, Gene Ontology, and KEGG Pathway, making it possible to narrow down a request by a particular annotation of interest.


Figure 5: PdumBase Annotation tab interface. This tab becomes available once an entry from the results page has been selected: click on a gene or transcript of interest, then on the Annotation tab.

4. Gene Expression Profiling

PdumBase includes detailed gene expression profiling of the early developmental stages (2 to 14 hpf). An expression profile can be interpreted as the change in the abundance of a transcript over time.

Plots depicting these fluctuations of transcript abundance (FPKM) are shown for each transcript. These plots can be accessed via the option "Show Plots" on the search results page (Figure 6), or by clicking on the transcript of interest and selecting the tab labeled "Plots" (Figure 7).


Figure 6: PdumBase Search result interface. Expression profile plots are displayed when the option "Show plots" is selected.

Figure 7: PdumBase Plot tab interface. Shows the expression profile plot for a given transcript.
For the purpose of the expression profiling analysis we filtered out low expression transcripts. Among the assembled transcripts with predicted ORF, 18,940 transcripts and 13,160 genes were found to be expressed in at least one of the 6 stages. After clustering the genes according to their expression profile, we found a total of 15 distinct clusters (see Figure 8).
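The filtering step can be sketched in Python as follows. The sketch keeps genes whose FPKM exceeds 1 at one or more time points and then standardizes each retained profile so that profiles can be compared by shape rather than by absolute level. The standardization (z-scores) and the gene names and values are assumptions made for illustration; the clustering method used for PdumBase is not specified here and is not reproduced.

```python
from statistics import mean, pstdev

# Time points listed for the early stages data set (hours post fertilization).
STAGES_HPF = [2, 4, 6, 8, 10, 12, 14]


def is_expressed(profile, threshold=1.0):
    """True if the FPKM exceeds the threshold at one or more time points."""
    return any(value > threshold for value in profile)


def standardize(profile):
    """Convert a profile to z-scores; a flat profile becomes all zeros."""
    mu = mean(profile)
    sigma = pstdev(profile)
    if sigma == 0:
        return [0.0 for _ in profile]
    return [(value - mu) / sigma for value in profile]


# Hypothetical FPKM profiles, one value per time point in STAGES_HPF.
profiles = {
    "gene_a": [40.0, 35.0, 20.0, 10.0, 4.0, 2.0, 1.0],  # declining over time
    "gene_b": [0.0, 0.0, 0.5, 3.0, 9.0, 20.0, 30.0],    # rising over time
    "gene_c": [0.1, 0.2, 0.0, 0.3, 0.1, 0.0, 0.2],      # never above 1 FPKM
}

expressed = {g: p for g, p in profiles.items() if is_expressed(p)}
standardized = {g: standardize(p) for g, p in expressed.items()}

# Time point at which each retained gene reaches its highest FPKM.
peak_stage = {g: STAGES_HPF[p.index(max(p))] for g, p in expressed.items()}
```

Here `gene_c` is filtered out, while `gene_a` peaks at 2 hpf and `gene_b` at 14 hpf, which loosely mirrors the declining and rising patterns that the clusters separate.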


Figure 8: Heat map of 13,160 expressed genes clustered into 15 groups according to the time series patterns.
Clusters 1-4 show a clear maternal signature, with a total of 4,302 genes belonging to this group. Clusters 10-15 (5,827 genes) correspond to zygotic genes with slightly different activation time points. Clusters 3 and 11 are the major maternal and zygotic groups, showing slowly decreasing and slowly increasing expression patterns, respectively. Clusters 6, 7, and 8 contain genes whose RNAs were mainly expressed at 4, 6 and 8 hours and degraded after these stages. Cluster 9 is a less dynamic group, showing stable expression throughout all stages.

The cluster information can be accessed by selecting the option "Show other info" on the search results page and then, for a selected transcript/gene, clicking on the icon under "Coexpression info". The first tab of the new results page will display all the genes in the same cluster, along with other expression data (see Figure 9).


5. Coexpression Networks

A coexpression network is a correlation network that describes the pairwise correlation patterns of expression data. When a set of genes are highly correlated, they may share similar biological function or be involved in the same biological pathway. A coexpression network can also be used for identifying hub genes which have high connectivity to other genes in a cluster. We used weighted correlation network analysis (WGCNA) to analyze Platynereis dumerilii expression profiling data.

For this analysis, we included a total of 13,192 genes whose FPKM was > 1 in at least one sample. Correlation values and topology overlap for the coexpression networks can be found in the database on the Coexpression information interface. This page can be reached from the search results interface by selecting the option "Show other info" and clicking on the icon under the column "Coexpression info" in the results table. The Coexpression information interface is shown in Figure 9.
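The two quantities reported on this interface, correlation and topology (topological) overlap, can be illustrated with a small pure-Python sketch of the standard unsigned WGCNA definitions: the adjacency between two genes is the absolute Pearson correlation raised to a soft-threshold power, and the topological overlap combines the direct adjacency with the connections the two genes share. The soft-threshold power of 6 and the three expression profiles are assumptions for illustration only; the parameters actually used for PdumBase are not stated here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def adjacency_matrix(profiles, beta=6):
    """Unsigned adjacency a_ij = |cor(i, j)| ** beta, with a zero diagonal."""
    n = len(profiles)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                adj[i][j] = abs(pearson(profiles[i], profiles[j])) ** beta
    return adj


def topological_overlap(adj):
    """TOM_ij = (sum_u a_iu * a_uj + a_ij) / (min(k_i, k_j) + 1 - a_ij)."""
    n = len(adj)
    connectivity = [sum(row) for row in adj]  # k_i; the diagonal is zero
    tom = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            shared = sum(
                adj[i][u] * adj[u][j] for u in range(n) if u not in (i, j)
            )
            denominator = min(connectivity[i], connectivity[j]) + 1 - adj[i][j]
            tom[i][j] = (shared + adj[i][j]) / denominator
    return tom


# Hypothetical FPKM profiles over seven time points.
profiles = [
    [1, 2, 3, 4, 5, 6, 7],  # gene 0
    [2, 3, 5, 4, 6, 8, 9],  # gene 1, closely follows gene 0
    [5, 1, 4, 2, 6, 3, 4],  # gene 2, largely unrelated
]

adj = adjacency_matrix(profiles)
tom = topological_overlap(adj)
```

In this toy network genes 0 and 1 receive a high adjacency and topological overlap, while gene 2 is only weakly connected to either of them. Raising the correlation to a power suppresses weak correlations, which is why the overlap with gene 2 is close to zero.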


Figure 9: PdumBase Coexpression information interface. Displays all the transcripts/genes in the same cluster of a given component, shows protein name, correlation, topology overlap, and expression data.

6. Comparative Transcriptome Data

Ortholog Expression

To identify conserved stages of development, we gathered publicly available expression data from five species, identified their orthologs relative to Platynereis dumerilii (see Tables 3 and 4), and established global comparisons of expression profiles among the ortholog groups.

Table 3. Species and protein sequences

Species and number of protein sequences for comparative analysis

| Species | Number of sequences |
|---|---|
| Platynereis dumerilii | 28,580 |
| Danio rerio | 26,241 |
| Xenopus tropicalis | 18,442 |
| Homo sapiens | 23,393 |
| Nematostella vectensis | 27,273 |
| Ascaris suum | 15,446 |


The ortholog expression data for a particular Platynereis dumerilii transcript can be found in our database by selecting the option "Show other info" and clicking on the icon under the column "Ortholog Expressions" for the transcript of interest. The resulting interface displays the ID number and expression data for the orthologs found for that transcript/gene in the other 5 species (see Figure 10).



Figure 10: PdumBase Ortholog expression profile interface. Displays expression profile plots and expression data for the selected Platynereis dumerilii gene and the orthologous genes found in other species, along with their expression and annotation data (when available).


Table 4. Orthologous genes found between species.

Number of orthologous genes between the 6 species

| Species | Platynereis dumerilii | Danio rerio | Xenopus tropicalis | Homo sapiens | Nematostella vectensis | Ascaris suum |
|---|---|---|---|---|---|---|
| Platynereis dumerilii | - | 5,635 | 5,402 | 5,051 | 5,840 | 3,654 |
| Danio rerio | | - | 10,784 | 10,246 | 6,731 | 4,307 |
| Xenopus tropicalis | | | - | 10,284 | 6,415 | 4,140 |
| Homo sapiens | | | | - | 6,094 | 3,941 |
| Nematostella vectensis | | | | | - | 4,245 |
| Ascaris suum | | | | | | - |



Ortholog Groups

We also identified orthologous genes for 18 selected species (Table 5) using the program OrthoMCL. This program runs all-versus-all BLASTP queries among all the protein sequences from these 18 species and selects the best reciprocal BLAST hits. Once the orthologous genes were identified, phylogenetic trees were assembled using RAxML.
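The idea of a reciprocal best hit can be shown with a short Python sketch: two proteins from different species are paired only if each is the other's best-scoring hit. This covers only the reciprocal-best-hit step mentioned above; OrthoMCL itself performs further steps (such as score normalization and clustering) that are not reproduced here. The protein identifiers and scores are invented.

```python
def best_hit_per_query(hits):
    """Map each query to its highest-scoring subject.

    `hits` is an iterable of (query, subject, score) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a and b are each other's best hit."""
    best_ab = best_hit_per_query(a_vs_b)
    best_ba = best_hit_per_query(b_vs_a)
    return sorted(
        (a, b) for a, b in best_ab.items() if best_ba.get(b) == a
    )


# Hypothetical BLASTP results between two species (query, subject, score).
pdu_vs_dre = [
    ("pdu_1", "dre_1", 300),
    ("pdu_1", "dre_2", 150),
    ("pdu_2", "dre_2", 90),
]
dre_vs_pdu = [
    ("dre_1", "pdu_1", 310),
    ("dre_2", "pdu_1", 140),
    ("dre_2", "pdu_2", 80),
]

pairs = reciprocal_best_hits(pdu_vs_dre, dre_vs_pdu)
```

Only `pdu_1` and `dre_1` form a reciprocal pair in this example: `pdu_2` has `dre_2` as its best hit, but the best hit of `dre_2` is `pdu_1`, so the relationship is not reciprocal.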

Table 5. Species included in ortholog group analysis

Species and number of genes used to find ortholog groups

| Class | Code | Species | Number of genes |
|---|---|---|---|
| Lophotrochozoa | pdu | Platynereis dumerilii | 28,580 |
| Lophotrochozoa | cte | Capitella teleta | 32,415 |
| Lophotrochozoa | hro | Helobdella robusta | 23,423 |
| Lophotrochozoa | lgi | Lottia gigantea | 23,851 |
| Lophotrochozoa | cgi | Crassostrea gigas | 26,089 |
| Ecdysozoa | dpu | Daphnia pulex | 30,907 |
| Ecdysozoa | tca | Tribolium castaneum | 16,524 |
| Ecdysozoa | dme | Drosophila melanogaster | 13,937 |
| Deuterostomia | spu | Strongylocentrotus purpuratus | 20,759 |
| Deuterostomia | sko | Saccoglossus kowalevskii | 34,239 |
| Deuterostomia | bfo | Branchiostoma floridae | 50,817 |
| Deuterostomia | dre | Danio rerio | 26,459 |
| Deuterostomia | xtr | Xenopus tropicalis | 18,442 |
| Deuterostomia | hsa | Homo sapiens | 23,393 |
| Prebilateria | nve | Nematostella vectensis | 27,273 |
| Prebilateria | aqu | Amphimedon queenslandica | 29,883 |
| Prebilateria | tad | Trichoplax adhaerens | 211,520 |
| Preanimalia | mbr | Monosiga brevicollis | 9,196 |


To access the orthologous genes for a given Platynereis dumerilii transcript/gene, select the option "Show other info". If ortholog groups are found for that particular transcript, a check-mark will appear under the field "Ortholog groups". Clicking on this icon will open a new interface with four tabs: "List", "Tree-ML", "Tree-Parsimony", and "Alignment" (see Figures 11, 12, and 13).


Figure 11: PdumBase List tab interface. Shows the species list, code, name and ortholog protein ID, and contains links to access/download the protein and cDNA sequences in FASTA format.



Figure 12: PdumBase Ortholog groups interface: (A) Phylogenetic tree among ortholog genes displayed under Tree-ML tab (B) Phylogenetic tree displayed under Tree-Parsimony tab. Both trees show the species code and the transcript/gene ID.

Figure 13: PdumBase Alignment tab interface under Ortholog groups. Displays CLUSTAL 2.1 multiple sequence alignment.



Section 2: Tutorial Examples



Example 1. Searching By Keyword



This section demonstrates some of the PdumBase features through example searches using the Blast Info search function.

Search

The search interface accepts searches under different criteria: By keyword, Pfam, SignalP, TMHMM, eggNOG, Gene Ontology, and KEGG Pathway (Figure 14). Searching under different or combined fields lets users customize the search to their needs.


Figure 14: PdumBase Search interface


In addition, the search interface offers the option of selecting a sorting criterion to order the results by the expression values from any stage (2 to 14 hpf) (Figure 15). This feature is particularly convenient when searching with terms that return many hits, such as "cell cycle", which retrieves more than 1000 genes, or "membrane", with around 500 hits. Searches for such general terms can take more than 60 seconds to load; please allow time for them to complete.

On the other hand, when searching for a particular gene name, for instance the transcription factor FoxA2 in the field By keyword, the most likely outcome will be one single hit displaying the Platynereis dumerilii transcript/gene with that particular annotation.


Figure 15: PdumBase Search interface. Searching for FoxA2


Search Results


By default, the search results interface displays the transcript or gene model ID, protein name, expression data as mean FPKM from early stages (2 to 14 hpf), expression data from the inhibitor experiment, and annotation information (Figure 16).

In addition, the results displayed can be expanded by selecting from the options in the upper left corner. Users can select one or more options according to their particular research needs (see also the Expanded Search Result Options section).


Figure 16: PdumBase Search results interface. Shows the Gene ID and expression data from early stages. The data retrieval options are found in the upper left corner.


Access to Detailed Information

Clicking on the gene model for FoxA2, "comp221418_co", will give access to the detailed data results interface. The detailed data results page has three tabs: Plot, Expression data and Annotation, from which different information can be accessed.

The Plot Tab

Clicking on the Plot tab will display expression profile data (FPKM values against stages) for early and late stages (Figure 17).


Figure 17: PdumBase Plot tab from Detailed data results interface. Displaying expression profile plots for FoxA2


The Expression Data Tab

The Expression data tab will show mean and individual sample FPKM values as well as raw counts (Figure 18).


Figure 18: PdumBase Expression data tab: (A) Displays expression data (mean FPKM and raw counts) from pooled samples from early stages of normal development. (B) Shows individual replicate expression data for early stages.

The Annotation Tab

Clicking the Annotation tab will retrieve a summary of all annotation-related information, including the species from which the annotation was obtained, GO extended annotation, KEGG pathways, eggNOG, and Pfam domains (see Figure 19).


Figure 19: PdumBase Annotation tab from Detailed data results interface. Displaying detailed annotation information for FoxA2


Expanded Search Result Options

The default search result output can be expanded by selecting the options provided in the search results interface (Figure 20).


Figure 20: PdumBase Search results interface. Checking the boxes of the search result options on the left will expand the results displayed.


Selecting "Show plots"

Selecting the "Show plots" option will retrieve a visual representation of the early and late stage expression profiles for all the Gene IDs displayed in the search results interface (Figure 21).


Figure 21: PdumBase Search results interface with the option "Show Plots" selected. Expression plots for both early and late stages are shown for the gene under search: FoxA2


Selecting "Show later stages"

To display the mean expression data (FPKM) from later stages of development (24 hpf to 3M), select the option "Show later stages" as shown in Figure 22.


Figure 22: PdumBase Search results interface with the option "Show later stages" selected. Here the later stages expression data for FoxA2 is displayed.


Selecting "Show other info"

Clicking "Show other info" provides access to additional data on comparative transcriptomics (see Figure 23):

  • Ortholog Expressions - if available, a green check-mark icon will be displayed.
  • Ortholog groups - if available, a green check-mark icon will be displayed.
  • Coexpression info - if available, a blue icon will be displayed.

Figure 23: PdumBase Search results interface with the "Show other info" option selected. Additional information links are displayed.


Note that the additional data is not available for all gene models, but only for those transcripts for which orthologous genes were identified. See Table 4 for the estimated numbers of orthologs found.

Coexpression link

Selecting the coexpression link gives access to data about the expression profiling and coexpression. The "same cluster" tab of this interface displays the Gene ID of all genes belonging to the cluster of the gene under search (see Figure 24).


Figure 24: PdumBase Coexpression info link displays the list of genes clustered with the gene under study. The expression profile of FoxA2 clusters with 7 other genes belonging to cluster 7.


Ortholog groups link

Clicking the Ortholog groups link gives access to an interface with three tabs: List, Tree-ML and Tree-Parsimony. As mentioned in the section "Comparative Transcriptome Data", 18 species were selected to assess the ortholog groups. The first tab shows the list of species in which orthologs were found for the searched gene. This interface also provides links to download the protein and cDNA sequences of the orthologs in FASTA format (see Figure 25).

The second and third tabs under the Ortholog groups link display phylogenetic trees based on ML and parsimony analysis, respectively. Figure 26 shows the Tree-ML for the FoxA2 orthologous genes.


Figure 25: PdumBase Search results interface with the "Show other info" option selected. The Ortholog groups link displays the list of species in which orthologs were found. For FoxA2, orthologs were found in all of the 18 selected species.



Figure 26: PdumBase Search results interface with the "Show other info" option selected. The Ortholog groups Tree-ML tab displays the phylogenetic tree constructed from the ortholog protein sequences. Tree-ML for FoxA2 orthologs among the 18 species.


Example 2. Searching for "Homeobox genes"

This final example will show a sample search with multiple results, indicating the options that our web database offers to download the data in case further analysis is required.

Finding the homeobox genes most highly expressed at 8 hpf

Searching for the term homeobox in the blast field of the search interface will retrieve 114 hits. To find the most highly expressed homeobox genes at 8 hpf, sort the hits by expression value at 8 hpf in descending order (see Figure 27).


Figure 27: PdumBase Search interface. Search fields required to find the homeobox genes most highly expressed at 8 hpf.
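For users who prefer to reproduce this ranking on downloaded data, the sorting step can be sketched in Python. The gene IDs, FPKM values and record layout below are invented for illustration and do not reflect the actual PdumBase export format.

```python
def top_by_stage(hits, stage_hpf, n=10):
    """Return the n hits with the highest FPKM at the given stage."""
    ranked = sorted(hits, key=lambda hit: hit["fpkm"][stage_hpf], reverse=True)
    return ranked[:n]


# Hypothetical search hits with mean FPKM values keyed by stage (hpf).
hits = [
    {"gene_id": "gene_001", "fpkm": {6: 1.2, 8: 12.5, 10: 30.1}},
    {"gene_id": "gene_002", "fpkm": {6: 0.0, 8: 48.0, 10: 5.3}},
    {"gene_id": "gene_003", "fpkm": {6: 7.7, 8: 0.4, 10: 0.1}},
    {"gene_id": "gene_004", "fpkm": {6: 2.0, 8: 21.9, 10: 19.0}},
]

top_two_at_8hpf = [hit["gene_id"] for hit in top_by_stage(hits, 8, n=2)]
```

With these values the two top-ranked genes at 8 hpf are `gene_002` and `gene_004`. Changing `stage_hpf` re-ranks the same hits for any other stage.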


Downloading results from PdumBase

An important feature of our web database is that search results can be downloaded in different formats: as a comma-separated values (CSV) file and as an Excel file. Furthermore, the protein sequences of the genes displayed in the results can be downloaded in FASTA format. Download links are found in the upper frame of the search results interface (see Figure 28).


Figure 28: PdumBase Search results interface. Here the results page displays the top ten hits, sorted by expression level at 8 hpf. Links to download data are shown with a floppy disk icon and are found in the upper frame.
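Downloaded FASTA files can be processed with a few lines of Python. The following minimal parser reads FASTA text into a dictionary of sequences keyed by the first word of each header line. The sequence identifiers and residues are invented, and the exact header layout of PdumBase downloads is an assumption.

```python
def parse_fasta(text):
    """Parse FASTA text into {identifier: sequence}."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word after ">" as the identifier.
            current = line[1:].split()[0]
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Hypothetical download containing two protein sequences.
example_fasta = """>pdu_example_1 hypothetical protein
MKTAYIAKQR
QISFVKSH
>dre_example_1
MSTNPKPQRK
"""

sequences = parse_fasta(example_fasta)
lengths = {name: len(seq) for name, seq in sequences.items()}
```

Sequence lines that belong to the same record are joined, so `pdu_example_1` is returned as a single 18-residue sequence.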

Concluding Remarks
Given the features presented here and its ease of use, we are confident that PdumBase, with its extensive content and user-friendly design, will provide the community with a reliable resource for transcriptome studies.


References
  • H.-C. Chou, M. M. Pruitt, B. R. Bastin, and S. Q. Schneider, “A transcriptional blueprint for a spiral-cleaving embryo,” BMC Genomics, vol. 17, no. 1, p. 552, Aug. 2016.
  • M. Conzelmann, E. A. Williams, K. Krug, M. Franz-Wachtel, B. Macek, and G. Jékely, “The neuropeptide complement of the marine annelid Platynereis dumerilii,” BMC Genomics, vol. 14, p. 906, 2013.