Pre Ensembl Release of Opossum genome |
3rd Dec 2004 |
We are pleased to announce the pre-Ensembl site for the first
preliminary assembly for Monodelphis domestica (the opossum genome).
The Pre-site is available at: http://pre.ensembl.org/Monodelphis_domestica/.
Project coordination, genome sequencing and assembly are provided
by the Broad Institute.
The assembly has a base coverage of approximately 7.19X, constructed from
19348 supercontigs, having N50 length 4047488. The total contig length is
3492108230, spanning 3559101070 bases (including gaps). The pre-site
offers BLAST and SSAHA access and a limited raw compute showing the
locations of Genscan ab initio predictions, raw BLAST hits, Eponine hits,
CpG islands and repeats. In addition, the M. domestica pre-site presents
preliminary protein-based gene models built by a cut-down Ensembl
genebuild pipeline.
As this is a preliminary site and the first marsupial assembly,
some programs will have been run with default parameters and so may give
unpredictable results on opossum.
 |
Ensembl pre-release: Xenopus tropicalis |
5th Nov 2004 |
We are pleased to announce the first pre-Ensembl site for a preliminary
assembly of Xenopus tropicalis. The pre-site is available
here.
The Xenopus tropicalis genome assembly 3.0 is the third of a series of
preliminary assembly releases by the JGI that are planned as part of
the ongoing X. tropicalis genome project. The current assembly includes
approximately 7X in small insert end-sequence coverage.
The assembly was constructed with the JGI assembler, Jazz, using paired
end sequencing reads. After trimming for vector and quality, 19.1
million reads assembled into 27,064 scaffolds totaling 1.63 Gbp.
Roughly half of the genome is contained in 392 scaffolds, each at least
1.2 Mb in length.
The assembly can be downloaded directly from JGI at:
http://genome.jgi-psf.org/frog4x1/frog4x1.home.html
 |
Ensembl version 26 released |
3rd Nov 2004 |
The Ensembl team are pleased to announce the release of version 26 of
Ensembl. This release includes a new human assembly and gene build in
addition to fixes/updates in other species.
New Data
Human
Human NCBI build 35 is the latest version of the human genome which
has a number of small gaps and rearrangements with respect to the
previous build (34), mainly in pericentromeric regions. Ensembl 26
contains a complete new gene build on this assembly, in which the
automated predictions are supplemented by some gene structures drawn
from manually-annotated resources such as Vega.
The new gene build has been carefully assessed against the
previous build. The assessment shows a decrease in entirely missing
genes (to only 85 missing cases from Swissprot), an increase in
complete (Met-to-STOP) predictions, and an increase in UTR-containing
transcripts. It also provided a list
of genes that can be improved, and we will be releasing a patched
set of gene predictions in December.
We expect to progressively improve the gene set over this year, and
we are interested in all reports of missing genes or incomplete
structures where there is data for the complete structure. Please
send a report via our helpdesk (helpdesk@ensembl.org).
- Core
- New NCBI35 assembly
- New gene build (including pseudo-, ncRNA, and mitochondrial genes)
- SNP
- dbSNP121 mapped to the new assembly
Note that there are currently no EST or ESTgene databases for
NCBI35.
The human database version reflects the new assembly version, e.g.
homo_sapiens_core_26_35.
Mouse
- Core
- marker_feature/marker_map_location fixed
- supercontig names fixed
- New Affymetrix probe mapping
- GO terms have been mapped to transcripts via UniProt
The mouse database version has been bumped to 33b, e.g.
mus_musculus_core_26_33b
Chicken
The chicken database version has been bumped to 1c, e.g.
gallus_gallus_core_26_1c
Rat
- Core
- New Affymetrix probe mapping
The rat database version has been bumped to 3d, e.g.
rattus_norvegicus_core_26_3d
Zebrafish
- Core
- New Affymetrix probe mapping
The zebrafish database version has been bumped to 4a, e.g.
danio_rerio_core_26_4a
Tetraodon
The Tetraodon database version has been bumped to 1a, e.g.
tetraodon_nigroviridis_core_26_1a
Multi-species
New comparative data (ensembl_compara_26_1)
- Human/Chimp BLASTZ_NET from UCSC (as well as BLASTZ_NET_TIGHT generated in-house)
- Human/Mouse BLASTZ_NET from UCSC (as well as BLASTZ_NET_TIGHT generated in-house)
- Human/Rat BLASTZ_NET from UCSC (as well as BLASTZ_NET_TIGHT generated in-house)
- Human/Chicken BLASTZ_NET from UCSC (as well as BLASTZ_NET_TIGHT generated in-house)
- Human/fugu TRANSLATED_BLAT
- Human/tetraodon TRANSLATED_BLAT
- Human/chicken TRANSLATED_BLAT
- Human/Zebrafish TRANSLATED_BLAT
- New synteny for Human/Chimp, Human/Mouse, Human/Rat, Human/Chicken
- Orthologues rebuilt
- Human paralogues rebuilt
- Protein clusters rebuilt. Multiple alignments run with MUSCLE on each
family, except family ENSF00000000041 that has been run with CLUSTALW.
Mart database (ensembl_mart_26_1)
Schema changes
Core
Probeset data has been moved from the misc_feature table to three
new tables. This both extends the data associated with probes and
greatly improves retrieval speed for other misc_features.
- 3 new tables: affy_feature, affy_probe, affy_array
- external_db table: db_name is now varchar(27)
- density_feature: added an index on seq_region_id
The 25 to 26 patch file is available in CVS at
ensembl/sql/patch_25_26.sql. In addition, the script
ensembl/sql/transfer_misc_affy.pl can be used to move the affy data
from misc_feature table to the new tables.
Compara
- synteny_region table: new column 'method_link_species_set_id'
- dnafrag_region table: 2 columns renamed from seq_start, seq_end to
dnafrag_start and dnafrag_end respectively.
- When possible primary keys are now UNSIGNED
- genomic_align_block_id changed to UNSIGNED BIGINT in genomic_align and
genomic_align_block tables
- genomic_align_id changed to UNSIGNED BIGINT in genomic_align and
genomic_align_group tables
- perc_id changed to UNSIGNED TINYINT in genomic_align_block table
- level_id changed to UNSIGNED TINYINT in genomic_align table
In ensembl-compara/sql/table.sql all foreign-key constraints are
now explicitly defined (even though MySQL ignores them).
Website Changes
ExportView
- A new tab, "Pip", has been added. This exports a sequence and annotation
file for use in comparative sequence analysis tools like PipMaker, Vista
or zPicture.
FeatureView
- FeatureView displays the location of all alignments of the selected
feature against the genome. Currently the display works for probe sets,
DNA and protein sequences. You can get to FeatureView from the alignment
tracks in ContigView, from TextView and from the Affymetrix probes in
GeneView.
Availability
The Ensembl FTP site is currently being updated with new copies of
all databases and flatfiles. This should be complete within a day
or so. Your patience is appreciated during this process.
The databases will also be copied to the public MySQL server,
ensembldb.ensembl.org, within the next few days.
The databases included in this release are:
anopheles_gambiae_core_26_2b
anopheles_gambiae_estgene_26_2b
anopheles_gambiae_lite_26_2b
anopheles_gambiae_snp_26_2b
apis_mellifera_core_26_1
caenorhabditis_briggsae_core_26_25
caenorhabditis_briggsae_estgene_26_25
caenorhabditis_elegans_core_26_116a
danio_rerio_core_26_4a
danio_rerio_est_26_4a
danio_rerio_estgene_26_4a
danio_rerio_lite_26_4a
danio_rerio_snp_26_4a
drosophila_melanogaster_core_26_3b
ensembl_compara_26_1
ensembl_go_26_1
ensembl_mart_26_1
fugu_rubripes_core_26_2c
fugu_rubripes_est_26_2c
fugu_rubripes_estgene_26_2c
gallus_gallus_core_26_1c
gallus_gallus_est_26_1c
gallus_gallus_estgene_26_1c
gallus_gallus_lite_26_1c
gallus_gallus_snp_26_1c
homo_sapiens_core_26_35
homo_sapiens_disease_26_35
homo_sapiens_haplotype_26_35
homo_sapiens_lite_26_35
homo_sapiens_snp_26_35
homo_sapiens_vega_26_35
mus_musculus_core_26_33b
mus_musculus_est_26_33b
mus_musculus_estgene_26_33b
mus_musculus_lite_26_33b
mus_musculus_snp_26_33b
pan_troglodytes_core_26_1
rattus_norvegicus_core_26_3d
rattus_norvegicus_est_26_3d
rattus_norvegicus_estgene_26_3d
rattus_norvegicus_lite_26_3d
rattus_norvegicus_snp_26_3d
tetraodon_nigroviridis_core_26_1a
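The database names listed above appear to follow a regular pattern: species, database type, Ensembl release, then a data or assembly version (for example homo_sapiens_core_26_35), with multi-species databases prefixed by "ensembl". The following Python sketch splits a name into those parts. It is an illustration inferred from the names in this announcement only; the function name and field names are hypothetical and are not part of any Ensembl tool.

```python
import re

# Pattern inferred from the names in this release, e.g. mus_musculus_core_26_33b.
SPECIES_DB = re.compile(
    r"^(?P<species>[a-z]+_[a-z]+)_(?P<group>[a-z]+)_(?P<release>\d+)_(?P<version>\d+[a-z]?)$"
)
# Multi-species databases, e.g. ensembl_compara_26_1.
MULTI_DB = re.compile(
    r"^ensembl_(?P<group>[a-z]+)_(?P<release>\d+)_(?P<version>\d+[a-z]?)$"
)


def parse_db_name(name):
    """Split a database name into species, group, release and version."""
    match = MULTI_DB.match(name) or SPECIES_DB.match(name)
    if match is None:
        raise ValueError(f"unrecognised database name: {name}")
    fields = match.groupdict()
    fields.setdefault("species", None)  # multi-species databases have no species
    fields["release"] = int(fields["release"])
    return fields


print(parse_db_name("homo_sapiens_core_26_35"))
print(parse_db_name("ensembl_compara_26_1"))
```

The version is kept as a string because it can carry a letter suffix (33b, 116a) that marks a data update on the same assembly.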
 |
Ensembl pre-release: Cow Genome |
6th Oct 2004 |
Btau_1.0 is a preliminary 3x assembly of the draft genome sequence of cow (Bos taurus), Hereford breed, using whole genome shotgun (WGS) reads from small insert clones. Project coordination, genome sequencing and assembly are provided by the Human Genome Sequencing Center at Baylor College of Medicine.
The N50 size is the length such that 50% of the assembled genome lies in blocks of the N50 size or longer. The N50 of the contigs is 4.2 kb. The N50 of the scaffolds is 13.5 kb. The total length of all contigs is 2.26 Gb. When the gaps between contigs in scaffolds are included, the total span of the assembly is 2.34 Gb.
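The N50 definition above can be illustrated with a short Python sketch. The function name and the toy contig lengths are made up for illustration; they are not taken from the Btau_1.0 assembly.

```python
def n50(lengths):
    """Return the length L such that blocks of length >= L hold at least
    half of the total assembled sequence."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50() needs at least one length")


# Toy example: six contigs totalling 24 bases.
toy_contigs = [2, 2, 3, 4, 5, 8]
print(n50(toy_contigs))  # 5, because 8 + 5 = 13 covers at least half of 24
```

Applying the same function to contig lengths and to scaffold lengths gives the two figures quoted for an assembly, since scaffolds join contigs across gaps and are therefore longer.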
As this is a pre-release, the database does not contain any genes. Subsequent annotations, including the Ensembl genebuild, are ongoing and will be added as soon as they are completed. Future assemblies will include WGS sequences with larger insert sizes, BAC end sequences, BAC sequences, and marker information for a more contiguous assembly, better scaffolding, and chromosome assignment.
 |
Ensembl version 25 released |
4th Oct 2004 |
The Ensembl team are pleased to announce the release of version 25 of
Ensembl. The main data updates in this release are in the Compara
database, which has both new data and some schema changes.
New Data
There are no new assemblies in Ensembl v25.
Chicken
- SNP
- New database from dbSNP 122
The chicken database version has been bumped to 1b, e.g.
gallus_gallus_core_25_1b
Mouse
- Core
- Added FPC BAC map data to misc_features
- Added clone map data to misc_features
- Added accessioned clone map data to misc_features (subset of clone map)
- Added 1Mbase clone set to misc_features (see Chung et al, Genome Research 2004 14:188-196)
The mouse database version has been bumped to 33a, e.g.
mus_musculus_core_25_33a
Multi-species
- Comparative genomics (ensembl_compara_25_1)
- Mouse/Rat BLASTZ_NET from UCSC (as well as BLASTZ_NET_TIGHT generated in house)
- Mouse/Chicken BLASTZ_NET from UCSC (as well as BLASTZ_NET_TIGHT generated in house)
- C.elegans/C.briggsae BLASTZ_GROUP_TIGHT now in
- Tetraodon/Fugu BLASTZ_GROUP_TIGHT now in
- Mouse/Rat Synteny added
- New synteny for Mouse/Chicken
- Data bugs fixed:
- Some cigar_line inversions have been corrected
- Added N, S, dN, dS, LnL, threshold_on_ds values to the human paralogues
- Mart database (ensembl_mart_25_1)
- New build
- Can now filter on Uniprot ID lists and Uniprot Accession lists, as well
as SWProt and SPTrembl. These ID/accessions are also available as
attributes.
Schema changes
Core
Tables gene_stable_id, exon_stable_id, transcript_stable_id & translation_stable_id
- "created_date" and "modified_date" columns added (back). Both have
been set to '2004-09-20 00:00:00'.
Compara
- Deleted Tables
- 'source'
- 'method_link_species' (replaced by the new 'method_link_species_set' table)
- New tables
- 'method_link_species_set'
- 'genomic_align'
- 'genomic_align_group'
- Modified Tables
- 'genomic_align_block' has lost some data, now transferred to the new tables
'genomic_align' and 'genomic_align_group'
- 'member' now has 'source_name' instead of 'source_id'
- 'family' now has 'method_link_species_set_id' instead of 'source_id'
- 'homology' now has 'method_link_species_set_id' instead of
'source_id'
- The changes were necessary
- to enable whole genome multiple alignments storage/querying
- to improve the consistency of the method_link_species_set data (formerly
in method_link_species) with the data actually present in
compara
Full details of the Compara schema and these changes can be found
in CVS in ensembl-compara/docs/schema_doc.html
Availability
The Ensembl FTP site is currently being updated with new copies of
all databases and flatfiles. This should be complete within a day or
so. Your patience is appreciated during this process.
The databases will also be copied to the public MySQL server,
ensembldb.ensembl.org, within the next couple of days.
 |
Ensembl version 24 released |
10th Sep 2004 |
The Ensembl team are pleased to announce the release of version 24 of
the Ensembl website. This release sees the inclusion of two new
species into Ensembl - Honey Bee (Apis mellifera) and a freshwater
pufferfish (Tetraodon nigroviridis), and new assemblies
for Zebrafish (Danio rerio) and Mouse (Mus musculus).
New Species Data
Honeybee (Apis mellifera)
Ensembl 24 presents an annotation of release 1.1 of the Apis
mellifera genome assembly. The honeybee genome sequence was
determined by whole genome shotgun at the Human Genome Sequencing
Center at Baylor College of Medicine.
The honeybee release comprises:
Repeats and low complexity sequence identified with RepeatMasker
(using the Drosophila melanogaster repeat library) and Dust.
Ab initio gene predictions generated with Genscan.
Blast features showing similarities to entries in Swall from a
sensitive search. There are also similarities to the
Drosophila melanogaster proteins and proteins from
Anopheles gambiae (est) gene predictions.
Gene predictions generated from a combination of evidence sources:
honeybee-specific peptides, Drosophila melanogaster-specific
peptides, Anopheles gambiae (est) gene predictions,
honeybee-specific ESTs and UniProt/Swiss-Prot and
UniProt/TrEMBL. This set is incomplete due to a lack of
honeybee-specific evidence.
New genes (compared to the pre-site; there is no new assembly). No stable
ID mapping required.
Danio rerio
This release includes the zebrafish assembly version 4 (Zv4), as
released on the 12th July 2003. This assembly was produced by
integrating the whole genome shotgun assembly with data from the
physical map.
There are new core, EST and EST gene databases, new SNP and lite
databases.
Drosophila melanogaster
Updated core translation table to include stop codon in translation
(protein and transcript sequences unaffected). Gene set still based
on FlyBase release 3.1.
Database renamed to drosophila_melanogaster_core_24_3b
Chicken
New SNP database. New lite database.
Mouse
This release provides a full Ensembl gene build for the NCBI m33
mouse assembly (freeze May 27, 2004). After extensive QC,
principally from the Sanger Institute, most artefactual assembly
issues introduced in build m32 have been removed. The whole genome
N50 is 22.3 Mb. (Build m32 was 17.7 Mb).
New software systems have improved the gene set. More than 85% of
genes from build m32 retain the same Ensembl gene ids in this
release. New gene identifiers were assigned where a many-to-one or
many-to-many mapping of old genes to new gene structures was
detected.
The interpolated mouse map will be included in the next release and
patches to the build will be provided regularly as more detailed
analysis is performed.
New core, est and estgene databases built on the NCBIM33 assembly.
New SNP and lite databases.
Tetraodon nigroviridis
First release of the Tetraodon nigroviridis genome project sequence
data from Genoscope and the Broad Institute (MIT).
The genome assembly was performed using Arachne (Jaffe
D.B. et al. 2003. Gen. Res. 13, 91-96). This site presents
version 7 of the assembly.
Genes were annotated by Genoscope, combining evidence from Geneid,
Genscan, Genewise and Exofish predictions with alignments of
Tetraodon cDNAs to the genome. This was done automatically using
GAZE (Howe K., Chothia T. and Durbin R. 2002. Gen. Res. 12, 1418-27)
with a custom-designed configuration and gene structure model.
The annotation also includes 87 manually curated structures of a
number of HOX and Cytokine genes.
New core database. The assembly, gene set and other annotation
features have been provided by Genoscope.
Data Changes
Compara
- New whole genome alignments for Danio, Mouse, Tetraodon (as these
are new assemblies).
- New homology and family data
- Family rebuilt to incorporate new genomes (Honeybee, Danio,
Mouse, Tetraodon). MUSCLE was used for the family multiple
sequence alignments rather than ClustalW. Families 1, and 17
were unable to run with MUSCLE and were run with ClustalW. All
others were run with MUSCLE. All families have multiple
alignment CIGAR lines defined for their peptide members.
- In-house production of BLASTZ
- mouse vs human
- mouse vs rat (will be updated with UCSC data in the October release)
- mouse vs chicken (will be updated with UCSC data in the October release)
- C. elegans vs C. briggsae
- In-house production of translated BLAT
- mouse vs zebrafish
- mouse vs Fugu rubripes
- mouse vs Tetraodon nigroviridis
- mouse vs chicken
- zebrafish vs Fugu rubripes
- zebrafish vs Tetraodon nigroviridis
- zebrafish vs chicken
- zebrafish vs rat
- Tetraodon nigroviridis vs chicken
- Tetraodon nigroviridis vs Fugu rubripes
- Tetraodon nigroviridis vs human
- Tetraodon nigroviridis vs rat
Schema changes
Core database
- Added 'display_label' column to prediction_transcript.
- Changed indices on align feature tables to improve performance of range queries.
- SQL has been provided to enable schema 23 databases to be patched to
schema 24 without the need to re-download the data.
Compara database
- New tables: peptide_align_feature, analysis
- Changed tables:
- added NOT NULL to dnafrag.dnafrag_type, sequence.length, and
sequence.sequence (backwards compatible)
- homology: added column subtype varchar(40)
- homology_member : added column peptide_align_feature_id int(10)
The homology.subtype is a more detailed classification of the nature of
the homology.
Changes are transparent to both MART and the web.
- Extended Protein/Gene homology algorithm:
Adapted for cases where there are equal 'best' hits (same query peptide
hits multiple target peptides with same score, evalue, %identity,
%positivity). Usually caused by target peptides having identical sequence.
- Extended BRH labeling
New homology description naming to correspond with algorithm
changes. The old naming from schema 23 was BRH and RHS. BRH is
now divided into 2 different naming categories:
- UBRH - (Unique Best Reciprocal Hit) These are BRHs where
there is only one uniquely best hit in both directions, i.e. a
simple 1-to-1 BRH.
- MBRH - (Multiple Best Reciprocal Hit) These are BRHs where
there were multiple but identical best hits in one or both
directions. This can occur when there is perfect protein
sequence duplication of translated genes within a species. In
the old algorithm a random BRH was picked from the equally
best hits; now they are all reported.
- RHS - (Reciprocal Hit based on Synteny): unchanged from schema 23
For descriptions/types UBRH and RHS there are no subtypes defined yet.
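The UBRH/MBRH distinction described above can be sketched in Python as follows. This is a simplified illustration, not the Compara pipeline code: it assumes the equally-best hits for each peptide have already been collected into sets, and the function name and peptide identifiers are hypothetical.

```python
def classify_brh(best_ab, best_ba):
    """Label reciprocal best hits between species A and species B.

    best_ab maps each A peptide to the set of its equally-best B hits;
    best_ba maps each B peptide to the set of its equally-best A hits.
    """
    labels = {}
    for a, targets in best_ab.items():
        for b in targets:
            # A pair is reciprocal only if each member is a best hit of the other.
            if a in best_ba.get(b, set()):
                unique = len(targets) == 1 and len(best_ba[b]) == 1
                labels[(a, b)] = "UBRH" if unique else "MBRH"
    return labels


# Toy data: A2 hits B2 and B3 equally well (e.g. identical duplicated proteins).
best_ab = {"A1": {"B1"}, "A2": {"B2", "B3"}}
best_ba = {"B1": {"A1"}, "B2": {"A2"}, "B3": {"A2"}}
for pair, label in sorted(classify_brh(best_ab, best_ba).items()):
    print(pair, label)
```

Reporting every tied pair, rather than picking one at random, is what makes the result reproducible between runs.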
Website Changes
SNPView
- The SNP neighbourhood image now has a Features drop-down menu,
similar to the menus on ContigView. This menu provides options for
displaying all SNPs, just genotyped SNPs and different transcripts
on the image.
- The selected SNP is highlighted in the neighbourhood image.
Sitemap
Updated sitemap for www.ensembl.org
Drawing code
Code simplified to allow more tracks to be created without writing
additional modules - by just using the drawing code configuration.
Configuration-only drawing code simplified by addition of
"add_tracks" family of calls to the EnsEMBL::Web::UserConfig.
GeneView
Ab-initio predictions now shown on the transcript neighbourhood image.
API
Registry added: a central static hash for storage/retrieval for all
the adaptors. Adaptor calls are backwards compatible and all old
code should work exactly the same as previously but underlying code
will now utilise the Registry. New methods allow easier access to
adaptors via the Registry.
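The idea of a central static store for adaptors can be illustrated with a small Python sketch. This shows the general registry pattern only; the Ensembl API itself is written in Perl, and the class, method names and keys below are hypothetical and do not reproduce its interface.

```python
class Registry:
    """Process-wide lookup table for adaptor objects (illustrative only)."""

    # Class-level dict, shared by all callers: (species, group, kind) -> adaptor.
    _adaptors = {}

    @classmethod
    def add_adaptor(cls, species, group, kind, adaptor):
        cls._adaptors[(species.lower(), group.lower(), kind.lower())] = adaptor

    @classmethod
    def get_adaptor(cls, species, group, kind):
        key = (species.lower(), group.lower(), kind.lower())
        try:
            return cls._adaptors[key]
        except KeyError:
            raise LookupError(f"no adaptor registered for {key}") from None


# Placeholder object standing in for a real adaptor.
gene_adaptor = object()
Registry.add_adaptor("Human", "core", "Gene", gene_adaptor)
print(Registry.get_adaptor("human", "core", "gene") is gene_adaptor)  # True
```

Because the table is held on the class rather than on an instance, code anywhere in the program can retrieve an adaptor without passing database handles around.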
FTP Site Changes
Availability
The Ensembl FTP site is currently being updated with new copies of
all databases and flatfiles. This should be complete within a day
or so. Your patience is appreciated during this process.
The databases will also be copied to the public MySQL server,
ensembldb.ensembl.org, within the next few days.
 |
Ensembl version 23 released |
26th Jul 2004 |
We are pleased to announce the release of Ensembl v23. This release
sees a number of data additions and improvements, including human
paralogues, ncRNAs, new SNPs, and improved Affy mappings. Also in
this release are improvements to the website such as sequence markup
and coverage graphs in BlastView, display of Compara DNA-DNA and gene
homology alignments in AlignView, and configurable markup of gene
sequence from GeneView.
New Data
There are no new assemblies in Ensembl v23.
Human
Human release 23 contains some data types new to Ensembl: ncRNAs
and selenocysteine proteins have been added for the first time.
The ncRNA mappings come from Sean Eddy and Tom Jones, and include
micro-RNA sets. We are investigating with Rfam how to extend ncRNA
annotation to other vertebrate species.
The human dataset now contains 23 selenocysteine proteins with the
correct recoding of the TGA codon to selenocysteine (U). These data
are modelled in the schema as translation attributes.
In addition, the human gene set has had a number of small changes
to improve some otherwise troublesome gene structures and reannotate
some starting codons to more realistic ATG positions than those
submitted from cDNA projects.
- Core
- Improved gene set
- Addition of ncRNAs to gene set
- Addition of selenocysteines to translation model
- Fixed misc_attrib for 32K BACs, changed 'non_ref' to 'name'
- Addition of ENCODE regions to misc_features
- New Affy mappings
- EST
- ESTgene
- Core, EST, ESTgene, Vega:
- Lite
- Updated with the new SNP and gene data
- SNP
- New database from dbSNP 121 & schema change (see schema changes below)
As a result of these changes, the human database version has been
bumped to 34e, e.g. homo_sapiens_core_23_34e
Mouse
- Core
- GO mappings added.
- Duplicate Exon stable IDs fixed
- New Affy mappings
- SNP
- New database from dbSNP 121 & schema change (see schema changes below)
- Lite
- Updated with the new SNP data
The mouse database version has been bumped to 32c,
e.g. mus_musculus_core_23_32c
Rat
- Core
- GO mappings added
- RGD symbols mapped to genes
- New QTLs
- New Affy mappings
- SNP
- New database from dbSNP 121 & schema change (see schema changes below)
- Lite
- Updated with the new SNP data
The rat database version has been bumped to 3c, e.g. rattus_norvegicus_core_23_3c
Zebrafish
The zebrafish database version has been bumped to 3c,
e.g. danio_rerio_core_23_3c
Chicken
- Core
- Add BACend data to misc_features
- SNP
- New database of BGI SNP data
- Lite
- New database with the new SNP data
The chicken database version has been bumped to 1a,
e.g. gallus_gallus_core_23_1a
Mosquito
- Core
- Fixed exon rank of prediction transcripts. They should begin at 1,
but began at 0.
- SNP
- schema change (see schema changes below)
The anopheles database version has not changed.
Multi-species
Comparative genomics (ensembl_compara_23_1)
- Addition of human recent paralogues (see below)
- Complete rebuild of all orthologues
- New protein clustering
- Schema change (see schema changes section below)
The new Compara data release now includes information on recently
duplicated human genes. A raw set of gene homologies was derived by
performing all-against-all blast against a dataset of Ensembl
predicted genes (including mouse and rat genes for outgroups). The
resultant homolog pairs were then filtered and ranked according to
genetic distance and gene coverage. Groups of duplicated human
genes were determined by clustering genes that shared common
reciprocal matches. The genetic distance cut-off that defined the
extent of a gene group was dynamically set as the distance to the
most-related rodent gene. Hence, groups of recently duplicated
genes identified in this manner have phylogenetic meaning and can be
formally defined as being paralogous human genes that have arisen
since the human/rodent divergence.
Mart database (ensembl_mart_23_1)
Schema changes
SNP
- Table Freq
- count column smallint(5) unsigned changed to float
- Table SubSNP
- added a column strand_to_rs tinyint(4)
Compara
- Table method_link_species
- changed index from
UNIQUE method_link_id (method_link_id,species_set,genome_db_id)
to
KEY method_link_id (method_link_id,species_set,genome_db_id)
to allow intra-species data set such as the new human paralogues set.
Website Changes
MultiContigView
- Can now locate homologous regions by gene homology, as well as
the original DNA-DNA alignment method. Orthologues on GeneView are
now linked into MultiContigView.
- Can show links between homologous transcripts (select "Join
transcripts" from the "Compara" menu).
- Simple features (SNPs, Eponine, tRNA, etc) can now be shown on
MultiContigView
GeneView
- Now has a link ("Sequence Markup") that displays the genomic
sequence of the gene, optionally marked-up with exons, SNPs, and
line numbers.
- The gene neighbourhood image at the top of GeneView now has a
Features drop-down menu, similar to the menus on ContigView. This
menu provides options for displaying SNPs and different transcripts
on the image.
BlastView
- Addition of an ncRNA BLAST database for human
- The SETUP page has been extended:
- Query sequences can be loaded via ID (EMBL/Uniprot/RefSeq) e.g. NM_002931
- Searches can be run using one of five pre-set sensitivity levels; exact,
near-exact, near-exact (oligo), local mismatch, and distant homology.
- New features on the DISPLAY page:
- The top 'n' alignments (various sort options) to display can now be
specified.
- There is a new graph that displays the location of matches on the length
of the query sequence.
- For the alignment summary table, there are new options to allow the
alignment location to be displayed in any coordinate system.
- A new page (follow the "[G]" link) shows genomic sequence (with
user-definable coordinate system, orientation, and length of flanking
sequence) with the following features highlighted:
- Exons (Ensembl, VEGA, ESTGene etc)
- SNPs
AlignView
- This page has been extended to display Compara DNA-DNA and gene homology
alignments, in a variety of different formats. These alignments can be
reached via GeneView (for homologues) or from the Compara tracks in
ContigView.
e.g. see here
GeneSNPView
- The table at the bottom of the page now displays the SNP type, AA change, and
AA position for all transcripts of the gene.
FastaView
- Can display data from the core database for mapped Affy identifiers,
including description, locations, and other members of a composite group.
These data are linked to from the mapped Affys on ContigView and GeneView.
ContigView
- New ncRNA tracks for human
- New ENCODE region track
- New tracks for Affy hg_u95b, c, d, and e probesets
Drawing code
- The gene and match tracks have been combined into generic GlyphSets
(generic_gene and generic_match). All genes, and most similarity features
are drawn using these tracks. Adding a new gene/match track is now just a
matter of providing the appropriate configuration.
FTP Changes
- a fasta/RNA directory has been added to hold RNA dumps
- the gene ID has been added to the FASTA headers, e.g.:
>ENSP00000317931 pep:known chromosome:NCBI34:1:801456:802749:-1 gene:ENSG00000177750 transcript:ENST00000326725
>ENST00000327169 cdna:known chromosome:NCBI34:1:407522:408460:1 gene:ENSG00000177799
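The header layout shown in the two examples above can be split into fields with a short Python sketch. The field names are chosen here for illustration and are inferred from those two example lines only; this is not an official parser.

```python
def parse_header(line):
    """Split an Ensembl-style FASTA header into a dict of fields."""
    tokens = line.lstrip(">").split()
    stable_id, seq_info, location = tokens[0], tokens[1], tokens[2]
    seq_type, status = seq_info.split(":", 1)  # e.g. "pep:known"
    coord_system, assembly, region, start, end, strand = location.split(":")
    record = {
        "id": stable_id,
        "seq_type": seq_type,
        "status": status,
        "coord_system": coord_system,
        "assembly": assembly,
        "region": region,
        "start": int(start),
        "end": int(end),
        "strand": int(strand),
    }
    # Remaining tokens are key:value pairs such as gene:... and transcript:...
    for token in tokens[3:]:
        key, value = token.split(":", 1)
        record[key] = value
    return record


header = (">ENSP00000317931 pep:known chromosome:NCBI34:1:801456:802749:-1 "
          "gene:ENSG00000177750 transcript:ENST00000326725")
print(parse_header(header)["gene"])  # ENSG00000177750
```

With the gene ID in the header, peptide and cDNA records can be grouped by gene without a separate database lookup.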
Availability
The Ensembl FTP site is currently being updated with new copies of
all databases and flatfiles. This should be complete within a day
or so. Your patience is appreciated during this process.
The databases will also be copied to the public MySQL server,
ensembldb.ensembl.org, within the next few days.
 |
Ensembl pre-release: Mouse NCBIm33 |
15th Jul 2004 |
We are pleased to announce the release of the NCBI m33 assembly of the
mouse genome.
Build 33 (freeze May 27, 2004) has undergone extensive QC, principally
from the Sanger Institute. Most of the artefactual assembly issues
introduced in build 32 have been removed. The whole genome N50 is 22.3
Mb (compared to 17.7 Mb from Build 32).
Mouse build 33 represents a composite assembly made by merging HTGS
phase 3 sequence with the Mouse Genome Sequence Consortium v3 Whole
Genome Shotgun Assembly (MGSCv3). The assembly was performed by NCBI
using a 'combined' tiling path that was largely created automatically,
but was manually curated in places. This facilitated placing finished
sequence in the context of the MGSCv3. Draft sequence was not included
in this build as the slight increase in coverage one gains by using this
is offset by the increase in build errors.
As this is a pre-release, the database only contains repeat analysis, ab
initio gene predictions, and BLAST comparisons. The Ensembl gene
prediction pipeline is in progress, and no complete Ensembl gene
predictions are available yet. The annotated assembly will be released
on the main Ensembl site (http://www.ensembl.org/),
currently planned for the start of September 2004.
 |
Ensembl pre-release: Dog genome |
14th Jul 2004 |
We are pleased to announce the availability of the assembly for the Dog (Canis
familiaris) genome.
The Dog genome was sequenced by a consortium led by the Broad Institute
and funded by NIH-NHGRI. It is a 7.6x assembly with a super-contig N50 of
41.6 Mb and a contig N50 of 123 kb. For more information, please go to the Broad web site at
http://www.broad.mit.edu/
Ensembl "Pre" sites give early access to assemblies at mainly the DNA
level for searching, along with some gene structures. In this release Dog cDNAs have
been mapped onto the genome. It is expected that a fully featured Ensembl Dog
site will be available this autumn.
 |
Ensembl version 22 released |
3rd Jun 2004 |
We are pleased to announce the release of Ensembl v22. This release
contains the first release of MultiContigView, a new comparative
genomics display which lets you view simultaneously two or more genomes
which share local order (e.g. Human, Mouse, Rat). For example:
click here.
Release 22 also sees the arrival of the first annotated draft chicken
assembly in Ensembl.
New Data
Chicken (Gallus gallus)
Ensembl 22 presents an annotation of the first draft chicken genome
assembly. The chicken genome sequence was determined by whole
genome shotgun at the Genome Sequencing Center at Washington
University, St Louis. The analysis of the chicken sequence involves
an international group of scientists including individuals from the
US, UK, Europe and China.
A slightly modified Ensembl gene build was run for chicken,
resulting in 17784 genes with 185326 exons. Continuing analysis
suggests that about 10% of the gene content of chicken is absent
from this gene build. Around half of this missing content can be
attributed to representation issues in the whole genome shotgun,
probably due to high GC content regions not being well represented.
The other half of the missing set is poorly represented as one or
two exon assemblies (in particular in chromosome Un) which did not
pass Ensembl's quality for gene structures. This QC level has been
set to avoid spurious pseudogene structures being called as genes.
We are working with our colleagues in the chicken community to
analyse these data further and the analysis group expects to submit
a paper this summer in addition to providing improved data
resources.
The chicken release includes:
- Core database
- EST database
- ESTgene database
Human
The human SNP database has additional information about RefSNPs
that are part of the HapMap project. The schema of the database has
changed slightly to accommodate this data (see schema changes below).
Multi-species
- Affy probe hits
These data have been added to the core database as "misc features".
A new method of mapping Affy probe hits to translations has changed some
mappings. This applies to the following databases:
- homo_sapiens_core_22_34d
- mus_musculus_core_22_32b
- rattus_norvegicus_core_22_3b
- danio_rerio_core_22_3b
- RefSeq links
RefSeq mRNA links (NM_ identifiers) have been added for each of
the protein links (NP identifiers) in the following core
databases:
- homo_sapiens_core_22_34d
- mus_musculus_core_22_32b
- rattus_norvegicus_core_22_3b
- drosophila_melanogaster_core_22_3a
- danio_rerio_core_22_3b
- Comparative genomics (ensembl_compara_22_1)
- New chicken genebuild was added to compara for homology and
family analysis
- Honeybee sequence was added to compara and assigned
genome_db_id=12. Honeybee sequence was queried against both
mosquito and fruitfly DNA via translated BLAT. The results are
stored in the genomic_align_block table.
- Family was recalculated so as to include chicken genes and the
latest SWISSPROT and SPTREMBL
- Orthologue analysis was extended so that now all species pairs
have putative orthologues. For cross-phylum analyses (e.g.
mosquito vs C.elegans), only BRH (best reciprocal hit) were
calculated.
- Schema changes (see schema changes below)
- Mart database (ensembl_mart_22_1)
- New build, including chicken
- New table-naming convention
Schema changes
Core
- 2 new tables (translation_attrib & transcript_attrib) added
- these tables will be used for handling exceptional cases in
transcripts/translations, e.g. selenocysteines and RNA edits. Data
to populate these tables is still in preparation.
- misc_set table
- The "code" column was expanded from varchar(15) to varchar(25).
SNP
- RefSNP table
- added column "hapmap_snp" to provide a boolean flag indicating
whether this RefSNP has been typed in the HapMap project or not.
Compara
- member table
- added column "chr_strand" which copies Gene and Transcript
strandedness (1 or -1) from the core databases
- genome_db table
- added column "locator" which stores a locator string which
describes how to get a DBAdaptor for the corresponding core
database. It is used in pipeline production, but set to an empty
string for release.
Website Changes
MultiContigView
- We are pleased to announce a new comparative genomics view for
Ensembl: MultiContigView. This page displays simultaneous
contigviews for multiple species, aligned by compara genomic
alignment blocks. e.g.
click here
- You can enter the page on a location (from one species), along
with the name of one or more additional species. The initial
alignment is selected as the best available between the two
species, interpolated from the DNA align features in the compara
database. MultiContigView is linked to from ContigView (via
Compara DNA align features), and GeneView.
- MultiContigView shows lines connecting the dna alignment blocks
along with the gene predictions for each species.
- Navigation is currently available to:
- Change the location/size of all the sequence regions shown by
- clicking on either the top display or overview
- using the input boxes at the top of "Detailed view"
- using the buttons above the detailed view menu
- Change individual species sequence region
- flip species region
- nudge species region up/down stream
- zoom in/out on region
- realign the sequence to the focus sequence
- Change the primary (focus) species
- any of the species shown can be changed to be the primary
species, by clicking on the "P" button.
- This is the initial release of this page, and we are planning a
number of improvements. We would appreciate any feedback you
might have, good or bad, about this display.
SNPView
- Now displays a link to the International HapMap Project
(http://www.hapmap.org/) if the SNP has been typed in HapMap.
Site Maps
- The site maps have been reworked, and are now generated dynamically
from the available data. This means they are more up-to-date, more
accurate, and more useful for navigating the site. See, e.g.
http://www.ensembl.org/Homo_sapiens/sitemap/
ContigView
- Affy probe track. The Affymetrix probe hits stored in the core
database are now displayed as a track on ContigView. This track can
be switched on and off via the "Features" menu.
- Transcript tracks are now collapsible. Transcript tracks can be
collapsed down into genes by clicking the red "-" symbol to the left
of the track name.
Ab initio predictions
- Ab initio predictions, such as Genscan, SNAP, etc, are now stored in
the same way as Ensembl genes. This means that these ab initio
transcripts can now be viewed in TransView, ProtView and ExonView,
and exported from ExportView, like Ensembl transcripts.
FTP Site Changes
In this release of Ensembl we have slightly reorganised some of the
files on the FTP site. The "golden_path" data directory has been
merged into the fasta/dna directory, and these files have been given
more consistent and useful names, as follows:
<species>.<version>.<seqtype>.<idtype>.<id>.fa.gz
e.g. for Human, the old 'golden_path' files become:
Homo_sapiens.NCBI34.dna.chromosome.1.fa.gz
etc, for unmasked sequence
Homo_sapiens.NCBI34.dna_rm.chromosome.1.fa.gz
etc, for repeatmasked sequence
In addition, we have added all the "non-chromosomal" sequence to
the dumps; i.e., all the sequence which has not been mapped into the
assembled chromosomes: NT contigs, Unknown chromosomes, etc. This
sequence has been grouped into single files, e.g.
Homo_sapiens.NCBI34.dna.nonchromosomal.fa.gz
Homo_sapiens.NCBI34.dna_rm.nonchromosomal.fa.gz
Finally, the files that used to be in the fasta/dna directory,
which were dumps of the assembly at the sequence level, have also been
renamed. For example:
Homo_sapiens.NCBI34.dna.contig.fa.gz
Anopheles_gambiae.MOZ2a.dna.chunk.fa.gz
Fugu_rubripes.FUGU2.dna.scaffold.fa.gz
Note that the sequence "container" may be different in different
species: contigs in human, chunks in Anopheles, scaffolds in Fugu.
These changes mean we provide a more complete set of sequence
dumps, with more useful names.
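The naming scheme above can be handled mechanically. What follows is a minimal Python sketch, not an Ensembl tool: the function and class names are hypothetical, and it assumes that none of the dot-separated fields contains a dot itself (true of the examples listed here, but an assumption in general). Files without an <id> field, such as the nonchromosomal and sequence-level dumps, are parsed with the id set to None.

```python
from typing import NamedTuple, Optional

SUFFIX = ".fa.gz"


class DumpName(NamedTuple):
    """Fields of an FTP sequence dump name (illustrative, not an Ensembl class)."""
    species: str
    version: str
    seqtype: str            # "dna" (unmasked) or "dna_rm" (repeatmasked)
    idtype: str             # e.g. "chromosome", "nonchromosomal", "contig"
    seq_id: Optional[str]   # e.g. "1"; None when the dump is a single file


def parse_dump_name(filename: str) -> DumpName:
    """Split <species>.<version>.<seqtype>.<idtype>[.<id>].fa.gz into fields."""
    if not filename.endswith(SUFFIX):
        raise ValueError(f"expected a name ending in {SUFFIX}: {filename}")
    parts = filename[: -len(SUFFIX)].split(".")
    if len(parts) == 5:
        return DumpName(*parts)
    if len(parts) == 4:
        return DumpName(*parts, None)
    raise ValueError(f"unexpected number of fields in {filename}")


def build_dump_name(species: str, version: str, seqtype: str,
                    idtype: str, seq_id: Optional[str] = None) -> str:
    """Assemble a dump file name from its fields."""
    fields = [species, version, seqtype, idtype]
    if seq_id is not None:
        fields.append(seq_id)
    return ".".join(fields) + SUFFIX
```

For instance, parse_dump_name("Homo_sapiens.NCBI34.dna_rm.chromosome.1.fa.gz") gives seqtype "dna_rm" and seq_id "1".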
Availability
The Ensembl FTP site is currently being updated with new copies of
all databases and flatfiles. This should be complete within a day
or so. Your patience is appreciated during this process.
The databases will also be copied to the public MySQL server,
ensembldb.ensembl.org, within the next few days.
Ensembl version 21 released |
11th May 2004 |
We are pleased to announce the release of Ensembl v21. This release
includes: a major update to the gene build on human (on the same
assembly, NCBI34), a new gene build on Fugu (again without assembly
change), the first appearance of a gene build on chimpanzee, and
improvements to orthologs in key model organisms. We also now provide
an experimental text mining resource and links to papers of disease
association studies from the genetic association database.
New Data
Homo sapiens
Human 21.34d contains a new gene build on the NCBI34 assembly, with
a modest increase in the number of protein coding genes (from 23,531
to 23,758) but a more significant increase in the number of protein
coding transcripts (from 29,802 to 34,091).
This build is best described as a semi-automated build; it takes
advantage of the Vega manual annotation and makes more effective use
of Uniprot. We have tracked a number of statistics for improvement
and have 3,681 more Met-to-STOP protein coding predictions, presumed
to be complete coding sequence (an increase of 18%) and 1,789 more
predictions with both 5' and 3' UTRs and a complete Met-to-STOP
prediction (an increase of 13%). These statistics are actually
underestimates of the improvements as we have also removed around
1,000 3' UTR clones (probably genomic contamination cloning errors)
which gave rise to single-exon open-reading-frames 3' of
well-annotated genes.
More extensive statistics and discussion of future gene building
plans can be found here.
- Core database
- New genebuild
- PAR/Haplotype data
- SNP database
- Lite database
As a result of these changes, the Human database version has been
bumped to 34d, e.g. homo_sapiens_core_21_34d.
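Species database names such as homo_sapiens_core_21_34d can be split into their parts with a short script. This is a hedged Python sketch: reading the name as species, database type, Ensembl release and data version is inferred from the examples on this page, and the class and function names are hypothetical. It does not handle multi-species names such as ensembl_compara_21_1.

```python
import re
from typing import NamedTuple


class SpeciesDb(NamedTuple):
    """Parts of a species database name (illustrative, not an Ensembl class)."""
    species: str
    db_type: str
    release: int
    data_version: str


# Assumed layout: <genus>_<species>_<type>_<release>_<assembly number + optional letter>
_DB_NAME = re.compile(
    r"^(?P<species>[a-z]+_[a-z]+)_(?P<db_type>[a-z]+)"
    r"_(?P<release>\d+)_(?P<data_version>\d+[a-z]?)$"
)


def parse_db_name(name: str) -> SpeciesDb:
    """Split a name like homo_sapiens_core_21_34d into its fields."""
    match = _DB_NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognised species database name: {name}")
    return SpeciesDb(
        match["species"],
        match["db_type"],
        int(match["release"]),
        match["data_version"],
    )
```

Under these assumptions, parse_db_name("homo_sapiens_core_21_34d") gives release 21 and data version "34d".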
Chimpanzee (Pan troglodytes)
This release presents the first annotated chimp assembly in
Ensembl. The genome used is the 4x shotgun assembly from the
chimpanzee Genome Consortium. This was then aligned to human by
UCSC using blastz. The resulting alignments were then used to
transfer human gene structures (Human Build 34d) to chimpanzee.
The transfer process had to cope with many complications in the
alignment which are primarily due to problems in the chimp
sequence. The quality of the sequence itself is a product of the low
coverage of the chimpanzee genome (4x shotgun is very low in
particular for an outbred organism such as chimp: 4x therefore
represents only 2x on each haplotype). This means that there are
missing areas of the chimp genome, misassemblies, misplacements and
small insertions, deletions and substitutions.
With better data it is expected that nearly every human gene has an
almost identical chimpanzee gene. However, the areas which are
significantly different between chimpanzee and human will be
precisely the areas where there is the largest amount of
uncertainty about the alignment.
This gene build attempted to transfer across human genes wherever
possible. For coding regions the following applies:
- If there was an insertion or deletion error that preserved frame
up to 10 amino acids then this was kept.
- If there was an insertion or deletion error which did not
preserve frame up to 10 base pairs then a small "intron" was
added (in effect modelling a sequence insertion or
deletion).
- If a significant part of an exon (or an entire exon) was missing
then the transcript was broken into two separate transcripts
within the same gene.
- If the final transcript had a small ORF at the end of this
process then it was discarded.
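The coding-region rules above can be read as a small decision procedure. The Python sketch below is one interpretation, not the code used in the gene build: it treats "up to 10 amino acids" as 30 bp or fewer, treats "up to 10 base pairs" as an upper bound on frameshifting indels, and assumes every larger discrepancy falls under the "significant part of an exon missing" rule. The final small-ORF filter is not modelled, and the function name and return labels are hypothetical.

```python
def classify_indel(length_bp: int) -> str:
    """Choose an action for an insertion/deletion of length_bp bases in a coding exon.

    Returns one of "keep", "add_intron" or "split_transcript".
    """
    if length_bp <= 0:
        raise ValueError("indel length must be positive")

    preserves_frame = length_bp % 3 == 0

    # Frame-preserving change of up to 10 amino acids (30 bp): keep as is.
    if preserves_frame and length_bp <= 30:
        return "keep"

    # Short frameshifting change (up to 10 bp): model it as a small "intron".
    if not preserves_frame and length_bp <= 10:
        return "add_intron"

    # Assumption: anything larger counts as a missing piece of exon,
    # so the transcript is broken in two within the same gene.
    return "split_transcript"
```

For example, a 6 bp deletion is kept, a 4 bp insertion becomes a small intron, and a 300 bp gap splits the transcript.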
It is anticipated that improvements to this system in the future
will be largely driven by work on the chimpanzee genome sequence.
Fugu rubripes build 2c
Fugu build 2c is a rebuild on the Singapore assembly, which
utilises the considerable increase in cDNA and vertebrate protein
evidence which has accumulated since the original gene build. This
improvement in data has allowed the new build to employ more
stringent quality control, in particular focusing more on longer
gene structures. As a consequence the overall gene number has dropped
to 22,089, while the number of orthologs which we can find to
mammals and Zebrafish has remained reasonably constant. We believe
this is a more useful and complete dataset than the previous build.
- Core database
- EST database
- ESTgene database
Multi-species
- Comparative genomics (ensembl_compara_21_1)
Blastz and BLAT now have scores and percent identities calculated
for each alignment. BLAT is also now available for: Fugu and Danio
with mouse, rat and chicken, and chicken with mouse and rat.
The homologue data have been extended to include human, mouse and
rat with all genomes except chimp. However, deep homologues only
have BRH (best reciprocal hit) and not RHS (Reciprocal hit based on
synteny). This means that Ensembl now contains probable orthologous
relationships between mammalian genes and those from the model
organisms Drosophila, Anopheles and the nematode worms.
Chimp orthologues were derived from whole genome alignments (DWGA)
rather than by BRH.
- Mart database (ensembl_mart_21_1)
- New build, including chimp
Schema changes
Compara
- New sequence table, and the addition of a sequence_id in member
(and removal of member.sequence).
- New version column for peptide/gene versions in the member
table.
- New genebuild column in the genome_db table, to allow for
different gene builds off the same assembly.
Website Changes
ContigView
- Haplotype and Pseudo Autosomal Regions (PAR) are now displayed
on ContigView. The chromosome-level display indicates the
long-range position of these assembly features, while the detailed
view colours these regions differently (red for Haps, blue for
PARs). Clicking on the HAP/PAR track allows you to switch between
the different versions of the assembly.
e.g. http://www.ensembl.org/Homo_sapiens/contigview?l=6:32437499-32637500
- FirstEF track added. This track shows features produced by
FirstEF: a first-exon and promoter prediction program for human
DNA (http://rulai.cshl.org/tools/FirstEF/)
GeneView
- Displays alternative location for genes (e.g. in a PAR/HAP)
- New GeneDAS sources
- HUGO_text : an experimental source of Medline text-mining for HUGO
symbols
- GAD: the Genetic Association Database tracks papers reporting association
studies to around 2,000 disease genes.
Availability
The Ensembl FTP site is currently being updated with new copies of all
databases and flatfiles. This should be complete within a day or so.
Your patience is appreciated during this process.
The databases will also be copied to the public MySQL server,
ensembldb.ensembl.org, within the next few days.
Ensembl version 20 released |
1st Apr 2004 |
The Ensembl Developers are pleased to announce the release of Ensembl 20.
The main focus of this release is an improvement in the underlying
technology of Ensembl, with only minor data updates and visualisation
improvements. One exception is the addition of allele frequency and
genotype data for typed SNPs, which is displayed in SNPView.
This release includes significant changes to the API and schema. Both
the database and API have been extended, with several goals in mind:
- To generalise the way in which assembly and sequence information is
stored in the database so that a variety of genomes could more
easily be accommodated.
- To improve the efficiency of, and reduce some of the complexity of
using the API.
- To allow for the inclusion of some genome anomalies such as pseudo
autosomal regions and structural haplotypes.
The following are the most significant alterations to the Perl API:
- The concept of coordinate systems has been introduced. A
CoordSystem object represents a coordinate system and the
CoordSystemAdaptor can be used to retrieve available coordinate
systems from the database.
- The Slice class has been generalised. Formerly a Slice object was
restricted to representing chromosomal regions. A Slice object may
now represent a region in any coordinate system which is in the
database. Accordingly the SliceAdaptor has also been extended so
that it can be used to obtain Slices in any coordinate system.
- The RawContig, Chromosome and Clone classes are deprecated. These
have been replaced by the generalised Slice class. Similarly
the RawContigAdaptor, CloneAdaptor and ChromosomeAdaptor classes have been
replaced by the SliceAdaptor.
- The SeqFeature class is deprecated. This has been replaced by a
simpler Feature class.
- The Protein and ProteinAdaptor classes have been merged with and
replaced by the Translation and TranslationAdaptor classes.
A considerable amount of effort has been made so that many old programs
will continue to work against the new API, albeit with deprecation
messages.
The database schema has undergone similar changes to the API. The
following are the most significant:
- The chromosome, contig and clone tables have been replaced by a
single general table named seq_region.
- The assembly table has been generalised so that it describes the
relationship between arbitrary coordinate systems rather than just
contigs and clones.
- Feature table columns contig_id, contig_start, contig_end,
contig_strand have been replaced by seq_region_id, seq_region_start,
seq_region_end, seq_region_strand columns respectively. Features
may now be stored with coordinates in any coordinate system which is
in the database and are no longer restricted to the contig
coordinate system.
- Most data has been removed from the denormalised lite database. The
performance of the core database has been improved and most aspects
of the lite database are no longer needed.
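To make the schema change concrete, here is a small self-contained Python sketch using an in-memory SQLite database. It is a simplified illustration and not the Ensembl core schema: only the seq_region table name and the seq_region_* column names come from the notes above, while the coord_system table, the demo_feature table, the remaining columns and all rows are assumptions made for the example.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a toy database where features can sit in any coordinate system."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE coord_system (          -- assumed helper table
            coord_system_id INTEGER PRIMARY KEY,
            name            TEXT NOT NULL
        );
        CREATE TABLE seq_region (            -- replaces chromosome/contig/clone
            seq_region_id   INTEGER PRIMARY KEY,
            name            TEXT NOT NULL,
            coord_system_id INTEGER NOT NULL REFERENCES coord_system(coord_system_id),
            length          INTEGER NOT NULL
        );
        CREATE TABLE demo_feature (          -- hypothetical feature table
            demo_feature_id   INTEGER PRIMARY KEY,
            seq_region_id     INTEGER NOT NULL REFERENCES seq_region(seq_region_id),
            seq_region_start  INTEGER NOT NULL,
            seq_region_end    INTEGER NOT NULL,
            seq_region_strand INTEGER NOT NULL,
            label             TEXT NOT NULL
        );
        -- Made-up rows: one region per coordinate system, one feature on each.
        INSERT INTO coord_system VALUES (1, 'chromosome'), (2, 'contig');
        INSERT INTO seq_region VALUES
            (1, 'demo_chr', 1, 1000000),
            (2, 'demo_contig_1', 2, 50000);
        INSERT INTO demo_feature VALUES
            (1, 1, 2000, 2500, 1, 'on a chromosome'),
            (2, 2, 100, 900, -1, 'on a contig');
        """
    )
    return conn


def features_in(conn: sqlite3.Connection, coord_system: str) -> list:
    """Return (region name, start, end, strand, label) for one coordinate system."""
    rows = conn.execute(
        """
        SELECT sr.name, f.seq_region_start, f.seq_region_end,
               f.seq_region_strand, f.label
        FROM demo_feature AS f
        JOIN seq_region AS sr ON sr.seq_region_id = f.seq_region_id
        JOIN coord_system AS cs ON cs.coord_system_id = sr.coord_system_id
        WHERE cs.name = ?
        ORDER BY f.seq_region_start
        """,
        (coord_system,),
    )
    return rows.fetchall()
```

The point of the design is that a single feature table and a single query serve both chromosome-level and contig-level features; only the coordinate system name passed to features_in changes.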
The hope is that all of these alterations will result in long term
benefits for the users of the schema and API, and that they have helped
to make Ensembl a more powerful and flexible system.
For more details of the changes, see: http://www.ensembl.org/Docs/wiki/html/EnsemblDocs/EnsemblCore.html.
Data
There are no new assemblies or gene builds in release 20, but every species'
core database has been updated to the new schema.
Anopheles gambiae
- Core database
- 343 proteins removed and their genes re-typed as
"bacterial_contamination"
- SNP database
- updated to use chromosome names 2L/2R/3L/3R instead of 2/3
- Lite database
- now only contains SNP data, see below
As a result of these changes, the Anopheles database version has been bumped to
2b, e.g. anopheles_gambiae_core_20_2b
Caenorhabditis briggsae
No data updates, other than port to new schema. Lite database removed.
Caenorhabditis elegans
- Core database
- missing wormpep_protein and pseudogene xrefs fixed
- Lite database
As a result of these changes, the C.elegans database version has been bumped to
116a, e.g. caenorhabditis_elegans_core_20_116a
Danio rerio
- EST database
- ESTgene database
- Lite database
- now only contains SNP data, see below
As a result of these changes, the Danio database version has been bumped to 3b,
e.g. danio_rerio_core_20_3b
Drosophila melanogaster
No data updates, other than port to new schema. Lite database removed.
Fugu rubripes
- Core database
- prediction transcript strands fixed (from 0 to -1)
- Lite database
As a result of these changes, the Fugu database version has been bumped to 2b,
e.g. fugu_rubripes_core_20_2b
Homo sapiens
- SNP database
- Vega database
- Updated with new Vega release data
- new chromosomes: 9 & 10
- updated chromosomes 13 & 20
- Lite database
- now only contains SNP data, see below
As a result of these changes, the Human database version has been bumped to 34c,
e.g. homo_sapiens_core_20_34c
Mus musculus
- Core database
- tmhmm protein features added
- EST database
- ESTgene database
- SNP database
- Lite database
- now only contains SNP data, see below
As a result of these changes, the Mouse database version has been bumped to 32b,
e.g. mus_musculus_core_20_32b
Rattus norvegicus
No data updates, other than port to new schema. Lite database now only
contains SNPs.
Multi-species
- GO database (ensembl_go_20_1)
- Updated to the Feb 2004 release from geneontology.org
- Compara database (ensembl_compara_20_1)
- New protein clustering and families
- New BLAT alignments from UCSC
- Addition of N, S and LnL data and schema
- Storage of dS cut-off value in the db
- Updated blastz alignments to take into account GroupId and LevelId
- Corrected synteny data
- Schema changes (see below)
- Mart database (ensembl_mart_20_1)
Schema changes
Core
As mentioned above, details of the new core schema and API can be found
here.
Lite
As a result of the core schema and API changes, the lite database now only
contains SNP information. If a species has no SNP database (e.g. C.briggsae)
that species will also have no lite database.
SNP
- GTInd table added
- holds individual genotype data
- Strain table
- added columns ssid and ind_id
Compara
- genomic_align_block table
- added columns group_id, level_id and flip_alignment
- homology table added
Website Changes
Nearly all of the work on the webcode for this release has been modifying it to
work with the new v20 API. Other changes include:
ContigView
BLAT tracks added for the new Compara data
SNPView
Where available, the following data have been added to the display:
- Allele frequencies, per SubSNP/population/assay
- Strain genotypes
Availability
The Ensembl FTP site is currently being updated with new copies of all
databases and flatfiles. This should be complete within a day or so.
Your patience is appreciated during this process.
The databases will also be copied to the public MySQL server,
ensembldb.ensembl.org, within the next few days.
Ensembl pre-release: Chicken genome |
2nd Mar 2004 |
We are pleased to announce the release of the first draft assembly of
the chicken genome.
The Red Jungle Fowl, Gallus gallus, is the ancestor of the domestic
chicken (Gallus domesticus) and is the first avian genome released. The
genome, with a haploid size of 1.1 Gigabases, was determined by whole
genome shotgun at the Genome Sequencing Center at Washington University,
St Louis. The analysis of the chicken sequence involves an
international group of scientists including individuals from the US, UK,
Europe and China.
http://pre.ensembl.org/ provides displays of genomes that are in the
process of being annotated. Genomes displayed here have undergone initial BLAST
analysis on the assembly but have not gone through a complete gene
build. These data are provided as an "early access" site for our users.
Repeats have been identified with RepeatMasker (using the latest set of
chicken repeats) and Dust. A set of ab initio gene predictions has been
generated with Genscan and similarities to entries in Swall, Unigene and
EMBL VertRNA identified with Blast. Markers from UniSTS have been placed
onto the assembly with EPCR. Chicken proteins and cDNAs have been mapped
onto the assembly and preliminary gene models created for them.
http://pre.ensembl.org/Gallus_gallus offers browsable chromosomes, and
BLAST and SSAHA functionality will be available within twenty-four
hours.
Chromosomal sequences, both masked and unmasked, can be downloaded from
Washington University (http://genome.wustl.edu/projects/chicken/).
Ensembl version 19.2 released |
13th Feb 2004 |
The Ensembl Developers are pleased to announce the second release of Ensembl on schema 19.
New Data
Mouse NCBI m32 release
Mouse 19.32.2 presents the first Ensembl genebuild on the NCBI
build 32 composite mouse assembly. Chromosomes were assembled using
slightly different algorithms depending upon available mapping
data. Chromosomes 2, 4, 5, 7, 11, 15, 18, 19, X and Y were assembled
using a clone based tiling path, with whole genome shotgun sequence
used to fill gaps. Chromosomes 1, 3, 6, 8, 9, 10, 12, 13, 14, 16
and 17 were assembled using the MGSCv3 as a tiling path and
integrating HTGS sequence (both finished and draft) as
appropriate.
Over 90% of the genes with cDNA evidence maintained stable IDs
between the previous release of Ensembl and m32. This number would
have been higher but for some complex duplications, in particular in
olfactory receptor clusters. In these cases the duplication
structure (some of which is known to be due to artefacts in the
mixed clone and WGS assembly) differs between the two
assemblies in ways that are difficult to reconcile. In these areas we have
had to be conservative and reassign new stable IDs. Elsewhere the
Ensembl gene predictions have mapped very consistently from the old
assembly to the new assembly.
In addition to the new assembly and genebuild, Mouse 19.32.2 contains:
- SNP set from dbSNP 118
- Additional homology data from the Compara database
Zebrafish WGS assembly 3 Release
Zebrafish 19.3.2 features the zebrafish whole genome shotgun
assembly sequence version 3, as released on the 27th November
2003.
In addition to the new assembly and genebuild, Zebrafish 19.3.2
contains:
- SNP set from dbSNP 118
- New mapped EST database
- Additional homology data from the Compara database
C. elegans Wormbase 116 dataset
This release of Ensembl features a direct import of the C.elegans
116 dataset from Wormbase. As usual, no additional genebuild was
carried out, but a series of blast runs was performed to provide
ESTs, SwissProt hits, and other similarity data.
The canonical data for C. elegans is managed at
http://www.wormbase.org/.
Compara
The Compara database has been updated as follows:
- New blastz alignments from UCSC (details at end of mail) and
calculation of synteny based on them, for:
- human/mouse
- human/rat
- mouse/rat
- human/chimp
- New phusion/blastn and calculation of synteny based on them, for:
- New homology data:
- mouse/human -> new dN/dS
- mouse/rat -> new dN/dS
- mouse/fugu
- mouse/zebrafish
- zebrafish/human
- zebrafish/rat
- zebrafish/fugu
- C.elegans/C.briggsae -> new dN/dS
- New family clustering.
New Rat SNPs
The Rattus norvegicus SNP database has been updated to dbSNP 118.
Updated Data
- Fugu rubripes core database has been updated with the addition
of 11782 scaffolds (<2kb) that were previously missing. There is no
new genebuild.
- Homo sapiens core database has been updated with new Affymetrix
xref data. There is no new genebuild.
Schema changes
There are no schema changes in this release.
Website Changes
GeneDAS
With release 19.2 of the web code, GeneDAS and ProteinDas sources
can be added and removed from GeneView and ProtView pages.
See, e.g.
http://www.ensembl.org/Homo_sapiens/geneview?gene=BRCA2
Webcode redesign
Release 19.2 sees the continued rollout of the redesigned
webcode. Updated pages in this release are HaploView and
MarkerView.
HaploView update
The 19.2 webcode incorporates an improved version of HaploView,
contributed by Pedro Gomez-Fabre of GSK. The improvements include
being able to select SNPs involved in a haplotype block and
calculate the minimal set required to distinguish the
haplotypes.
See, e.g.
http://www.ensembl.org/Homo_sapiens/haploview?haplotype=CHR22_A_10
Many thanks to Pedro & GSK for this contribution.
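As an illustration of what "the minimal set required to distinguish the haplotypes" means, here is a brute-force Python sketch. It is not the HaploView implementation, whose algorithm is not described here; the function name and the representation of a haplotype as a string of alleles (one character per SNP) are assumptions made for the example.

```python
from itertools import combinations
from typing import Sequence, Tuple


def minimal_distinguishing_snps(haplotypes: Sequence[str]) -> Tuple[int, ...]:
    """Return the smallest set of SNP positions that tells all haplotypes apart.

    Tries subsets in order of increasing size, so the first hit is minimal.
    Exponential in the number of SNPs; intended for small blocks only.
    """
    if not haplotypes:
        return ()
    n_sites = len(haplotypes[0])
    if any(len(h) != n_sites for h in haplotypes):
        raise ValueError("all haplotypes must cover the same SNPs")
    if len(set(haplotypes)) != len(haplotypes):
        raise ValueError("identical haplotypes cannot be distinguished")

    for size in range(n_sites + 1):
        for sites in combinations(range(n_sites), size):
            # Project every haplotype onto the chosen SNPs.
            projected = {tuple(h[i] for i in sites) for h in haplotypes}
            if len(projected) == len(haplotypes):
                return sites
    raise AssertionError("unreachable: the full SNP set always distinguishes")
```

With the three toy haplotypes "AAG", "ATG" and "CAG", no single SNP separates all three, but positions 0 and 1 together do, so the function returns (0, 1).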
FTP dump change
From this release, pseudogene cDNAs are dumped to a separate FASTA
file from the known and novel cDNA files.
Availability
The Ensembl FTP site is currently being updated with new copies of
all databases and flatfiles. This should be complete within a day
or so. Your patience is appreciated during this process.
The databases will also be copied to the public MySQL server,
ensembldb.ensembl.org, within the next few days.
Please note
The mouse/human, mouse/rat and human/rat blastz alignments stored in
Compara originated from UCSC
mouse NCBIM32 vs human NCBI34
Downloaded from http://genome.ucsc.edu/goldenPath/mm4/vsHg16/
- axtNet directory for "blastz net" track description at UCSC
here
- axtTight directory for "blastz net tight" track description at UCSC
here
mouse NCBI32 vs rat RGSC3.1
Downloaded from http://genome.ucsc.edu/goldenPath/mm4/vsRn3/
- axtNet directory for "blastz net" track description at UCSC
here
- axtTight directory for "blastz net tight" track description at UCSC
here
human NCBI34 vs rat RGSC3.1
Downloaded from http://genome.ucsc.edu/goldenPath/hg16/vsRn3/
- axtNet directory for "blastz net" track description at UCSC
here
- axtTight directory for "blastz net tight" no track description at UCSC (basically the same as for mouse/human axtTight)
human NCBI34 vs chimp BROAD1
Downloaded from http://genome.ucsc.edu/goldenPath/hg16/vsPt0/
- axtBest directory for "blastz recip net" track description at UCSC
here
Ensembl version 19 released |
17th Dec 2003 |
The Ensembl Developers are pleased to announce the release of Ensembl 19.
New Data
Human Build 34a Gene Update
Ensembl v19 contains an updated gene build on the NCBI34 assembly
released in v18. The majority of gene structures from build 34 are
unchanged, but 200-300 are improved predictions produced as a result
of corrections in our gene building software. The build contains 23531
gene predictions with 31609 transcripts, including 1744 pseudogenes.
We are continuing to improve our prediction methods and expect to
produce another NCBI34 rebuild early in 2004.
The 34a version increment indicates a data update without a change in
assembly.
In addition to the updated genebuild, Human 19.34a.1 contains:
- Improved archived identifier data in the core database
- Filtered SNP set from dbSNP 117. The new SNP database contains 276000
fewer SNPs, as these were mapped to the alternate HSC_TCAG assembly, which is
not represented in Ensembl
- Additional homology data from the Compara database
Rat 3a ESTGene update
Due to an uncaught error in database production, the Rat ESTgenes
in the last (v18) release of Ensembl erroneously shared IDs with
Ensembl genes. This error has been corrected for release v19 which
contains updated estgene and lite databases. The Ensembl Mart
database also contains the corrected data.
Compara
The v19 Compara database has been updated using the new human
geneset data. This includes new protein family clustering and new
homologous gene pairs involving human genes.
v19 Compara also adds dN and dS values for
each homologous gene pair. These data are displayed on the GeneView
pages, and can be retrieved via EnsMart.
Schema changes
- Compara
- addition of two columns, "dn" and "ds" in the homology table
These columns are filled only for some paired species,
i.e. human/mouse, human/rat, mouse/rat and elegans/briggsae. For the
other paired species, the ds values obtained were saturated and were
viewed as unreliable. For those cases, the ds and dn columns are NULL.
See the end of this release note for details of how dN and dS
values were calculated.
Website Changes
GeneDAS
This release sees a further extension and incorporation of the DAS
protocol into Ensembl. For a long time DAS has been used to include
external annotations, including user data, on ContigView displays.
This concept has now been applied to the GeneView display, enabling
external annotations on specific genes to be incorporated into the
page.
The first dataset to be included is SwissProt literature references
for genes. This is only the beginning for GeneDAS, and we are working
on providing a number of data sources, as well as enabling users to
display their own gene annotations.
See, e.g. http://www.ensembl.org/Homo_sapiens/geneview?gene=BRCA2
GeneSNPView
GeneSNPView is a new gene-centric SNP display. This page, linked
from GeneView, shows details of SNPs and Pfam domains in, or close to,
the exons of the transcripts of a particular gene. The SNP data
includes their location, alleles, classification, and effects on the
different transcripts.
See, e.g. http://www.ensembl.org/Homo_sapiens/genesnpview?gene=ENSG00000006756
ChromoView data display
A new "tool" page has been added this release. ChromoView enables
the customisable display of feature frequency data against a single
chromosome or a karyotype. The data can be provided by cut-and-paste,
by file upload, or by providing a URL of a datafile. ChromoView
accepts a variety of formats, including simple tab- or
space-delimited, PSL, BED and GFF.
ChromoView is available for all species with an assembly mapped to
chromosomes - e.g. http://www.ensembl.org/Homo_sapiens/chromoview
Please let us know if you find it useful, or have suggestions for
improvements.
GeneView Orthologue display
GeneView now displays further details from the Compara database.
Orthologues now show how they were selected (Best Reciprocal Hit or
Reciprocal Hit based on Synteny around BRH) and give a value for dN/dS
where available.
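The Best Reciprocal Hit criterion can be sketched in a few lines of Python. This is an illustration only: how the "best" hit is scored is not stated here, so the sketch takes the best-hit tables as given, and the gene names in the usage example are made up. The synteny-based variant (RHS) is not covered.

```python
from typing import Dict, List, Tuple


def best_reciprocal_hits(
    best_a_to_b: Dict[str, str],
    best_b_to_a: Dict[str, str],
) -> List[Tuple[str, str]]:
    """Return gene pairs (a, b) where a's best hit is b and b's best hit is a."""
    return sorted(
        (a, b)
        for a, b in best_a_to_b.items()
        if best_b_to_a.get(b) == a
    )
```

For example, if geneA1 and geneA2 both have geneB1 as their best hit but geneB1's best hit is geneA1, only (geneA1, geneB1) is reported.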
ContigView Tracks
- New tracks
- Anopheles Gap track (Decorations menu, Gaps). This track
displays the location of gaps both between and within
scaffolds.
- Updated tracks
- Ensembl & Vega transcript tracks now include the type of
transcript in the labels below each transcript. This can be
switched off by selecting "Concise labels" in the Decorations
menu.
- The BLAST hit track now shows additional match details.
- New DAS sources
- OMIM Disease Phenotypes
- Chimpanzee Contig alignments
- Chimpanzee BAC pairs
- NCBI Gnomon gene predictions
- Rfam RNA gene predictions
Webcode redesign
Release 19 sees the continued rollout of the redesigned webcode (http://www.ensembl.org/Docs/wiki/html/EnsemblDocs/WebcodeRedesign.html).
Updated pages in this release are FASTAView, ExonView, MapView,
AnchorView and SNPView.
MartView updates
- Number of transcripts per gene is available for export and filtering.
- Exons are flagged as constitutive or alternative depending on their
presence or not in all the transcripts of a gene.
- New data available on orthologous gene relationships
- protein stable ids involved in the match
- %identity, %coverage, and %positivity of the match
- In addition for human<->mouse<->rat and elegans<->briggsae
comparisons the dn and ds values are shown where ds < median*2
based on the whole paired species set.
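The constitutive/alternative exon flag in the list above follows directly from its definition: an exon is constitutive if it appears in every transcript of the gene. A minimal Python sketch, with hypothetical function and identifier names and no claim to match the EnsMart build code:

```python
from typing import Dict, Set


def flag_exons(transcripts: Dict[str, Set[str]]) -> Dict[str, str]:
    """Flag each exon of a gene as "constitutive" or "alternative".

    transcripts maps a transcript id to the set of exon ids it contains.
    """
    if not transcripts:
        return {}
    exon_sets = list(transcripts.values())
    all_exons = set().union(*exon_sets)
    shared = set.intersection(*exon_sets)   # exons present in every transcript
    return {
        exon: "constitutive" if exon in shared else "alternative"
        for exon in sorted(all_exons)
    }
```

Note that under this definition every exon of a single-transcript gene is constitutive.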
Availability
The Ensembl FTP site is currently being updated with new copies of
all databases and flatfiles. This should be complete within a day or
so. Your patience is appreciated during this process.
The databases will also be copied to the public MySQL server,
ensembldb.ensembl.org, within the next few days.
As is customary, there will be no release at the beginning of January.
As a result, the next release of Ensembl is scheduled for 2 February
2004.
dN/dS calculation details
dN and dS values were generated using the codeml program included
in the PAML package (http://abacus.gene.ucl.ac.uk/software/paml.html)
(Ref 1). With the parameters we have used, codeml performs pairwise
Maximum Likelihood calculations of dN and dS for each set of
orthologs. We have used the F3x4 codon evolution model (Ref 2). This
takes into account both the bias deriving from the different
probabilities of transition (T<->C and A<->G) versus transversion
(T/C<->A/G) mutations, and the bias due to different nucleotide
frequencies at the three codon positions.
dN and dS values are only provided for orthologues from some
species pairs, namely human/mouse, human/rat, mouse/rat and
elegans/briggsae. Orthologues for other species pairs are too
divergent for dS to be an accurate measure: most synonymous sites
will have been subjected to more than one mutation, so ancestral
changes cannot be reliably inferred from extant sequences (i.e. dS
is saturated).
Orthology predictions for human/mouse, human/rat, mouse/rat and
elegans/briggsae may not be perfect. Incorrect assignments will
manifest as anomalously high dS values. We have, therefore, applied a
cutoff of twice the median value of all dS for each species pair as
the criterion for displaying the dN/dS ratio. Predicted orthology
relationships with dS above this threshold are likely to be
errors. (This filter has been used successfully for the mouse and
rat genome analysis papers and was suggested by Chris Ponting's
group in Oxford).
The dS threshold values used are:
species pair        dS threshold
human/mouse         1.26775
human/rat           1.27342
mouse/rat           0.41278
elegans/briggsae    4.53168
For example, for human/mouse orthologues, the dN/dS ratio is displayed
only when dS<=1.26775
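The filter described above (display dN/dS only when dS is at most twice the median dS for the species pair) can be sketched as below. This is an illustration of the rule only, not the code used to build the release; the function names and the sample values are hypothetical. Note that the elegans/briggsae threshold quoted above is consistent with this rule: 2 x 2.26584 = 4.53168.

```python
from statistics import median


def ds_threshold(ds_values):
    """Cutoff for one species pair: twice the median of all dS values."""
    return 2 * median(ds_values)


def displayable_ratios(orthologues):
    """Return {pair_id: dN/dS} for pairs whose dS passes the cutoff.

    `orthologues` is a list of (pair_id, dN, dS) tuples for one species pair.
    """
    cutoff = ds_threshold([ds for _, _, ds in orthologues])
    return {
        pair_id: dn / ds
        for pair_id, dn, ds in orthologues
        if 0 < ds <= cutoff  # also skips dS == 0, where the ratio is undefined
    }


# Hypothetical orthologue pairs: (id, dN, dS). "d" has an anomalously high dS.
sample = [("a", 0.1, 0.5), ("b", 0.2, 1.0), ("c", 0.3, 1.5), ("d", 0.4, 5.0)]
ratios = displayable_ratios(sample)
print(ds_threshold([ds for _, _, ds in sample]), sorted(ratios))
```

In the sample, the median dS is 1.25, so the cutoff is 2.5 and pair "d" is suppressed as a likely orthology error.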
NB: some may consider the elegans/briggsae dS values to be
saturated (median = 2.26584, much greater than 1). After applying
the dS threshold (4.53168), the median and average of the remaining
set were 1.93962 and 2.062 respectively, comparable with the
data published in the C. briggsae genome paper (Ref 3): average
1.78 (no median was provided in the paper).
- Ref 1: Yang, Z. (1997) "PAML: a program package for phylogenetic analysis by
maximum likelihood." Comput. Appl. Biosci. 13, 555-556.
- Ref 2: Goldman, N. & Yang, Z. (1994) "A codon-based model of nucleotide
substitution for protein-coding DNA sequences." Mol. Biol. Evol. 11,
725-736.
- Ref 3: Stein, LD et al. (2003) "The Genome Sequence of Caenorhabditis
briggsae: A Platform for Comparative Genomics."
PLoS Biol. 1, 166-192.
Chimp Pre Ensembl site available
11th Dec 2003
We have released the recently deposited chimpanzee assembly on the
Pre Ensembl site.
The pre site lets you browse the genome, with initial gene
structures determined by matching Swissprot and RefSeq sequences to the
chimpanzee genome. BLAST and SSAHA searches are also enabled, as is
download of specific regions. However, a full gene build (with peptide
dumps etc) is not yet available.
We hope to release a fully annotated chimpanzee assembly in early 2004,
but the timeline will depend on how the data processing goes. As this
is the first time we have processed such a closely related species pair,
we cannot set a firm schedule.
The sequence of the chimpanzee, Pan troglodytes, was assembled by
NHGRI-funded teams led by Eric Lander, Ph.D., at The Eli & Edythe L. Broad
Institute of the Massachusetts Institute of Technology and Harvard
University, Cambridge, Mass., USA; and Richard K. Wilson, Ph.D., at the Genome
Sequencing Center, Washington University School of Medicine, Saint Louis, USA.
We are using the nominated Arachne assembly from the sequencing group.
From their release notes:
This assembly is a merge of a "modified de novo" (MDN) assembly from the
ARACHNE group, which used the human genome to establish that particular
inserts were not chimeric, and a separate "validated chimp-on-human" (VCH)
assembly which took the chimp reads which align uniquely to human, formed
them into contigs via this alignment, and removed contigs which failed a
two-haplotype consistency check.
Shared reads were used to align the two assemblies to each other, and
where they were consistent, the VCH sequence was transferred to the MDN,
resulting in the released merged assembly.
This release of the assembly has the following properties:
- 361782 contigs, having N50 length 15.7 kb
- contig length total 2.73 Gb, spanning 3.02 Gb
- 37849 supercontigs, having N50 length 8.6 Mb (not including gaps)
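For readers unfamiliar with the N50 statistic quoted above, here is a minimal sketch of the usual definition: the length L such that sequences of length L or longer contain at least half of the total bases. The function name and the toy lengths are illustrative, and the assembler's own reporting may differ in detail (for example in how gaps are counted).

```python
def n50(lengths):
    """Return the N50 of a collection of contig or supercontig lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("n50() needs at least one non-zero length")


# Toy example: total is 290, and the two longest contigs (80 + 70 = 150)
# already cover at least half of it.
toy_contigs = [80, 70, 50, 40, 30, 20]
print(n50(toy_contigs))
```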
Zv3 pre-ensembl
27th Nov 2003
A Zv3 pre-Ensembl site has been released today. It comprises the sequence of the first zebrafish assembly that could be tied to the FPC map. The full Ensembl database, including a gene build, is scheduled for release in February 2004.
Mus musculus pre-release website is now available
21st Nov 2003
The Mus musculus pre-release website shows only the DNA sequence, initial gene placement, repeatmasking and raw BLAST hits on this genome. We are currently
working on providing a fully-featured Ensembl build, including a full gene
build, cross-references to other datasets and a data mining interface. The
annotated assembly will be released on the main Ensembl site
(http://www.ensembl.org/) at the start of Feb 2004.
Browsable annotations include
- hits to EMBL vertebrate mRNA sequences
- hits to GenBank Unigene clusters
- hits to SWISS-PROT, TrEMBL and RefSeq proteins
- ab-initio gene predictions from genscan analyses
Other functionality includes
- BLAST/SSAHA sequence similarity searches for assembly and
genscan predictions
- export of genomic regions in Fasta, EMBL, GenBank, GFF, text and image
formats
Ensembl version 18 released
5th Nov 2003
The Ensembl Developers are pleased to announce the release of Ensembl 18.
New Data
Human Build 34
This release contains the Ensembl gene build on human assembly 34. Build
34 is an update to the finished human genome, with a number of small
improvements in genome sequence on a number of chromosomes.
This release has 22,184 genes comprising 27,941 coding transcripts and
1,853 pseudogenes which are easily confused with genes. It is expected
that there are between 20,000 and 30,000 pseudogenes in the genome;
Ensembl currently annotates only those which confuse the gene prediction
process.
Depending on the precise estimates of total protein gene number, Ensembl
has annotated between 80% and 90% of all protein coding genes. 92% of
genes from the previous build transferred across to the new build, with
the missing 8% of genes predominantly being inappropriate protein coding
genes (e.g. coming from large scale cDNA projects which have a number of
artefactual errors, or from chimeric cDNA clones from cancer cell
lines). However, a very small number (around 1%) were clearly
"correct" genes which were misclassified as artefactual errors. The
next release of Ensembl (due December 15th) will update the data for
this very small number of genes.
In addition to the new assembly and genebuild, Human 18.34.1 contains:
- new EST database
- new Vega annotation on NCBI34
- new ESTgene database
- new SNP data from dbSNP 117 mapped to NCBI34
Rat assembly 3
This release also contains the Ensembl gene build on rat build 3.1.
Build 3.1 is a draft genome assembly covering more than 90% of the
estimated 2.8 Gb genome.
This release has 22,159 genes comprising 28,545 transcripts, and 1,592
pseudogenes. Using an estimate of 26,000-29,000 protein coding genes,
Ensembl has annotated around 75-85% of the total.
76% of genes from the previous build transferred across to the new build;
most of the genes that were missed are due to assembly changes.
In addition to the new assembly and genebuild, Rat 18.3.1 contains:
- new EST database
- new ESTgene database
- new SNP data from dbSNP 117 mapped to the new assembly
Compara
- Updated for the above new species data. In addition, in this release,
the peptide family data has been merged into compara, and no longer has
its own database. This release of compara, then, contains newly
computed peptide families and multiple alignments of family members,
using the latest SWISSPROT, SPTREMBL and EnsEMBL metazoan peptide sets.
GO
- The Ensembl GO database has been updated to the latest, October 2003,
release.
Schema & API Changes
- Compara
- database contains new tables for family data
- compara API includes family objects
- Core
- two columns renamed in identity_xref table
- hit_start -> query_start
- hit_end -> query_end
Website Improvements
Webcode redesign
In order to improve reusability and maintainability, the webteam have
redesigned the general scheme for the webcode. This release sees the
first pages (GeneView, TransView, ProtView) implemented under the new
design. There should be little or no visible difference to the pages;
all the changes are behind the scenes.
For more details on the new design, see:
here
KaryoView data display
A new "tool" page has been added this release. KaryoView enables the
customisable display of user data on a karyotype, and is intended to
provide images for use in presentations, papers, etc. KaryoView will
work on all species with an assembly and a karyotype - e.g.
http://www.ensembl.org/Homo_sapiens/karyoview
Please let us know if you find it useful, or have suggestions for
improvements.
MartView updates
- Genes can now be filtered on whether they have a 5' or 3' UTR
Availability
The Ensembl FTP site is currently being updated with new copies of all
databases and flatfiles. This should be complete within a day. Your
patience is appreciated during this process.
The databases will be copied to the public MySQL server,
ensembldb.ensembl.org, within the next few days.
Ensembl version 17 released
3rd Oct 2003
The Ensembl Developers are pleased to announce the release of Ensembl 17.
Updated Data
This release is mostly a data/schema update. There are no new
assemblies, and only one new gene-build.
Anopheles gambiae
There has been a re-annotation and gene-build of the version 2 assembly.
The new data are:
- New SNAP gene predictions which have been specifically trained
for the Anopheles data set.
- New EST mappings including a new gene build with ESTs.
- Added tRNA using tRNAscan-SE 1.23, CpG islands and tandem repeats.
- An improved set of repeats from RepeatMasker.
- New BlastX hits of SWALL and Drosophila peptides.
- Preliminary version of the transposon submission tool.
Homo sapiens
Mus musculus
Schema Changes
- Core database
- cigar line and associated information added to identity_xref table
- SNP database (mouse/human only)
- additional moltype column added to RefSNP table
Website Improvements
MartView updates
- New Multispecies focus
- Simplified Affymetrix IDs
- Prototype of the command line interface to Mart - MartShell
See the EnsMart homepage for further details of these updates.
Availability
The Ensembl FTP site has been updated with new copies of all databases
and flatfiles.
The databases will be copied to the public MySQL server,
ensembldb.ensembl.org, within the next few days.
Ensembl version 16 released
4th Aug 2003
The Ensembl Developers are pleased to announce the release of Ensembl 16.
Updated Data
This release is mostly a data/schema update. There are no new
assemblies, and no new genebuilds.
Homo sapiens
- New (correct) Karyotype lengths obtained from UCSC and imported (core & Vega)
- New GO evidence tags associated with go xrefs (core)
- Gene descriptions loaded into Vega (Vega)
- Many superfluous features referencing non-existent contigs removed (core & Vega)
Danio rerio
- missing gene descriptions added
- corrected analysis.gff_feature and analysis.gff_source values for some
protein features
Fugu rubripes
- Many incorrect locuslink xrefs are now correctly labelled as SwissProt xrefs
Compara
- New compara database generated with improved blastp parameters and
improved best-reciprocal-hit putative orthologue analysis
Schema Changes
- Core database
- go_xref table definition altered
- type column added to stable_id_event table
Website Improvements
New BLAST/SSAHA interface
A new interface, along the lines of Martview, has been implemented for
this release. The new page provides access to both BLAST and SSAHA, can
search against multiple species, and accepts multiple query sequences.
Under the new interface is a complete redesign of the Blast/SSAHA
submission and parsing code, intended to make the page readily
extensible to new search tools and output formats, and eventually easier
to install locally.
Multiple Protein Alignments
The Ensembl protein family database contains alignments for members of
all but the largest protein families. These can now be exported from the
familyview pages in a variety of formats (FASTA, MSF, ClustalW, etc).
MartView updates
See the EnsMart homepage for details of Mart updates.
Availability
The Ensembl FTP site has been updated with new copies of all databases
and flatfiles. The databases will be copied to kaka within the next few days.
Ensembl version 15 released
4th Jul 2003
The Ensembl Developers are pleased to announce the release of Ensembl 15.
New Data
Finished Human Assembly
Ensembl human release 15.33.1 is built around the NCBI 33 genome
assembly. This is the first 'essentially complete' assembly of the
human genome and covers about 99 percent of the euchromatic sequence
with less than 400 gaps. The average size of contiguous sequence is
now over 27Mb which is more than 300 times longer than the working
draft produced in 2000. The ensembl annotation consists of 23,299
protein coding genes (30,035 transcripts) and, for the first time,
automatic annotation of 962 pseudogenes.
In addition to the new assembly and genebuild, Human 15.33.1 contains:
- new EST database on NCBI33
- new Vega annotation on NCBI33
- new ESTgene database on NCBI33
- new SNP data from dbSNP 115 mapped to NCBI33
- new stable_id archive data
Caenorhabditis elegans
New assembly and annotation imported from wormbase (release 102)
Drosophila melanogaster
New assembly information and annotation imported from flybase (v. 3.1)
Danio rerio
- new interpro data
- three more analyses: trf, eponine, BAC-end matches
- improved xref mapping
- trimmed out low-scoring blast hits
- new SNP database containing data from dbSNP 115
- new stable_id archive data
Compara
Updated for the above new species data
GO
The Ensembl GO database has been updated to the latest, May 2003,
release.
Family
Newly computed peptide families and multiple alignments of family
members, using the latest SWISSPROT, SPTREMBL and EnsEMBL metazoan peptide
sets.
Schema & API Changes
The core database has a new table: go_xref
Website Improvements
New BLAST/SSAHA interface
A new interface, along the lines of Martview, has been implemented for
this release. The new page provides access to both BLAST and SSAHA, can
search against multiple species, and accepts multiple query sequences.
Under the new interface is a complete redesign of the Blast/SSAHA
submission and parsing code, intended to make the page readily
extensible to new search tools and output formats, and eventually easier
to install locally.
Stable ID archive
Ensembl identifiers which are no longer active should now be recognised
by web pages such as geneview.
Multiple Protein Alignments
The Ensembl protein family database contains alignments for members of
all but the largest protein families. These can now be displayed from
familyview pages, using the JalView java multiple alignment editor.
URL-based data upload tracks
This release sees an implementation of UCSC-style URL-based remote data
annotation, allowing custom data tracks to be displayed without the need
to set up or configure a DAS server.
Searchable Mail Archive
The Ensembl mailing lists are now archived at mailarchive.sanger.ac.uk,
with a new search interface.
MartView updates
See the EnsMart homepage for details of Mart updates
Availability
The Ensembl FTP site is currently being updated with new copies of all
databases and flatfiles. This should be complete within a day. Your
patience is appreciated during this process.
The databases are copied to kaka. Please note that this release
also sees a new fasta file naming convention as follows:
<species_name>.<assembly_name>.<file_content_type>.fa
For example the files:
Homo_sapiens.cdna.known.fa
Mus_musculus.latestgp.fa
Ratus_norvegicus.pep.genscan.fa
have now become:
Homo_sapiens.NCBI33.cdna_known.fa
Mus_musculus.NCBIM30.contig.fa
Rattus_norvegicus.RGSC2.pep_genscan.fa
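The new naming convention above can be sketched as a pair of small helpers for building and splitting file names. These functions are illustrative only and are not part of any Ensembl tool; they assume that none of the three fields contains a dot, which holds for the examples listed here but may not hold for every assembly name.

```python
def fasta_name(species_name, assembly_name, file_content_type):
    """Build <species_name>.<assembly_name>.<file_content_type>.fa"""
    return f"{species_name}.{assembly_name}.{file_content_type}.fa"


def parse_fasta_name(filename):
    """Split a new-style name into (species, assembly, content type).

    Assumes exactly three dot-separated fields before the .fa extension.
    """
    parts = filename.split(".")
    if len(parts) != 4 or parts[3] != "fa":
        raise ValueError(f"not a new-style FASTA name: {filename}")
    return tuple(parts[:3])


built = fasta_name("Homo_sapiens", "NCBI33", "cdna_known")
parsed = parse_fasta_name("Rattus_norvegicus.RGSC2.pep_genscan.fa")
print(built, parsed)
```

Under this check, old-style names such as Mus_musculus.latestgp.fa are rejected because they carry only two fields before the extension.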
Ensembl version 14 released
3rd Jun 2003
We are pleased to announce the release of Ensembl 14.
New Data