BioperlOverview



Bioperl Overview

http://bioperl.org/

v.1.1.1 29 Jan 2004, Brian Osborne brian_osborne@cognia.com

v.1.1 24 June 2003, Heikki Lehväslaiho heikki@ebi.ac.uk

v.1.0 23 October 2002, Jason Stajich jason@bioperl.org


What is Bioperl?

  • A collection of Perl modules for processing data for the life sciences
  • A project made up of biologists, bioinformaticians, computer scientists
  • An open source toolkit of building blocks for life sciences applications
  • Supported by Open Bioinformatics Foundation (O|B|F), http://www.open-bio.org/
  • Collaborative online community
  • Current version 1.4

Simple example

#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;
my $in = new Bio::SeqIO(-format => 'genbank',
                        -file => 'AB077698.gb');
while ( my $seq = $in->next_seq ) { 
    print "Sequence length is ", $seq->length(), "\n";
    my $sequence = $seq->seq();
    print "1st ATG is at ", index($sequence,'ATG')+1, "\n";
    print "features are: \n";
    foreach my $f ( $seq->top_SeqFeatures ) {
        printf("  %s %s(%s..%s)\n",
               $f->primary_tag,
               $f->strand < 0 ? 'complement' : '',
               $f->start,
               $f->end);
    }
}

Simple example, output

% perl ex1.pl 
Sequence length is 2701 
1st ATG is at 80 
features are: 
  source (1..2701)
  gene (1..2701)
  5'UTR (1..79)
  CDS (80..1144)
  misc_feature (137..196)
  misc_feature (239..292)
  misc_feature (617..676)
  misc_feature (725..778)
  3'UTR (1145..2659)
  polyA_site (1606..1606)
  polyA_site (2660..2660)

Gotchas

  • Sequences start with 1 in Bioperl (historical reasons). In perl strings, arrays, etc start with 0.
  • When using a module, CaseMatTers.
    • methods are usually lower case with underscores (_).
  • Make sure you know what you're getting back - if you get back an array, don't assign it to a scalar in haste.
my ($val) = $obj->get_array(); # 1st item
my @vals  = $obj->get_array(); # whole list
my $val   = $obj->get_array(); # array length

Where to go for help


Brief Object-Oriented Overview

  • Break problem into components
  • Each component has data (state) and methods
  • Only interact with component through methods
  • Interface versus implementations

Inheritance

  • Objects inherit methods from their parent
  • They inherit state (data members); not explicitly in Perl.
  • Methods can be overridden by children

Interfaces

  • Interfaces can be thought of as an agreement
  • Object will at least look a certain way
  • It is independent of what goes on under the hood

Interfaces and Inheritance in Bioperl

  • What you need to know:
    • Interfaces are declared with trailing 'I' (Bio::PrimarySeqI)
    • Can be assured that at least these methods will be implemented by subclasses
    • Can treat all inheriting objects as if they were the same, i.e. Bio::PrimarySeq, Bio::Seq, Bio::Seq::RichSeq all have basic Bio::PrimarySeqI methods.
  • In Perl, good OO requires good manners.
  • Methods which start with an underscore are considered 'private'
  • Watch out. Perl programmers can cheat.

Examples of modular programming

From Stein et al. Genome Research 2002


Examples of modular programming II


Bioperl components


Sequence components I

  • Sequences
    • Bio::PrimarySeq - Basic sequence operations (aa and nt)
    • Bio::Seq - Supports attached features
    • Bio::Seq::RichSeq - GenBank,EMBL,SwissProt fields
    • Bio::LocatableSeq - subsequences
    • Bio::Seq::Meta - residue annotation

Sequence components II

  • Features
    • Bio::SeqFeature::Generic - Basic Sequence features
    • Bio::SeqFeature::Similarity - Represent similarity info
    • Bio::SeqFeature::FeaturePair - Paired features (HSPs)
  • Sequence Input: Bio::SeqIO
  • Annotation: Bio::Annotation::XX objects

Class diagram (subset)

From Stajich et al. Genome Research 2002


Build a sequence and translate it

#!/usr/bin/perl -w
use strict;
use Bio::PrimarySeq;
my $seq = new Bio::PrimarySeq(-seq => 'ATGGGACCAAGTA', 
                              -display_id => 'example1');
print "seq length is ", $seq->length, "\n";
print "translation is ", $seq->translate()->seq(), "\n";

% perl ex2.pl
seq length is 13
translation is MGPS

Bio::PrimarySeq I

  • Initialization
    • -seq - sequence string
    • -display_id - sequence ID (i.e. >ID DESCRIPTION)
    • -desc - description
    • -accession_number - accession number
    • -alphabet - alphabet (dna,rna,protein)
    • -is_circular - is a circular sequence (boolean)
    • -primary_id - primary ID (like GI number)

Bio::PrimarySeq II

  • Essential methods
    • length - return the length of the sequence
    • seq - get/set the sequence string
    • desc - get/set the description string
    • display_id - get/set the display id string
    • alphabet - get/set the sequence alphabet
    • subseq - get a sub-sequence as a string
    • trunc - get a sub-sequence as an object
  • Methods only for nucleotide sequences
    • translate - get the protein translation
    • revcom - get the reverse complement

Bio::Seq

  • Initialization
    • annotation - Bio::AnnotationCollectionI object
    • features - array ref of Bio::SeqFeatureI objects
    • species - Bio::Species object
  • Essential methods
    • species - get/set the Bio::Species object
    • annotation - get/set the Bio::AnnotationCollectionI object
    • add_SeqFeature - attach a Bio::SeqFeatureI object to Seq
    • flush_SeqFeatures - remove all features
    • top_SeqFeatures - Get all the toplevel features
    • all_SeqFeatures - Get all features flattening those which contain sub-features (rare now).
    • feature_count - Get the number of features attached

Parse a sequence from file

# ex3.pl
use Bio::SeqIO;
my $in = new Bio::SeqIO(-format => 'swiss',
                        -file => 'BOSS_DROME.sp');
my $seq = $in->next_seq();
my $species = $seq->species;
print "Organism name: ", $species->common_name, " ",
    "(", $species->genus, " ", $species->species, ")\n";
my ($ref1) = $seq->annotation->get_Annotations('reference');
print $ref1->authors,"\n";
foreach my $feature ( $seq->top_SeqFeatures ) {
    print $feature->start, " ",$feature->end, " ",
          $feature->primary_tag, "\n";
}

Parse a sequence from file, output

% perl ex3.pl Organism name: Fruit fly (Drosophila melanogaster)
Hart A.C., Kraemer H., van Vactor D.L. Jr., Paidhungat M., Zipursky
1 31 SIGNAL 
32 896 CHAIN 
32 530 DOMAIN 
531 554 TRANSMEM 
570 588 TRANSMEM 
615 637 TRANSMEM 
655 676 TRANSMEM 
693 712 TRANSMEM 
728 748 TRANSMEM 
759 781 TRANSMEM 
782 896 DOMAIN
 ...

Bio::SeqIO

  • Can read sequence from a file or a filehandle
    • special trick to read from a string: use IO::String
  • Initialize
    • -file - filename for input (prepend > for output files)
    • -fh - filehandle for reading or writing
    • -format - format for reading writing
  • Some supported formats:
    • genbank, embl, swiss, fasta, raw, gcg, scf, bsml, game, tab

Read in sequence and write out in different format

# ex4.pl
use Bio::SeqIO;
my $in = new Bio::SeqIO(-format => 'genbank',
                        -file => 'in.gb');
my $out = new Bio::SeqIO(-format => 'fasta',
                         -file =>'>out.fa');
while ( my $seq = $in->next_seq ) {
    next unless $seq->desc =~ /hypothetical/i;
    $out->write_seq($seq);
}

Sequence Features: Bio::SeqFeatureI

  • Basic sequence features - have a location in sequence
  • primary_tag, source_tag, score, frame
  • additional tag/value pairs
  • Subclasses by numerous objects - power of the interface!

Sequence Features: Bio::SeqFeature::Generic

  • Initialize
    • -start, -end, -strand
    • -frame - frame
    • -score - score
    • -tag - hash reference of tag/values
    • -primary - primary tag name
    • -source - source of the feature (e.g. program)
  • Essential methods
    • primary_tag, source_tag, start,end,strand, frame
    • add_tag_value, get_tag_values, remove_tag, has_tag

Locations quandary

  • How to manage features that span more than just start/end
  • Solution: An interface Bio::LocationI, and implementations in Bio::Location
    • Bio::Location::Simple - default: 234, 39^40
    • Bio::Location::Split - multiple locations (join,order)
    • Bio::Location::Fuzzy - (<1..30, 80..>900)
  • Each sequence feature has a location() method to get access to this object.

Create a sequence and a feature

use Bio::Seq;
use Bio::SeqFeature::Generic;
use Bio::SeqIO;
my $seq = Bio::Seq->new(-seq => 'STTDDEVVATGLTAAILGLIATLAILVFIVV',
                        -display_id => 'BOSSfragment',
                        -desc => 'pep frag');
my $f = Bio::SeqFeature::Generic->new(-seq_id => 'BOSSfragment',
                                      -start => 7, -end => 22,
                                      -primary => 'TRANSMEMBRANE',
                                      -source => 'hand_curated',
                                      -tag => {'note' => 'putative transmembrane'});
$seq->add_SeqFeature($f);
my $out = new Bio::SeqIO(-format => 'genbank');
$out->write_seq($seq);

Create a sequence and a feature, output

% perl ex5.pl
LOCUS      BOSSfragment        34 aa         linear           UNK
DEFINITION pep frag
ACCESSION  unknown
FEATURES            Location/Qualifiers
    TRANSMEMBRANE   10..25
                   /note="putative transmembrane"
ORIGIN
       1 tvasttddev vatgltaail gliatlailv fivv
//

Sequence Databases

  • Remote databases
    • GenBank, GenPept, EMBL, SwissProt - Bio::DB::XX
  • Local databases
    • local Fasta - Bio::Index::Fasta, Bio::DB::Fasta
    • local Genbank,EMBL,SwissProt - Bio::Index::XX
    • local alignments - Bio::Index::Blast, Bio::Index::SwissPfam
  • SQL dbs
    • Bio::DB::GFF
    • Bio::DB::BioSeqDatabases (through bioperl-db pkg)

Retrieve sequences from a database

# ex6.pl
use Bio::DB::GenBank;
use Bio::DB::SwissProt;
use Bio::DB::GenPept;
use Bio::DB::EMBL;
use Bio::SeqIO;
my $out = new Bio::SeqIO(-file => ">remote_seqs.embl",
                         -format => 'embl');
my $db = new Bio::DB::SwissProt();
my $seq = $db->get_Seq_by_acc('7LES_DROME');
$out->write_seq($seq);
$db = new Bio::DB::GenBank();
$seq = $db->get_Seq_by_acc('AF012924');
$out->write_seq($seq);
$db = new Bio::DB::GenPept();
$seq = $db->get_Seq_by_acc('CAD35755');
$out->write_seq($seq);

The Open Biological Database Access (OBDA) System

  • cross-platform, database independent
  • implemented in Bioperl, Biopython, Biojava, Bioruby
  • database access controlled by registry file(s)
    • global or user's own
  • the default registry retrieved over the web
  • Database types implemented:
    • flat - Bio::Index
    • biosql
    • biofetch - Bio::DB
  • more: http://www.bioperl.org/HOWTOs/html/OBDA_Access.html

Retrieve sequences using OBDA

# ex7.pl
use Bio::DB::Registry 1.2;# needs bioperl release 1.2.2 or later
my $registry = Bio::DB::Registry->new;
# $registry->services
my $db = $registry->get_database('embl');
# get_Seq_by_{id|acc|version}
my $seq = $db->get_Seq_by_acc("J02231");
print $seq->seq,"\n";

Alignments


Alignment components

  • Pairwise Alignments
    • Bio::SearchIO - Parser
    • Bio::Search::XX - Data Objects
    • Bio::SeqFeature::SimilarityPair
  • Multiple Seq Alignments
    • Bio::AlignIO - Parser
    • Bio::SimpleAlign - Data Object
  • Old Aligment tools
    • Bio::Tools::BPlite - Bio::Tools::Blast
    • Deprecated - Bio::Tools::HMMER

Multiple Sequence Alignments

# usage: convert_aln.pl < in.aln > out.phy
use Bio::AlignIO;
my $in = new Bio::AlignIO(-format => 'clustalw');
my $out = new Bio::AlignIO(-format => 'phylip');
while( my $aln = $in->next_aln ) {
    $out->write_aln($aln);
}

BLAST/FASTA/HMMER Parsing

  • Can be split into 3 components
    1. Result - one per query, associated db stats and run parameters
    2. Hit - Sequence which matches query
    3. HSP - High Scoring Segment Pairs. Components of the Hit which match the query.
  • Corresponding object types in the Bio::Search namespace
  • Implemented for BLAST, FASTA, HMMER

Parse a BLAST & FASTA report

# ex8.pl
use Bio::SearchIO;
use Math::BigFloat;
my $cutoff = Math::BigFloat->new('0.001');
my %files = ( 'blast' => 'BOSS_Ce.BLASTP',
              'fasta' => 'BOSS_Ce.FASTA');
while( my ($format,$file) = each %files ) {
  my $in = new Bio::SearchIO(-format => $format,
                             -file => $file);
  while( my $r = $in->next_result ) {
    print "Query is: ", $r->query_name, " ",
          $r->query_description," ",$r->query_length," aa\n";
    print " Matrix was ", $r->get_parameter('matrix'), "\n";
    while( my $h = $r->next_hit ) {
      last unless Math::BigFloat->new($h->significance) < $cutoff;
      print "Hit is ", $h->name, "\n";
      while( my $hsp = $h->next_hsp ) {
        print " HSP Len is ", $hsp->length('total'), " ",
              " E-value is ", $hsp->evalue, " Bit score ", $hsp->score, " \n",
              " Query loc: ",$hsp->query->start, " ", $hsp->query->end," ",
              " Sbject loc: ",$hsp->hit->start, " ", $hsp->hit->end,"\n";
      }
    }
    print "--\n";
  }
}

Parse a BLAST & FASTA report, output

% perl ex7.pl
Query is: BOSS_DROME Bride of sevenless protein precursor. 896 aa
Matrix was BL50
Hit is F35H10.10
HSP Len is 728 E-value is 6.8e-05 Bit score 197.9
  Query loc: 207 847 Sbject loc: 640 1330
--
Query is: BOSS_DROME Bride of sevenless protein precursor. 896 aa
Matrix was BLOSUM62
Hit is F35H10.10
HSP Len is 315 E-value is 4.9e-11 Bit score 182
  Query loc: 511 813 Sbject loc: 1006 1298
HSP Len is 28 E-value is 1.4e-09 Bit score 39
  Query loc: 508 535 Sbject loc: 427 454
--

Create an HTML version of a report

#!/usr/bin/perl -w
# ex9.pl
use strict;
use Bio::SearchIO;
use Bio::SearchIO::Writer::HTMLResultWriter;
use Math::BigFloat;
my $cutoff = Math::BigFloat->new('0.2');
my $in = new Bio::SearchIO(-format => 'blast',
                           -file => 'BOSS_Ce.BLASTP');
my $writer = new Bio::SearchIO::Writer::HTMLResultWriter;
my $out = new Bio::SearchIO(-writer => $writer,
                            -file => '>BOSS_Ce.BLASTP.html');

Create an HTML version of a report, continued

while( my $result = $in->next_result ) {
    my @keephits;
    my $newresult = new Bio::Search::Result::GenericResult
        (-query_name        => $result->query_name,
         -query_accession   => $result->query_accession,
         -query_description => $result->query_description,
         -query_length      => $result->query_length,
         -database_name     => $result->database_name,
         -database_letters  => $result->database_letters,
         -database_entries  => $result->database_entries,
         -algorithm         => $result->algorithm,
         -algorithm_version => $result->algorithm_version,
        );
    foreach my $param ( $result->available_parameters ) {
        $newresult->add_parameter($param,
                                  $result->get_parameter($param));
    }
    foreach my $stat ( $result->available_statistics ) {
        $newresult->add_statistic($stat,
                                  $result->get_statistic($stat));
    }
    while( my $hit = $result->next_hit ) {
        last if Math::BigFloat->new($hit->significance) > $cutoff;
        $newresult->add_hit($hit);
    }
    $out->write_result($newresult);
}

Other things covered by Bioperl


Parse outputs from various programs

  • Bio::Tools::Results::Sim4
  • Bio::Tools::GFF
  • Bio::Tools::Genscan,MZEF, GRAIL
  • Bio::Tools::Phylo::PAML, Bio::Tools::Phylo::Molphy
  • Bio::Tools::EPCR
  • (recent) Genewise, Genscan, Est2Genome, RepeatMasker

Things I'm skipping (here)

  • In detail: Bio::Annotation objects
  • Bio::Biblio - Bibliographic objects
  • Bio::Tools::CodonTable - represent codon tables
  • Bio::Tools::SeqStats - base-pair freq, dicodon freq, etc
  • Bio::Tools::SeqWords - count n-mer words in a sequence
  • Bio::Tools::RestrictionEnzyme - find restriction enzyme sites and cut sequence
  • Bio::Variation - represent mutations, SNPs, any small variations of sequence

More useful things

  • Bio::Structure - parse/represent protein structure (PDB) data
  • Bio::Tools::Alignment::Consed - process Consed data
  • Bio::TreeIO, Bio::Tree - Phylogenetic Trees
  • Bio::MapIO, Bio::Map - genetic, linkage maps (rudiments)
  • Bio::Coordinate - transformations between coordinate systems

Bioperl can help you run things too

  • Namespace is Bio::Tools::Run
  • In separate CVS module bioperl-run since v1.2
  • EMBOSS, BLAST, TCoffee, Clustalw
  • SoapLab, PISE
  • Remote Blast searches at NCBI (Bio::Tools::Run::RemoteBlast)
  • Phylogenetic tools (PAML, Molphy, PHYLIP)
  • More utilities added on a regular basis for the BioPipe pipeline project, http://www.biopipe.org/

Other project off-shoots and integrations

  • Microarray data and objects (Allen Day)
  • BioSQL - relational db for sequence data (Hilmar Lapp, Chris Mungall, GNF)
  • Biopipe - generic pipeline setup (Elia Stupka, Shawn Hoon, Fugu-Sg)
  • GBrowse - genome browser (Lincoln Stein)

Acknowledgements

  • LOTS of people have made the toolkit what it is today.
  • The Bioperl AUTHORS list in the distro is a starting point.
  • Some people who really got the project started and kept it going: Jason Stajich, Steven Brenner, Ewan Birney, Lincoln Stein, Steve Chervitz, Ian Korf, Chris Dagdigian, Hilmar Lapp, Heikki Lehv√§slaiho, Georg Fuellen & Elia Stupka

The End