PDBeAnalysis tutorial
This
tutorial is designed to introduce you to the PDBe analysis services of PDBeValidate,
PDBeStatistics, PDBeSelect, PDBeResidueStatistics and PDBeDatabase. These
services allow analysis of the molecular structure data (via PDBeValidate) and
also simple statistical analysis of data held on molecular structures at
the entry and residue level.
PDBeValidate : Macromolecular structure validation
PDBeStatistics : Statistical analysis of structure data at the structure level
PDBeSelect : Selection of macromolecular structures
PDBeResidueStatistics : Statistical analysis of structure data at the residue level
PDBeDatabase : SQL query interface to the PDBe database
PDBeEntryResidue : Statistical analysis of residue based data dependent on entry based data
PDBeAtomStatistics : Statistical analysis of structure data at the atom level
You can access
the PDBeAnalysis services from the PDBe front page http://www.ebi.ac.uk/pdbe using the PDBeAnalysis
link within the service list. You will be presented with an intermediate jump
off page for the 5 analysis services, with an entry box for the structure ID
code for validation.
The PDBeValidate
service is designed for the analysis and presentation of information about a
single macromolecule, specifically for the identification of geometrical
outliers. PDBeStatistics, PDBeSelect and PDBeResidueStatistics are designed for
the statistical analysis of a property over all (or a subset of) the PDB archive.
In fact, the analyses are based on the assembly structure of the macromolecules
and not the deposited experimental data. It is therefore hoped that these
services will provide some meaningful insight into basic analysis of macromolecular
structure. PDBeDatabase is a service that allows direct use of our database
using SQL to do select queries, so is
probably of limited use to many people.
PDBeValidate.
From the PDBeAnalysis
jump page http://www.ebi.ac.uk/pdbe-as/PDBeValidate
please enter the ID code 2sod into the entry box on this page and click
the button [validate]. This will take you to the validation page for the
protein superoxide dismutase. This structure has a reputation for being not
quite up to the normal standards of structure determination at this resolution.
The page will
open to show the view on the right. If you scroll to the bottom you will see a
quick help section that will get you started. You can see a summary table at
the top, a graph (the Ramachandran plot)
and a table of outliers. From the summary table at the
top of the view please click the 2sod
link; this will open the atlas page for this structure in a new window. You may
close this window, or hide it; it will not affect the usage of PDBeValidate. You
can see the basic detail of the structure from this page, and there are links
to view assembly, sequence, citation, similarity, ligands and visualization of
this structure. Close or hide this summary page.
You will see that
the default graph is the Ramachandran plot which contains grey, blue and green
plot points. The green points are data from GLY residues and these do not have
to be within the allowed regions marked with contours. The blue points are from
PRO, and should have phi values only around –90 degrees with a variance of 12
degrees. Finally, the grey points are for all other residue types found in
proteins, and should mostly lie within the contoured region. As you can see there
are large numbers of outliers for this structure, and these are shown in the
table. The table is colour coded using black/green and blue data values that
match those of the graph, though there are no PRO outliers. The graph and table
are active, in that clicking points and cells results in something happening.
1) Click any
point in the graph outside a contoured region. You will see the table moves so
that the equivalent point is centred and marked in yellow. What happens if you
click a point that is within a contoured region of the graph, and why ?
2) Click any cell
in the table in the third column headed residue.
You will see the equivalent point in the graph highlighted in magenta.
3) Click the
[Show structure] button at the top right. This will change to [ON] while the
AstexViewer@PDBe-EBI loads, and then to [Hide Structure] when the viewer
appears. Only the structure related commands are available in this version of
the viewer. Clicking either a graph or table value will update the structure to
highlight the picked point. Try picking a few points from the table and graph
and look at the structure. We have noted that there is a problem with the
amount of space on your screen to show the validate page and the viewer, so the
viewer is left as a floating window so you can place this somewhere sensible, though
this is not ideal.
Click the [Hide
structure] button and the viewer will close.
4) Click on the
drop down arrow next to the word [Ramachandran] in the top left control box; a
number of different validation methods will become visible. Select [omega] from
the list. This is the torsion angle CA-C-N-CA for a di-peptide and is normally
found in a trans conformation. There is one cis conformation in this structure,
and it is not a proline. Cis peptides are
usually present for a reason within a structure as they are high energy
conformations, except for proline where the cis and trans conformations are of
similar energy. Therefore, if a cis residue is not proline (marked blue on the
graph and in the table), and not near an active site then the conformation is
likely to be suspect. How would you check the positions of the active sites in
the viewer ?
5) Try a number
of different validation options and look at the analysis. What are they
attempting to show ?
6) This is a more
difficult question, though is a reminder of what you should think about when
analysing structures : Load the structure 1mbd into the validation service and
click on the [bond error] option. You will see a significant number of residues
where the mean bond deviation over a residue is large, why is this ? It is a
1.4A structure so should be quite good. Hint : look at the date of release,
then think how the bond errors are calculated .
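The omega torsion used in step 4 is an ordinary dihedral angle over four atom positions (CA-C-N-CA). The sketch below is a minimal illustration in Python and is not the code used by PDBeValidate; the function names, the toy coordinates and the 30 degree tolerance used to label a peptide as cis or trans are all assumptions made for this example.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def dihedral(p1, p2, p3, p4):
    """Signed torsion angle in degrees for four points, e.g. CA-C-N-CA."""
    b1, b2, b3 = _sub(p2, p1), _sub(p3, p2), _sub(p4, p3)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

def classify_omega(angle, tolerance=30.0):
    """Label a peptide bond; the tolerance is an illustrative assumption."""
    if abs(angle) <= tolerance:
        return "cis"
    if abs(angle) >= 180.0 - tolerance:
        return "trans"
    return "distorted"

# Toy planar coordinates (not taken from any PDB entry).
cis_angle = dihedral((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 1, 0))
trans_angle = dihedral((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, -1, 0))
print(classify_omega(cis_angle), classify_omega(trans_angle))
```

With real data the four points would be the CA and C atoms of residue i and the N and CA atoms of residue i+1.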
Click on the link to PDBeStatistics from the main PDBeAnalysis page. The
service may take a second or two for the first load as the controls are
downloaded and instantiated within your browser; future visits to the page will
result in a much quicker start up. If you scroll to the bottom of the page you will
see a quick help section. When the page opens you will get an empty PDBeStatistics
page, similar to the figure on the right, but without the graph and table. This
service is designed to return distributions based on a
number of data calculated from the deposited structures.
The service has 4
selection boxes along the top of the page, and then on a second line a number
of different controls.
1) The set of 4
controls at the top of the page actually control a single parameter, they have
been split into 4 to provide some clarity and also reduce the length of the
option list of a single combined set.
2) The parameter
option box on the second row of controls provides access to a second
independent parameter for analysis. This control, when set to [select option]
results in only a single parameter analysis, but when an item is selected then
a 2 parameter analysis is created.
3) [Show
distribution] is the action button that draws a graph based on the options set
in (1) and (2). Clicking this button does a query against our database and
produces the graph similar to the figure above.
4) [Get] The Get
action button allows you to download either the raw data from a graph, or the
data in a table so it is available locally. The raw data for a graph
(distribution) is downloaded if no table is present, otherwise the table of
data is downloaded, and in both cases this can be obtained as text, XML, or the
SQL command to generate the data.
5) [PDBeDatabase]
action button will transfer the last query (as SQL) to the PDBeDatabase service
so that it can be adapted.
From the
[Experiment] drop down list at the top left of the page select the Resolution option, then click the [Show
Distribution] button. You will get a counter within the web page that counts up
and suggests that the query will take about 3 seconds; the timing depends on
the database load of our service, and also download time of the data which is
dependent on the bandwidth available to you. On completion of the query you
should see the graph as shown in the figure above, no table will be shown yet.
What is this graph ? Make sure you understand how this
is calculated as this is critical for the understanding.
At the top of the
graph is a double scroll bar. You may slide the two end bars independently
(with a click and drag action), or both together by click/drag of the central
region. Try this, what happens to the graph ? What happens to the Y-ordinate if
you slide to one end so the large peaks are scrolled off the view ?
Click the top of
the 0.9A bar in this resolution distribution. The table appears, and so your page
should look like the image above. This table contains a list of the PDB codes
where the resolution is between 0.85A and 0.95A, and as of 9/11/2006 there are
91 such structures.
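The relationship between a bar and its table can be sketched in a few lines of Python. This is only an illustration of the binning described above (a bar labelled 0.9A collecting values between 0.85A and 0.95A), not the query the service runs; the function names and the (id, resolution) pairs are made up for the example.

```python
import math
from collections import Counter

def bar_centre(value, width=0.1):
    """Centre of the bar a value falls into (bars centred on multiples of width).

    Values lying exactly on a bar edge depend on floating-point rounding.
    """
    index = math.floor(value / width + 0.5)
    return round(index * width, 10)

def distribution(values, width=0.1):
    """Count how many values fall into each bar."""
    return Counter(bar_centre(v, width) for v in values)

def entries_in_bar(entries, centre, width=0.1):
    """IDs whose value lies in the chosen bar; entries is a list of (id, value)."""
    return [entry_id for entry_id, value in entries
            if bar_centre(value, width) == centre]

# Made-up (id, resolution) pairs for illustration only, not real PDB entries.
toy_entries = [("e1", 0.87), ("e2", 0.93), ("e3", 0.96), ("e4", 1.2)]
counts = distribution(value for _, value in toy_entries)
picked = entries_in_bar(toy_entries, 0.9)
```

Here `counts` plays the role of the graph and `picked` the role of the table that appears when a bar is clicked.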
What happens if
you click the id code of a molecule from the table? I.e. select the cell containing
1bxo, the second column, 6th row (the row may be different if new depositions
move this row). What happens if you click on the cell containing the
count number of the ID code 1bxo?
You can see that
the graph can make queries to our DB, and the table can open other services.
Now select
[assembly type] from the assembly options, second item on the option list in
the second drop down list, click the [Show Distribution] button. Why is this
drawn as a pie chart and not a distribution ?
Notice that you
cannot read many of the labels as they overlap, so try using the double scroll
bar to control the view. How many assembly types contain just 0.001 % of the
data and how many entries are there in one of these segments ? (Hint : click
one, what is in the table ?)
Select the
Octameric structures, this is the 6th largest segment, you may need
to slide the max slider along to view this segment easily. How many entries are
there with this assembly type ? How would you get this list as
XML ?
Can you create a
distribution of resolution as a function of R-factor ? What type of plot is
produced? Rotate the graph using a click
and drag action so that it turns flat on the screen and pull the max slider
completely to the left so it is next to the min slider. So how does R-factor
and resolution correlate?
Why is it not
possible to draw a graph of resolution as a function of assembly type?
The PDBeSelect
service is quite similar to PDBeStatistics, but allows you to combine many
different selections to create a multi-filter query against the database. When
you go to the PDBeSelect service (from the PDBeAnalysis jump page)
you should see a page
similar to that on the right without the graph and
table. The four top controls are identical to those from PDBeStatistics, and
allow the same distributions to be created. In the view on the right is the
distribution of resolution.
Select the top
left drop down and pick resolution. Now click the action button [Show
Distribution]. You will now see the distribution of resolution as shown in the
figure on the right (note it has been adjusted with the min-max slider
controls). The next thing is to select a range from this.
1) click a point
on the graph. You will see that an entry is put in the table of
Resolution/min/max where min and max are the same and from the column you
picked from the resolution graph.
2) click and drag
a region, ie click down at about 0.0 on the x-ordinate (resolution), and drag
the mouse to the right until you reach 2.0 on the x-axis. You will see a grey
box appear on the graph that shows you what you have selected. Now the table is
filled with Resolution/min/max where min ~ 0.0 and max ~ 2.0.
You can repeat
this selection as many times as you want, and at any time. If a resolution
entry is already present then this will be updated to the values you select,
otherwise a new entry is put in the table of your selected range.
You are now free
to select other options from the 4 drop down menus. Select the assembly type
(in Assembly) octameric by clicking
the range from the pie chart; and proteins with 2000 or fewer residues (from
Other). If you have a selection range you want to delete from the table you
need only click on the field column in the table of the row you want to
delete. The row will be removed from the table. How many proteins are there
with a resolution between 0 and 2A, that have an octameric assembly and have
2000 or fewer residues ? Hint: what does [How many entries ?] do ? Now download
the list of protein IDs as XML.
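The way the selection table combines its rows can be pictured as a set of filters that must all hold for an entry to be counted. The following Python sketch is an assumption about that behaviour written for illustration; the field names (resolution, assembly_type, residues), the helper functions and the toy entries are hypothetical and do not come from the service.

```python
def set_range(filters, field, low, high):
    """Add a range filter, or overwrite the existing one for the same field."""
    filters[field] = ("range", min(low, high), max(low, high))

def set_value(filters, field, value):
    """Add a categorical filter, e.g. a segment picked from a pie chart."""
    filters[field] = ("equals", value)

def remove_filter(filters, field):
    """Drop a row from the selection table."""
    filters.pop(field, None)

def matches(entry, filters):
    """True only if the entry satisfies every filter in the table."""
    for field, rule in filters.items():
        if rule[0] == "range":
            if not rule[1] <= entry[field] <= rule[2]:
                return False
        elif entry[field] != rule[1]:
            return False
    return True

def how_many(entries, filters):
    return sum(1 for entry in entries if matches(entry, filters))

# Made-up entries for illustration only.
toy = [
    {"id": "e1", "resolution": 1.5, "assembly_type": "octameric", "residues": 1200},
    {"id": "e2", "resolution": 1.95, "assembly_type": "octameric", "residues": 800},
    {"id": "e3", "resolution": 1.1, "assembly_type": "dimeric", "residues": 300},
    {"id": "e4", "resolution": 1.8, "assembly_type": "octameric", "residues": 2600},
]
filters = {}
set_range(filters, "resolution", -0.008, 1.932)  # rough range from dragging
set_range(filters, "resolution", 0.0, 2.0)       # reselecting updates the row
set_value(filters, "assembly_type", "octameric")
set_range(filters, "residues", 0, 2000)
total = how_many(toy, filters)
```

Keeping one filter per field mirrors the behaviour described above, where reselecting a resolution range updates the existing table row rather than adding a second one.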
You may find this
method a little difficult to get exactly what you want, it is really a browser
of data; I have a resolution range between -0.008 and 1.932. You are now going
to fine tune this. Click on the [PDBeDatabase] button and PDBeDatabase will
open with the last query you made.
Therefore make sure the last thing you did from PDBeSelect was to click
the button [How many entries ?].
PDBeDatabase is
used to create queries against the PDBe search schema using SQL. There are some
preset queries you can use, and if you have come to PDBeDatabase from other PDBeAnalysis
services then you will see the type of queries used to create distributions. If
you have come from the PDBeSelect section of the tutorial above then you will
see a view similar to that on the right, so please jump to that section of the tutorial to work with this
query.
The PDBeDatabase
page contains a number of controls; the top left is the simplest and
that just shows the contents of various key tables in
the database, as well as lists of tables, attributes and indexes. Click [Show
table] with the default table summary and you will get a list of all the tables
in the PDBe search database. Note that the data does not come back in any
particular order, since there is no predefined order to a database, but you can
sort the data by clicking the title bar. Notice also that there are limits to
both the number of rows you can return, and the time allowed for a query, and
these limits cannot be raised above 1000 and 600 respectively.
Therefore, since most of our data tables contain more than 1000 rows, you
should use filters to view the data of interest.
Change the table
to show to [Table column] and click [Show table]; you will only get 200 rows back
as this is the default limit. Therefore, change the Row Limit to 1000 and try
again. What happens if you set this to
2000?
The [Get] button
allows you to download any data in the current main table. Essentially it
re-issues the last SQL used again with no row limit constraint. (The time limit
is always present). The data is returned to a new window as text or XML. Why do
you think there is a [count] button in this control box?
Try typing select something into the text box and
submit this query. You will get an Oracle error returned in the table: ORA-00923: FROM keyword not fou. You can see what the rest of the error is by
dragging the title divider bar to the right. All the Oracle errors are returned
as a title bar in the table.
If you have come from PDBeSelect then you will be able to modify the query
to change the resolution range to exactly between 0 and 2A. Edit the text box
so that the string reads entry.res_val
between 0 and 2; you can see that previously I had entry.res_val between –0.0080 and 1.932. (You may need to use the scroll buttons on
the right of the text entry to find the relevant piece of SQL). Click the
button [submit SQL] and the query will be run returning the table with the 158
results (on the 7th of November 2006). The table contains the same
type of data as in PDBeSelect table for the combined filter, but in this case
the column titles are the real attribute names in the database.
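The effect of the edit, and of the row limit, can be tried offline with a small stand-in database. The sketch below uses Python's built-in sqlite3 module rather than Oracle, and a made-up two-column table: only the name entry.res_val comes from the query above, while the id column, the helper functions and the rows are assumptions for illustration. The real PDBe schema is far larger.

```python
import sqlite3

MAX_ROW_LIMIT = 1000  # upper bound on the row limit quoted in this tutorial

def fetch_limited(conn, sql, row_limit=200):
    """Run a select and return at most row_limit rows (capped at MAX_ROW_LIMIT)."""
    limit = max(1, min(row_limit, MAX_ROW_LIMIT))
    return conn.execute(sql).fetchmany(limit)

def count_rows(conn, where_clause):
    """Count matching rows without transferring them."""
    sql = "SELECT COUNT(*) FROM entry WHERE " + where_clause
    return conn.execute(sql).fetchone()[0]

# In-memory stand-in table with made-up rows, not real PDB entries.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entry (id TEXT, res_val REAL)")
conn.executemany(
    "INSERT INTO entry VALUES (?, ?)",
    [("e1", 0.9), ("e2", 1.5), ("e3", 1.95), ("e4", 2.0), ("e5", 2.4)],
)

browsed = count_rows(conn, "entry.res_val BETWEEN -0.008 AND 1.932")
exact = count_rows(conn, "entry.res_val BETWEEN 0 AND 2")
rows = fetch_limited(conn, "SELECT id, res_val FROM entry ORDER BY res_val",
                     row_limit=2000)
```

BETWEEN is inclusive at both ends, so tightening the range from the browsed values to exactly 0 and 2 can change the number of entries returned.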
The page
contains a number of control sets, a number of action buttons, and a graphical
area at the bottom that presents the data. Any data presented in the graph can
be obtained as raw data from the PDBeDatabase service by clicking the [PDBeDatabase]
action button. You can then re-run the SQL. The
top left control set provides
structure-based filters to include/exclude different data. You can select data
based on author name, resolution, date and R-factor. The top right control box
defines the x-ordinate and y-ordinate that will be returned within a graph. The
default is 1 ordinate only of omega values. This control box also allows you to
select the representative set to use in the analysis. In general, similar
protein structures are likely to have very similar residue properties, so you
should use a representative set for any analysis you require. There is still an
option to use no representative set (NONE), but this is not recommended.
The
control box on the second row allows an analysis to be made for 1 or more
residue types. If none are selected, then all residue types are used in the
analysis. The third row contains a control box that limits the secondary
structure type for the residues in the analysis, if none is selected then all
structure types are included.
The [PDBeDatabase]
control button will drop the SQL used to create a graph into PDBeDatabase to
allow modification, or get the raw data.
There are
a number of different control buttons; [submit] is a good starting
point. In fact, just click [submit] with all the controls as they are set. You will get a distribution of omega for all residues in the Dali
representative set. This will take about 1 minute to calculate. We are now
going to do something a little more interesting: find how the
Ramachandran plot changes with
resolution.
Within
the top left filter box, set the resolution range between 0A and 1A range as
shown on the left.
Now change the ordinate selection so that phi is the
x-ordinate and the y-ordinate is set to psi; these are the Ramachandran graph
ordinates. I have selected the WhatIf representative set here.
Now click the
[submit] button. You will get 2D distribution of phi against psi with a sample
size a little under 16,000 data for this high resolution set; quite similar to
the graph at the top of this section. Now go back to the resolution range boxes
and change the range to low=2A and high=3A. This query will take longer as
there are many more data in this resolution range, and a similar graph will
appear with a sample size of about 1 million.
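A 2D distribution of this kind is a count of (phi, psi) pairs per grid cell, and the sample size is the sum over all cells. The Python sketch below illustrates the idea only; the 72 bins per axis (5 degrees each) are an assumption borrowed from the 72 bin distributions mentioned for PDBeAtomStatistics, and the function names and angles are made up.

```python
from collections import Counter

def angle_bin(angle, n_bins=72):
    """Bin index for an angle in degrees; +180 wraps onto the -180 bin."""
    width = 360.0 / n_bins
    return int((angle + 180.0) // width) % n_bins

def phi_psi_counts(pairs, n_bins=72):
    """2D distribution: number of (phi, psi) pairs per (phi_bin, psi_bin) cell."""
    return Counter((angle_bin(phi, n_bins), angle_bin(psi, n_bins))
                   for phi, psi in pairs)

# Made-up angles for illustration only.
toy_pairs = [(-63.0, -42.0), (-61.0, -44.0), (-120.0, 130.0)]
cells = phi_psi_counts(toy_pairs)
sample_size = sum(cells.values())
```

The modulo in `angle_bin` treats the axis as circular, so -180 and +180 degrees land in the same bin.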
You can see that
the service has remembered your old queries, and the main control line shows
which query number you are on, the first query was the simple omega query, the
second query was the high resolution phi/psi and the third was the lower
resolution phi/psi, giving a total of 3 queries, and you are looking at query
number 3. Click the [Previous Query] button and the page will show the high
resolution query, and clicking this again will show the omega query. Notice how
all the form fields are updated to reflect the query, not just the resulting
graph.
Now we want to
calculate a difference plot for high and low resolution Ramachandran data. Make
sure the last 2 queries are the low and high resolution
phi/psi searches; you can use the previous and next control
buttons to check this. Now look at the page to find the difference control box,
and select the normalise check box as
shown on the right; click the [Difference] control button. You should get a
difference plot for the phi and psi ordinates for the two resolution ranges.
1) What did the
normalise check box do? Do a difference plot without the normalise option
checked to see the difference.
2) In the
difference plot, what does the sample size measure? (see the graph title).
3) Use the min
and max scroll bars to zoom in on the zero z-ordinate, and turn the plot so it
looks like a contour plot, what do you see in the alpha helix region?
4) Now create a
difference plot of Ramachandran for ASN and ASP, what do you see when you zoom
in to highlight the near zero z-ordinate ?
5) What happens
if you pick a point on the graph, what do you think is returned ?
PDBeAtomStatistics
is used to look at atomic data based on 1-2 interactions (bonds and
non-bonding), 1-3 interactions (angles) and 1-4 interactions (torsion angles). The
service is based on the 0.5 billion atom data rows within the PDBe database. It
is possible to generate some long queries here, but the default set of filters
provided will generally return results in about 30 seconds to 1 minute.
Note, that due to restrictions of data return size it
is possible to do queries that return a truncated list of results. This is to
protect the client and server side resources from large volumes of data. In
general you will not improve statistics by selecting filters that would return
huge quantities of data. The PDBeAtomStatistics interface is shown in the
figures and it is clear that this service is more complicated than all the
other statistical interfaces and allows a number of different possible queries.
It has filters based on entry data (resolution, date and representative set) as
well as up to 4 atom definitions with sequence dependences. The main query box
contains 11 field boxes for Atom Name, Residue Name and Sequence Offset. Atom
name is the standard atomic name, and to the left is shown SG, the gamma
sulphur atom for cysteine. In the example query only 2 atom names are given
which means the query will return distance
information. Residue name information is optional, and
in the example shows the residue cysteine. It is also possible to supply a comma
separated list of residues within a box to indicate an atom can be any of the
supplied residue list (eg. asp,glu), though in this example SG is only found in
cysteine so this would not be relevant. Finally, the sequence offset is useful
for more general interaction specifications. If we consider the disulphide bond
example we would not normally provide any sequence offset requirement since the
2 atoms within a disulphide do not have a specific sequence dependence; though
see later for a sequence restriction which may be of use. The sequence offset
box is therefore left clear in this example. If we consider the query to look
for the backbone torsion angle omega within proteins then we
can use the query design shown. The atom names are
defined as (ca-c-n-ca), there is no residue specification since we want to look
at all possible residues, and the sequence offset has been filled in with the
values (0-1-1) where the first atom has an assumed value of 0. If we re-write
the definition of the omega torsion as a function of the ith residue
then we obtain ca[i+0]-c[i+0]-n[i+1]-ca[i+1] and it becomes clear that the
sequence offset in this case is the value added to i to define which residue we
are talking about relative to residue i. Notice that the first atom is always
defined as belonging to residue i. The residue names can take a number of
values; so far we have used CYS for the disulphide search and blank for a
general search. A residue value can also be a comma separated list of residues
(eg: asp,glu,asn,gln) or NOT a residue or list of residues (eg: !pro).
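The offset and residue conventions described above can be made concrete with a short Python sketch. It is an illustration of the notation only and not the service's own parser; the function names are hypothetical, and treating a leading ! as negating the whole comma separated list is an assumption based on the wording above.

```python
def atom_definition(atom_names, offsets):
    """Spell out an atom chain; offsets are for atoms 2..n, atom 1 is residue i."""
    if len(offsets) != len(atom_names) - 1:
        raise ValueError("need one offset per atom after the first")
    full_offsets = [0] + list(offsets)
    return "-".join(f"{name}[i{offset:+d}]"
                    for name, offset in zip(atom_names, full_offsets))

def residue_allowed(residue, spec):
    """Check a residue name against a box value: '', 'cys', 'asp,glu' or '!pro'."""
    spec = spec.strip()
    if not spec:
        return True  # blank box: any residue is accepted
    negate = spec.startswith("!")
    names = {name.strip().upper()
             for name in spec.lstrip("!").split(",") if name.strip()}
    return (residue.upper() in names) != negate

omega = atom_definition(["ca", "c", "n", "ca"], [0, 1, 1])
print(omega)
```

The same helper shows why only 11 boxes are needed for 4 atoms: the first atom has no offset box because it is always assigned to residue i.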
In general, 1-2
interactions which are bonded require a sequence offset specification while
non-bonded atom interactions (through space) have no sequence dependence and
should have no value for sequence offset included. Angles and torsion normally
require sequence offset values to return meaningful results.
The next
parameter set has 6 fields as 3 sets of min-max pair values. Distance cutoff
defines the plotted range of values that are recorded for a query. Notice that
distances often require values of 0 for min and < 10 for the maximum value;
whereas angle and torsion queries require values of 0 to 180 and -180 to 180
respectively. Forgetting to change this range is a common reason why an angle
query appears to return nothing. The B-value cutoff is set to a default of min=0.0 and
max = 40.0 A². The B-value is the refined 4th
atomic parameter within crystallography and indicates how well localised and
how well observed the atom is. If you are unclear about the meaning of B-value then it is
recommended you leave the values at the default. The final filter pair is
sequence restriction. These numbers (with no default value) are only of use for
through space 1-2 interactions, i.e. non-bonding or disulphides, where the values
define the minimum and maximum sequence distance between the atoms in the query.
1) Perform an
Omega torsion angle query. Fill in the atom definition first; what do you
notice about the min and max values as you add the 3rd and 4th
atom definition ?
2) How would you
create a search for omega where the residue i+1 is proline only, and then NOT
proline.
Finally, the
difference function block. The difference function allows the presentation of
results as the numerical difference (either normalised or not) between the
current result and the last result. Since all the analysis returns 72 bin
distributions it is possible to calculate a difference function between any 2
result sets, so it is up to the user to actually make a sensible judgement of
whether a difference function is meaningful. The normalise option within the
difference block is used to equate the areas under the curves before
subtraction so data can be compared even where the number of results is not
equivalent.
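The effect of the normalise option can be sketched as follows. This is a minimal Python illustration, not the service's implementation: scaling each distribution to unit area is one way to equate the areas and is an assumption here, as are the function names and the toy 72 bin distributions.

```python
def normalise(counts):
    """Scale a distribution to unit area (the bins sum to 1)."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [count / total for count in counts]

def difference(current, last, normalised=True):
    """Bin-by-bin difference, current result minus last result."""
    if len(current) != len(last):
        raise ValueError("distributions must have the same number of bins")
    if normalised:
        current, last = normalise(current), normalise(last)
    return [c - l for c, l in zip(current, last)]

# Two made-up 72 bin distributions with the same shape but different sizes.
small = [0] * 72
small[0], small[36] = 2, 6
large = [0] * 72
large[0], large[36] = 500, 1500

raw = difference(small, large, normalised=False)
scaled = difference(small, large, normalised=True)
```

In this toy case `raw` is dominated by the difference in sample size, while `scaled` is zero in every bin because the two shapes are identical.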
3) Calculate a
difference function for omega for residues when (i+1 = PRO) and then (i+1 =
!PRO).
4) What do you
notice about the cis/trans distribution; did you expect this ?
There are action
buttons to run a query, look at the previous and next query in the list as well
as clear the fields to the default values. A query will be shown as running by
a counter in the graph window, and the graph can be picked to return a selected
list of atoms in a table. The table of atoms can be picked to either open the
atlas page for the entry containing the atom (selecting the PDB-ID), or open PDBeValidate
(with optional link to the AstexViewer@PDBe-EBI structure viewer).
5) You might like
to see how omega varies as a function of resolution (0 to 1.5A range and 2.0 to
2.5A).
Data saturation
The PDBeAtomStatistics
service has a limitation that stems from its implementation. Due to the
size of the atom data (0.5 billion rows of data) a technical compromise was
made: any one query
may not return more than 10,000 rows, and data is correlated and extracted
from this subset of hits to give a distribution. Therefore, increasing the
return size (by increasing the resolution range) will not improve the
statistics of the results, just make the query longer. A solution to this
limitation is being sought.