SurvCurv - Database for survival and other incident curves


FAQ/Help


General

What is SurvCurv?

SurvCurv stands for 'Survival Curves'. It is a database of survival data, developed at the EBI. These data can be represented in different ways, with the survival curve being a common representation.

How can I cite SurvCurv in a publication?

SurvCurv has been published in:

Ziehm, M., Thornton, J.M.,
Unlocking the potential of survival data for model organisms through a new database and online analysis platform: SurvCurv, Aging Cell, 2013

DOI:10.1111/acel.12121   PubMed: 23826631

If using datasets please refer to the SurvCurv ID and cite the linked original publication if available. Please see also the section on Data Licensing.

How can I reference a SurvCurv entry in a publication?

Please refer to the SurvCurv ID, cite the linked original publication if available, and please also cite the database. This will help us to gain support for the database. Please see also the section on Data Licensing.

Scripting/Batch requests

DO NOT SUBMIT JOBS THROUGH SCRIPTS OR ROBOTS!
If you wish to use this service for a large number of requests, or in an automated way, please contact us through the form.


Data

What are censored observations?

Censored observations are observations that have ended due to reasons other than the normal end points (here natural death). Such reasons could be animals being randomly selected for sample collection, dying of accidents (e.g. caused by flooding, air conditioning failure, etc), escaping, or dying from disease or from atypical causes. These observations, although incomplete, still bear information for survival analysis (animals have survived until the censoring event) and should thus never be omitted.
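
As a minimal sketch of why censored observations still carry information, the Kaplan-Meier estimator (named later in this FAQ as the estimator generally used for survival curves) can be written in a few lines of Python. The function name `kaplan_meier`, the cohort and all values are hypothetical and not taken from SurvCurv.

```python
def kaplan_meier(times, observed):
    """Return [(time, survival)] for every time with at least one death.

    times    -- time of death or of censoring for each animal
    observed -- True if the animal died naturally, False if it was censored
    """
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        # Censored animals stay in the risk set until their censoring time.
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, d in zip(times, observed) if u == t and d)
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
    return curve


# Hypothetical cohort of five flies; the fly recorded at day 20 escaped (censored).
times = [10, 20, 30, 40, 50]
observed = [True, False, True, True, True]

with_censoring = kaplan_meier(times, observed)
# Omitting the censored fly discards the fact that it was alive on day 10.
dropped = kaplan_meier([10, 30, 40, 50], [True, True, True, True])

print(with_censoring[0], dropped[0])
```

In this example the survival estimate after the first death is 0.8 when the censored fly is kept but only 0.75 when it is omitted, so dropping censored animals biases the curve downwards.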

Data Licensing

Access to the web interface of SurvCurv is made under the EBI's Terms of Use. The public SurvCurv data is made available under a Creative Commons Attribution 3.0 Unported License. This license allows others to use, distribute, tweak, and build upon the database, even commercially, without any other restrictions than properly crediting the original work. Please attribute to: SurvCurv Database - http://www.ebi.ac.uk/thornton-srv/databases/SurvCurv/ and cite the scientific article given below. If using individual data sets please also attribute the linked original publication if available.

Publication:
Ziehm, M., Thornton, J.M.,
Unlocking the potential of survival data for model organisms through a new database and online analysis platform: SurvCurv, Aging Cell, 2013

DOI:10.1111/acel.12121   PubMed: 23826631

The following icons used on this webpage are from the Crystal Clear icon set by Everaldo Coelho, as provided through Wikimedia Commons.
The icons are licensed under the GNU Lesser General Public License (LGPL). These icons can be downloaded in a single package at Open Icon Library.
Icons: search, spreadsheet, Statistics, Links, Login, FAQ/Help, Contact, Acknowledgements

The following two icons, also used on this webpage, are derived from Crystal Clear icons and are provided under the LGPL.
Icons: survival analysis, SurvCurv record

What is SurvTab?

SurvTab is a simple tab-separated file format for survival data that we defined. Please find a SurvTab template and an equivalent MS Excel template here.
Please note that the final file for analysis must be tab-delimited. In MS Excel, use "Save as Text (Tab delimited)", which automatically adds the extension ".txt"; in OpenOffice/LibreOffice, use "Save as Text CSV" with the delimiter set to tab, which automatically adds the extension ".csv".

What is the JMP®-like Tab format?

This tab-separated file format for survival data is derived from the commercial statistics software JMP®. A file should follow either the template for JMP®-like Tab format 0 or the template for JMP®-like Tab format 1, in which the indicator encodes death events as zero or one, respectively.

What are the OASIS Tab formats?

These tab-separated file formats (basic and CoxPH) have been defined by the OASIS webservice and are provided here only for compatibility and interoperability reasons.

Can my data be kept private after I submit it?

Yes, all submitted information will be kept private until you release it. It will be visible only to you and the database administrators.

Can I download the complete database?

Please contact us through the form.

Can I download a dataset I submitted to you?

Yes, we are happy to send you the data of datasets you uploaded. Please contact us through the form.

What are these descriptive statistics?

mean: The sum of the values divided by the number of values.
minimum: The smallest occurring value.
lower quartile: The value that cuts off the lowest 25% of the data.
median: The value separating the higher half from the lower half. Unlike the mean, it is typically a value contained in the data set (always so for an odd number of values).
upper quartile: The value that cuts off the highest 25% of the data.
maximum: The largest occurring value.
mode: The most frequently occurring value.
standard deviation: A measure of variability. It is the square root of the variance and, unlike the variance, is expressed in the same units as the data.
variance: A measure of variability. It is less easily interpretable than the standard deviation (see above).
kurtosis: A measure of peakedness. Higher kurtosis means more of the variance is the result of infrequent extreme deviations, as opposed to frequent modestly sized deviations.
skewness: A measure of asymmetry. A value of zero indicates a relatively even distribution on both sides of the mean. [Figure: negative and positive skewness, from Wikimedia Commons]
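
The statistics listed above can be reproduced for a small data set with Python's standard library. This is only a sketch: the lifespans are hypothetical, and the exact conventions SurvCurv uses (sample versus population variance, quartile interpolation, excess versus raw kurtosis) are assumptions here.

```python
import statistics

lifespans = [12, 15, 15, 18, 21, 24, 30]  # hypothetical lifespans in days

n = len(lifespans)
mean = statistics.mean(lifespans)
median = statistics.median(lifespans)
mode = statistics.mode(lifespans)
variance = statistics.variance(lifespans)  # sample variance (n - 1 denominator)
std_dev = statistics.stdev(lifespans)      # square root of the sample variance
lower_q, _, upper_q = statistics.quantiles(lifespans, n=4)

# Moment-based skewness and excess kurtosis (population moments, an assumption).
m2 = sum((x - mean) ** 2 for x in lifespans) / n
m3 = sum((x - mean) ** 3 for x in lifespans) / n
m4 = sum((x - mean) ** 4 for x in lifespans) / n
skewness = m3 / m2 ** 1.5
kurtosis = m4 / m2 ** 2 - 3.0

print(min(lifespans), lower_q, median, upper_q, max(lifespans))
print(mean, mode, std_dev, variance, skewness, kurtosis)
```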


Mathematical Models

What are the mathematical models?

The mathematical models are mathematical descriptions of idealised survival curves of different shape. The models are characterised by a number of parameters that are fitted to the data. The fitted parameter values can be compared between cohorts and can be used to find cohorts with similar characteristics using the “Find similar cohort” function available in the detailed section of each cohort. The meaning of the different parameters is explained under the various models.

What is log(MLE)?

MLE stands for Maximum Likelihood Estimate; the value reported here is the maximised likelihood of the specific model given the observed data. As is common, we present the logarithm of the likelihood. The log(MLE) does not take the number of parameters of the model into account (see AIC and BIC for measures which do). Higher log(MLE) scores indicate a better fit. For more information on Maximum Likelihood Estimation, see for example the Wikipedia page.

What is AIC?

AIC stands for Akaike Information Criterion and measures the relative goodness of fit of a statistical model, taking into account the number of parameters. For the AIC, lower values indicate a better fit relative to the number of parameters. The AIC is useful for comparing different models for the same data, but it gives no absolute quality estimate. This also implies that AIC values for models of different data sets are not comparable. The BIC is a similar measure that penalizes the number of parameters more strongly than the AIC. For more information, see for example the Wikipedia page.

What is BIC?

BIC stands for Bayesian Information Criterion (also called Schwarz criterion) and measures the relative goodness of fit of a statistical model, taking into account the number of parameters. For the BIC, lower values indicate a better fit relative to the number of parameters. The BIC is useful for comparing different models for the same data, but it gives no absolute quality estimate. This also implies that BIC values for models of different data sets are not comparable. The AIC is a similar measure that penalizes the number of parameters less strongly than the BIC. For more information, see for example the Wikipedia page.
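
The relation between log(MLE), AIC and BIC can be sketched with the standard textbook definitions AIC = 2k - 2 log L and BIC = k log(n) - 2 log L, where k is the number of model parameters and n the number of observations. The log-likelihood values below are hypothetical, and which n SurvCurv uses for the BIC is an assumption.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: penalizes parameters by log(n_obs)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical fits of two models to the same cohort of 200 animals:
# model name -> (log(MLE), number of parameters)
fits = {"Gompertz": (-812.4, 2), "Logistic": (-811.9, 3)}
n_obs = 200

for name, (log_lik, k) in fits.items():
    print(name, round(aic(log_lik, k), 1), round(bic(log_lik, k, n_obs), 1))
```

In this made-up case the Logistic model has the slightly higher log(MLE), but both AIC and BIC prefer the Gompertz model because the small gain in likelihood does not outweigh the extra parameter.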

Exponential Survival Model

The exponential survival model assumes that the probability of dying is constant over time. Thus the surviving fraction decreases exponentially over time, which gives the model its name. This is the simplest survival model and it has one free parameter, the time-independent hazard rate a. In general the model does not fit survival in protected environments very well.
$\mathrm{Survival}(t) = e^{-a t}$
$\mathrm{Mortality}(t) = \log(a)$
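
For the exponential model the fit has a closed form: with censored data, the maximum likelihood estimate of a is the number of deaths divided by the total observed time. The sketch below uses that standard result; the function names and the data are hypothetical, and it is not stated here that SurvCurv fits the model this way.

```python
import math


def exponential_log_likelihood(a, times, observed):
    """Log-likelihood of hazard rate a; censored animals contribute survival only."""
    deaths = sum(1 for d in observed if d)
    exposure = sum(times)  # total time under observation
    return deaths * math.log(a) - a * exposure


def fit_exponential(times, observed):
    """Return the maximum likelihood estimate of a and the log-likelihood at it."""
    deaths = sum(1 for d in observed if d)
    a_hat = deaths / sum(times)
    return a_hat, exponential_log_likelihood(a_hat, times, observed)


times = [10, 20, 30, 40, 50]                 # hypothetical days of death or censoring
observed = [True, False, True, True, True]   # False marks a censored animal

a_hat, log_mle = fit_exponential(times, observed)
print(f"a = {a_hat:.4f} per day, log(MLE) = {log_mle:.2f}")
```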

Weibull Survival Model

The Weibull survival model assumes that the probability of dying increases non-exponentially over time. The model has two parameters: the baseline hazard rate a and the rate of increase in mortality with time b.
$\mathrm{Survival}(t) = e^{-(a t)^{b}}$
$\mathrm{Mortality}(t) = \log\left(a^{b} \, b \, t^{b-1}\right)$

Gompertz Survival Model

The Gompertz survival model assumes that the probability of dying increases exponentially over time. The model has two parameters: the baseline hazard rate a and the rate of the exponential increase in mortality with time b. Note that parameter b of the Gompertz model influences the mortality exponentially, while in the Weibull model b enters as a power of time.
$\mathrm{Survival}(t) = e^{-\frac{a}{b}\left(e^{b t} - 1\right)}$
$\mathrm{Mortality}(t) = \log\left(a \, e^{b t}\right)$
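
As a sketch of how the survival and mortality formulas relate, the mortality (hazard) is the negative time derivative of the logarithm of the survival function. The Python code below checks this numerically for the Gompertz model and contrasts its log mortality, which is linear in t, with that of the Weibull model, which is linear in log(t). The parameter values are hypothetical.

```python
import math


def gompertz_survival(t, a, b):
    return math.exp(-a / b * (math.exp(b * t) - 1.0))


def gompertz_log_mortality(t, a, b):
    return math.log(a) + b * t  # log(a * e^(b*t)): a straight line in t


def weibull_log_mortality(t, a, b):
    # log(a^b * b * t^(b-1)): a straight line in log(t)
    return b * math.log(a) + math.log(b) + (b - 1.0) * math.log(t)


a, b = 0.005, 0.1   # hypothetical Gompertz parameters (per day)
t, h = 40.0, 1e-5   # evaluation time and step for the numerical derivative


def log_s(u):
    return math.log(gompertz_survival(u, a, b))


# Central difference approximation of -d/dt log(Survival)
numeric_hazard = -(log_s(t + h) - log_s(t - h)) / (2.0 * h)

print(math.log(numeric_hazard), gompertz_log_mortality(t, a, b))
```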

Gompertz-Makeham Survival Model

The Gompertz-Makeham survival model is an extension of the Gompertz survival model. Makeham's extension adds a time independent term c to the mortality.
$\mathrm{Survival}(t) = e^{-c t - \frac{a}{b}\left(e^{b t} - 1\right)}$
$\mathrm{Mortality}(t) = \log\left(c + a \, e^{b t}\right)$

Logistic Survival Model

The Logistic survival model is a different extension of the Gompertz survival model. This extension allows a deceleration in mortality at old age to be modelled. The model has three parameters: the baseline hazard rate a and the rate of the exponential increase in mortality with time b, analogous to the Gompertz survival model, and the new parameter s, which describes the deceleration at advanced age while at the same time representing the degree of heterogeneity in the population.
$\mathrm{Survival}(t) = \left(1 + s \, \frac{a}{b}\left(e^{b t} - 1\right)\right)^{-1/s}$
$\mathrm{Mortality}(t) = \log\left(\frac{a \, e^{b t}}{1 + s \, \frac{a}{b}\left(e^{b t} - 1\right)}\right)$

Logistic-Makeham Survival Model

The Logistic-Makeham survival model is an extension of the Logistic survival model. Makeham's extension adds a time independent term c to the mortality, in the same way as in the Gompertz-Makeham survival model.
$\mathrm{Survival}(t) = e^{-c t} \left(1 + s \, \frac{a}{b}\left(e^{b t} - 1\right)\right)^{-1/s}$
$\mathrm{Mortality}(t) = \log\left(c + \frac{a \, e^{b t}}{1 + s \, \frac{a}{b}\left(e^{b t} - 1\right)}\right)$
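
The Gompertz, Gompertz-Makeham, Logistic and Logistic-Makeham models form a nested family, which can be sketched with one pair of Python functions: setting c = 0 removes the Makeham term, and letting s approach 0 recovers the Gompertz shape. The function names and parameter values are hypothetical; the mortality is returned on the natural scale, so its logarithm corresponds to the formulas above.

```python
import math


def survival(t, a, b, s=0.0, c=0.0):
    """Logistic-Makeham survival; s = 0 and c = 0 give the Gompertz model."""
    gompertz_term = a / b * (math.exp(b * t) - 1.0)
    if s == 0.0:
        shape = math.exp(-gompertz_term)  # limit of (1 + s*x)^(-1/s) for s -> 0
    else:
        shape = (1.0 + s * gompertz_term) ** (-1.0 / s)
    return math.exp(-c * t) * shape


def mortality(t, a, b, s=0.0, c=0.0):
    """Hazard rate of the Logistic-Makeham model (not log-transformed)."""
    gompertz_term = a / b * (math.exp(b * t) - 1.0)
    return c + a * math.exp(b * t) / (1.0 + s * gompertz_term)


a, b = 0.005, 0.1  # hypothetical parameters (per day)

# With s > 0 the mortality levels off at c + b / s instead of growing without bound.
late_life_plateau = mortality(500.0, a, b, s=0.5, c=0.01)
print(late_life_plateau)
```

The plateau at c + b / s is the deceleration at advanced age that the parameter s describes.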


Plotting Options

Absolute Representations

Survival Curve: This curve shows the percentage of the population alive over time. It is the most common representation of lifespan data. The survival curve is a cumulative representation. [Figure: example of a survival curve]
Death Curve: This curve shows the percentage of the total population that is dead over time. It is the complement of the survival curve and, like the survival curve, a cumulative representation. [Figure: example of a death curve]
Incidence Curve: This curve shows the distribution of death events or incidences over time. This representation is non-cumulative! [Figure: example of an incidence curve]
Mortality Curve: This curve shows the mortality, i.e. the risk of dying at each time point among those still alive, on a log scale. This representation is non-cumulative! [Figure: example of a mortality curve]
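
The four absolute representations can be derived from the same list of lifespans, as the following Python sketch shows. It assumes a hypothetical cohort without censored observations and uses the natural logarithm for the mortality; both are simplifying assumptions.

```python
import math
from collections import Counter

lifespans = [10, 20, 20, 30, 40]  # hypothetical days of death, no censoring
n = len(lifespans)
deaths_at = Counter(lifespans)

alive = n
rows = []  # (time, survival, death, incidence, log mortality)
for t in sorted(deaths_at):
    d = deaths_at[t]
    incidence = d / n        # share of all deaths occurring at t (non-cumulative)
    mortality = d / alive    # risk of dying at t among those still alive
    alive -= d
    survival = alive / n     # cumulative fraction still alive after t
    death = 1.0 - survival   # cumulative fraction dead after t
    rows.append((t, survival, death, incidence, math.log(mortality)))

for row in rows:
    print(row)
```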

Difference Representations

Difference plots are newly defined plots introduced in the SurvCurv publication, specifically representing only the differences between a pair of cohorts, such as control and treatment or female and male. There are four difference plots based on the four absolute representations, each showing the difference instead of the absolute value. Thus, for difference survival curves, for example, positive values indicate a survival advantage of the treatment compared to the control and negative values a disadvantage. Difference plots can be based not only on survival curves but also on mortality curves, showing relative mortality differences. Here, a line below zero would indicate a lower mortality risk in the treatment.
All these difference plots also support the use of mathematical models in addition to or instead of survival data. This can be useful for exploring the differences between cohorts via the corresponding models. Alternatively, the differences between two different mathematical models of the same survival data can be visualised in the same way, highlighting the differences between the models.
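
A difference survival curve can be sketched in Python by evaluating both cohorts on a common time grid and subtracting the control from the treatment. The cohorts are hypothetical and uncensored, and the helper `survival_at` is an illustration, not SurvCurv's implementation.

```python
def survival_at(t, lifespans):
    """Fraction of the cohort still alive after time t (no censoring)."""
    return sum(1 for x in lifespans if x > t) / len(lifespans)


control = [10, 20, 20, 30, 40]     # hypothetical lifespans in days
treatment = [20, 30, 30, 40, 50]

grid = sorted(set(control) | set(treatment))
difference = [
    (t, survival_at(t, treatment) - survival_at(t, control)) for t in grid
]

for t, diff in difference:
    print(t, round(diff, 2))
```

All differences in this example are zero or positive, which corresponds to a survival advantage of the treatment over the whole lifespan.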

What does "connection style" mean?

The connection style refers to the way the gaps between the observations are handled. No matter how often observations are performed, they happen at discrete time points, e.g. daily, while the plot has a continuous axis, resulting in "missing" values, or gaps. The “connection style” defines how this discrepancy is handled:
steps: This option assumes that all gaps are due to (omitted) observations of no events, resulting in a stepping behaviour. Mathematically speaking, this is the "correct" representation for survival curves calculated using the Kaplan-Meier estimator (which is generally used).
lines: This option assumes that the observed events actually happened equally spread between the previous observation and the current one. Thus, this option places straight lines between observations.
lines & points: This option shows both the lines option and the points option together.
points: This option shows just the individual points defined by the observations, instead of any kind of curve.
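
The difference between the "steps" and "lines" styles can be sketched as two ways of reading a value between two observations. The function names are hypothetical, and the sketch assumes that the curve starts at full survival at time 0 and that observation times are strictly increasing.

```python
def step_value(t, points):
    """'steps' style: the last observed value is held until the next observation."""
    value = 1.0
    for time, surv in points:
        if time <= t:
            value = surv
    return value


def line_value(t, points):
    """'lines' style: linear interpolation between neighbouring observations."""
    prev_time, prev_surv = 0.0, 1.0  # assumed starting point of the curve
    for time, surv in points:
        if t <= time:
            frac = (t - prev_time) / (time - prev_time)
            return prev_surv + frac * (surv - prev_surv)
        prev_time, prev_surv = time, surv
    return prev_surv


points = [(10, 0.8), (20, 0.4)]  # hypothetical (day, survival) observations

print(step_value(15, points), line_value(15, points))
```

Halfway between the two observations the "steps" style still reports the earlier value of 0.8, whereas the "lines" style reports a value midway between 0.8 and 0.4.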

What output formats are supported?

Currently five different file formats for the graphical output are supported (SVG, PDF, PS, TIFF and PNG). Each of these formats has specific advantages and common uses. SVG, PDF and PS use vector graphics and are commonly welcomed by scientific journals as image formats, while TIFF and PNG are raster graphics, with PNG being the web-optimized default.
SVG: Scalable Vector Graphics (SVG) is an XML-based vector graphics file format developed by the World Wide Web Consortium (W3C). The graphics are fully scalable and labels are included as extractable text. All major modern web browsers have some degree of support and render SVG images directly (MS Internet Explorer before version 9 does not support SVG natively).
PDF: Portable Document Format (PDF) was originally a proprietary file format of Adobe and has since been released as an open standard. It is a general document format and can include vector graphics, raster graphics and text. The PDF generated here uses vector graphics and embedded text.
PS: PostScript (PS) is a programming language often used as a page description language using vector-based graphics. Many laser printers can directly print PostScript files without any preparation by the computer.
TIFF: Tagged Image File Format (TIFF) is a generic graphics file format for handling images and data within a single file. It is widely supported by imaging, publishing and page layout applications; however, due to its complexity it is less well supported in other applications such as web browsers.
PNG: Portable Network Graphics (PNG) is an ISO standard bitmap file format using lossless compression. It is optimised for transferring images on the internet, not for professional-quality print graphics. PNG is set as the default output format.

How does "smoothing" work?

Here, smoothing is a sliding-window smoothing with a selectable window size. This means that each value is replaced with the average of the value itself, its x neighbours to the left and its x neighbours to the right. Smoothing is not applied to survival or death plots.
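
The sliding-window average described above can be sketched in Python as follows. How SurvCurv treats values near the edges, where fewer than x neighbours exist, is not stated; truncating the window there is an assumption of this sketch.

```python
def smooth(values, x):
    """Replace each value by the mean of itself and up to x neighbours per side."""
    smoothed = []
    for i in range(len(values)):
        # The window is truncated at the edges of the series (assumption).
        window = values[max(0, i - x): i + x + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


incidence = [0, 3, 0, 3, 0]  # hypothetical number of deaths per day
print(smooth(incidence, 1))
```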

What are historical controls?

Historical control data are commonly used in toxicity and carcinogenicity studies in rodents in addition to the parallel control group. They put the study into perspective relative to existing ones and serve as a quality control for establishing the reasonableness of the current control group (see review reference below). We suggest that pooled historical controls would also be a valuable addition to the usual pair of measured control and measured treatment condition in ageing research.
We have defined a few potential historical control groups based on the annotated collection of survival measurements, which can be directly used. You can also define your own historical control using various criteria.

Reference:
Keenan, C., et al. Best practices for use of historical control data of proliferative rodent lesions. Toxicol Pathol. 2009; 37(5):679-693 doi:10.1177/0192623309336154

What does "automatically combine replicates" do?

For each selected cohort, this option searches the database for annotated replicates and uses the observations of all replicates together for the selected analyses. The annotated replicates present in the database are listed in the extended information of each cohort together with other related cohorts.

How to build your own meta-cohort?

This function allows you to combine observations into a virtual cohort according to the criteria you specify and can be used, for example, to create your own historical controls. You start by specifying a name for your virtual cohort and one or more criteria to define which observations to include. Possible criteria include species, strain, gender, study, date and treatment, but also specific cohort IDs separated by commas (ranges can be given as "A-B") or even a search term. You can add multiple virtual cohorts one by one to your plot using the "Add" function.
Please note that virtual cohorts you create are not stored; your definition will be lost when the session ends.
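
To illustrate the ID syntax described above (comma-separated IDs with ranges given as "A-B"), the following Python sketch expands such a specification into a list of cohort IDs. The function `expand_ids` is hypothetical; how SurvCurv itself parses the field, for example regarding whitespace or invalid input, is not documented here.

```python
def expand_ids(spec):
    """Expand a specification such as '164, 166-168,170' into a list of IDs."""
    ids = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # ignore empty entries such as a trailing comma
        if "-" in part:
            first, last = (int(p) for p in part.split("-", 1))
            ids.extend(range(first, last + 1))  # ranges are inclusive
        else:
            ids.append(int(part))
    return ids


print(expand_ids("164, 166-168,170"))
```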


Tests

What is the log-rank test?

The log-rank test, also called Mantel-Cox test or Mantel-Haenszel test, assesses whether there is a difference between two or more survival curves. For more details on how the test works and how it relates to other tests, the Wikipedia article is a good starting point. The log-rank test is more sensitive than the Wilcoxon test to differences between groups later in time.
For each cohort, the test report of SurvCurv shows some details of the test statistics, including the number of observed events and the number of events expected under the hypothesis that all the cohorts are identical. Below these, the chi-squared test statistic, the degrees of freedom and the corresponding p-value are given. Finally, the overall test result based on the p-value is stated.
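
The quantities in the test report (observed events, expected events, chi-squared statistic and p-value) can be sketched for the two-cohort case using the standard log-rank formulas. This is an illustration in plain Python with hypothetical data, not SurvCurv's implementation; it assumes exactly two cohorts, so the test has one degree of freedom.

```python
import math


def logrank(times_a, events_a, times_b, events_b):
    """Two-cohort log-rank test; returns (observed_a, expected_a, chi2, p)."""
    pooled = [(t, e, 0) for t, e in zip(times_a, events_a)]
    pooled += [(t, e, 1) for t, e in zip(times_b, events_b)]
    observed_a = expected_a = variance = 0.0
    for t in sorted({t for t, e, _ in pooled if e}):
        n_a = sum(1 for u, _, g in pooled if g == 0 and u >= t)  # at risk in A
        n_b = sum(1 for u, _, g in pooled if g == 1 and u >= t)  # at risk in B
        d_a = sum(1 for u, e, g in pooled if g == 0 and u == t and e)
        d_b = sum(1 for u, e, g in pooled if g == 1 and u == t and e)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n  # expected deaths in A if cohorts are identical
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    chi2 = (observed_a - expected_a) ** 2 / variance
    p = math.erfc(math.sqrt(chi2 / 2.0))  # chi-squared tail with 1 degree of freedom
    return observed_a, expected_a, chi2, p


control = [10, 20, 20, 30, 40]     # hypothetical lifespans, all deaths observed
treatment = [20, 30, 30, 40, 50]
observed_a, expected_a, chi2, p = logrank(
    control, [True] * 5, treatment, [True] * 5
)
print(observed_a, expected_a, round(chi2, 3), round(p, 3))
```

Here the control cohort has more observed than expected deaths, which is the direction expected when the treatment cohort lives longer.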

Reference publication:
Harrington DP & Fleming TR. A class of rank test procedures for censored survival data. Biometrika. 1982; 69(3):553-566 doi:10.1093/biomet/69.3.553

What is the generalized Wilcoxon test?

The generalized Wilcoxon test (here Prentice's generalization, which is essentially equivalent to Peto & Peto's generalization) tests whether there is a difference between two or more survival curves. The test is more sensitive than the log-rank test to differences between groups early in time.
For each cohort, the test report shows some details of the test statistics, including the number of observed events and the number of events expected under the hypothesis that all the cohorts are identical. Below these, the chi-squared test statistic, the degrees of freedom and the corresponding p-value are given. Finally, the overall test result based on the p-value is stated.

Reference publication:
Harrington DP & Fleming TR. A class of rank test procedures for censored survival data. Biometrika. 1982; 69(3):553-566 doi:10.1093/biomet/69.3.553

What is the Wang-Allison Score test?

The Wang-Allison Score test is a statistical test for differences between the maximal lifespans of two cohorts. To be more robust to sampling problems, i.e. problems caused by the limited number of observations, the longest-living 10% are used instead of the longest recorded lifespan. For each cohort, the test report shows the number of observations among the joint 10% longest lifespans and the fraction of these compared to all observations in the cohort. This is followed by the total number of observations and the total number among the joint 10% longest lifespans; here the fraction should be close to 0.10 (as we want the top 10%). Underneath the table, the p-value and the age corresponding to the split are given, followed by the overall test statement based on the p-value.
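
The table in the test report can be sketched in Python as follows: both cohorts are pooled, the age that splits off the longest-lived 10% is determined, and the observations at or above that age are counted per cohort. The function name, the data and the exact rule for choosing the split (including the handling of ties) are assumptions; the test statistic itself is not reproduced here.

```python
def longest_lived_table(cohort_a, cohort_b, fraction=0.10):
    """Count, per cohort, the observations among the pooled longest-lived fraction."""
    pooled = sorted(cohort_a + cohort_b)
    k = max(1, round(fraction * len(pooled)))  # size of the top group
    cutoff = pooled[-k]                        # age corresponding to the split
    rows = []
    for cohort in (cohort_a, cohort_b):
        in_top = sum(1 for x in cohort if x >= cutoff)
        rows.append((in_top, len(cohort), in_top / len(cohort)))
    total_top = sum(row[0] for row in rows)
    return cutoff, rows, total_top / len(pooled)


# Hypothetical lifespans in days for two cohorts of ten animals each
cohort_a = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]
cohort_b = [15, 17, 19, 21, 23, 25, 27, 29, 31, 33]

cutoff, rows, overall_fraction = longest_lived_table(cohort_a, cohort_b)
print(cutoff, rows, overall_fraction)
```

With tied lifespans at the split, more than 10% of the pooled observations can fall into the top group, which is one reason why the overall fraction is only close to 0.10.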

Original publication:
Wang C, Li Q, Redden DT, Weindruch R, Allison DB. Statistical methods for testing effects on "maximum lifespan". Mech Ageing Dev. 2004 Sep;125(9):629-632. doi:10.1016/j.mad.2004.07.003 PubMed PMID: 15491681

What is Fisher's exact test?

Fisher's exact test is a general statistical test to examine the significance of the association between two kinds of classifications, commonly represented in a contingency table. For testing the difference in survival between two cohorts A and B at a certain time point we construct the following contingency table:
                             cohort A   cohort B
#alive/at risk at time t        a          b
#dead at time t                 c          d

The p-value is then defined by [formula for p]

For each pair of cohorts, the test report shows the p-value as well as the odds ratio estimate and its 95% confidence interval. The odds ratio is the odds of an event occurring in group A divided by the odds of it occurring in group B. Thus, if both groups have the same risk, the odds ratio is one. Values larger than one indicate a higher risk in group A, while values smaller than one indicate a higher risk in group B.
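
For the contingency table with cells a, b (alive) and c, d (dead), the standard two-sided Fisher's exact test sums the hypergeometric probabilities of all tables with the same margins that are at most as probable as the observed one. The Python sketch below follows that textbook definition with hypothetical counts. The sample odds ratio shown takes death as the event, which is an assumption; statistics packages may report a conditional estimate instead and may orient the ratio differently.

```python
from math import comb


def fisher_exact_p(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of the table with x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-7)  # tolerance for rounding errors
    )
    return min(1.0, p_value)


def death_odds_ratio(a, b, c, d):
    """Sample odds of death in cohort A divided by the odds of death in cohort B."""
    return (c / a) / (d / b)


# Hypothetical counts at time t: cohort A has 8 alive and 2 dead, cohort B 2 alive and 8 dead
p_value = fisher_exact_p(8, 2, 2, 8)
odds_ratio = death_odds_ratio(8, 2, 2, 8)
print(round(p_value, 4), odds_ratio)
```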


Cox Proportional Hazards Analysis

What is Cox Proportional Hazards?

The Cox proportional hazards model is a statistical model of survival data with one or more covariates or factors. It makes it possible to identify which factors contribute significantly to the overall model and to quantify their influence. The model operates on hazard rates, also known as mortality rates, and assumes that the hazards of the different conditions are proportional, i.e. multiples of each other. This assumption should be checked (see cox.zph test and diagnostic plots below) and taken into account when interpreting the results.
Please note that our Cox online analysis currently does not support stratification or time-varying covariates.

Further information:
Cox proportional hazards model on Wikipedia

Original publication:
Cox, D. Regression models and life-tables. J Roy Statist Soc Ser B (Methodological). 1972; 34:187-220. JSTOR 2985181

What does the result mean?

For each covariate or factor, the Cox PH model returns a coefficient and a p-value. The exp(coef) gives the multiplier of the mortality rate, i.e. exp(coef) < 1 indicates a reduced mortality rate, corresponding to increased survival, while exp(coef) > 1 indicates an increased mortality rate. The p-value indicates the significance of the result.
Please always check the proportional hazards assumption (see cox.zph test and diagnostic plots below)!

What are interaction terms?

Interaction terms in a Cox PH model are additional factors for the co-occurrence of the potentially “interacting” covariates or factors. They make it possible to examine whether and in which way two covariates interact. In the first case, covariates exhibit an effect only when co-occurring, e.g. the UAS-GAL4 expression system in Drosophila; then only the interaction term shows a significant effect. In the second case, co-occurring covariates create a larger effect than the sum of the individual effects; then the interaction term shows a significant effect in the same direction as the individual covariates. In the third case, one covariate inhibits the effect of another, which is indicated by a significant interaction term with an effect opposing the individual one. If an interaction term is not significant, no evidence for an interaction between the factors was found.
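
How an interaction term enters the model can be sketched in Python: the coefficients of all terms that apply to a condition are summed, and the exponential of that sum is the multiplier of the mortality rate relative to the baseline. The coefficient values below are hypothetical and chosen to mimic the first case, where only the combination of GAL4 and UAS has an effect.

```python
import math


def relative_hazard(gal4, uas, coef):
    """Mortality rate multiplier relative to the baseline (gal4 = uas = 0)."""
    linear = (
        coef["GAL4"] * gal4
        + coef["UAS"] * uas
        + coef["GAL4:UAS"] * gal4 * uas  # interaction applies only if both are present
    )
    return math.exp(linear)


# Hypothetical coefficients: no individual effects, beneficial interaction
coef = {"GAL4": 0.0, "UAS": 0.0, "GAL4:UAS": -0.4}

for gal4, uas in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(gal4, uas, round(relative_hazard(gal4, uas, coef), 3))
```

Only the condition carrying both constructs has a multiplier below one, i.e. a reduced mortality rate.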

What is the cox.zph test?

The cox.zph test is a test for the proportional hazards assumption of a Cox model. It is a chi-square test between the Kaplan-Meier transformed survival times and the Schoenfeld residuals as proposed by Grambsch and Therneau.
The p-values of the proportional hazards tests, like those of any other test, depend strongly on the sample size. A gross violation may not be statistically significant if the sample size is very small, and even slight violations, causing negligible errors in the estimated coefficients, may be highly significant if the sample size is very large. The respective correlation coefficients rho estimate the size of the deviation from the assumed independence (i.e. no correlation) and can thus be helpful in interpreting the test results.

Original publication:
Grambsch, P. and Therneau, T. Proportional hazards tests and diagnostics based on weighted residuals Biometrika, 1994, 81, 515-526

What are these diagnostic plots?

The diagnostic plots are a graphical way to assess the proportional hazards assumption. They show the scaled Schoenfeld residuals plotted against the transformed time. Additionally shown are a non-linear fitted line of the data (black solid) and a horizontal line (dotted blue) corresponding to the determined Cox coefficient. There should be no clear overall increasing or decreasing tendency, and the fitted non-linear line should roughly follow the dotted horizontal line. The example shown below is a very good case.

Example of Diagnostic Plot for the Proportional Hazards Assumption

Example is the diagnostic plot of the UAS-factor taken from CoxPH analysis of SurvCurv:164, 166, 168, 170 (Ikeya et al., 2009) using factors for GAL4, UAS and GAL4-UAS interaction.

Original publication of the used data:
Ikeya T, Broughton S, Alic N, Grandison R & Partridge L (2009) The endosymbiont Wolbachia increases insulin/IGF-like signalling in Drosophila. Proc Biol Sci, 276(1674), 3799-3807

What to do if CoxPH is violated?

If the proportional hazards assumption is violated with noticeable deviations, alternative or more complex analyses might be necessary. Options include Cox proportional hazards analysis with time-dependent or transformed covariates, or Accelerated Failure Time (AFT) analysis. These are currently not available in SurvCurv. However, you can download the survival data of interest from the database and perform these analyses locally using your preferred statistics software, such as R.

