


Protein Corral is a web application that combines Information Retrieval and Information Extraction from Medline. It retrieves the Medline abstracts that match your search criteria in the same way PubMed does, then analyzes them to give a complete overview of the associations between UniProt protein/gene names. The associations are ranked in three levels of confidence:

  1. Ppi: Pattern matching (natural language processing). This is the highest level of confidence; the method is precision based.
  2. Co3: Tri Co-Occurrence. Two protein/gene names are found together with a verb in a sentence, and the number of sentences in which the three entities appear together is greater than can be attributed to chance. This method offers an intermediate level of confidence, a middle step between precision and recall.
  3. Co: Co-Occurrence. This is the lowest level of confidence; the method is recall based.

The results are shown in a table that displays all the associations and links them to the sentences that support them as well as to the original abstracts. When appropriate, the involved verbs are also displayed. The table is sorted by relevance so that the associations better supported by the evidence are found higher up.


Supported queries

1) By list of PMIDs: Where you type in a list of PMIDs. A PMID is a number of 1 to 8 digits, with no leading zeros, that uniquely identifies a Medline Abstract. This is the default query type: if your query string contains at least one valid PMID, the system retrieves only the Abstracts whose PMIDs you provide, unless you explicitly go to the advanced query form and select another type of query.

2) By list of terms: Where you type in a list of terms. The terms are looked up in the AbstractText, ArticleTitle, AuthorList and MeshHeadingList fields. Only the Abstracts that have ALL the terms in at least one of these fields are retrieved and processed. If you use any reserved word in your query string, it is treated as a customized query.

3) Customized query: Where you type in a valid query. Valid queries follow the syntax rules that are explained below.

Note: Only up to 20 terms are allowed in the query string.
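
The way a query string is assigned to one of the three query types can be sketched in a few lines of Python. This is only an illustration of the rules described above, not Protein Corral's actual code: the function name `classify_query`, the whitespace tokenization and the returned labels are assumptions, and only reserved words (not reserved symbols) are checked.

```python
import re

# Reserved words as listed on this help page (field names and word operators).
RESERVED_WORDS = {
    "PMID", "AbstractText", "ArticleTitle", "AuthorList", "MeshHeadingList",
    "DateCreated", "DateCompleted", "DateRevised", "PubDate", "Language",
    "TO", "AND", "OR", "NOT",
}
# A PMID is 1 to 8 digits with no leading zero.
PMID_PATTERN = re.compile(r"[1-9][0-9]{0,7}")
MAX_TERMS = 20


def classify_query(query: str) -> str:
    """Guess the query type (hypothetical helper, simplified tokenization)."""
    terms = query.split()
    if len(terms) > MAX_TERMS:
        raise ValueError("only up to 20 terms are allowed in the query string")
    # Default query type: at least one valid PMID means a PMID query.
    if any(PMID_PATTERN.fullmatch(term) for term in terms):
        return "pmids"
    # Any reserved word turns the query into a customized query.
    if any(term in RESERVED_WORDS for term in terms):
        return "customized"
    return "terms"
```

With this sketch, `classify_query("12345 678")` gives `"pmids"`, `classify_query("insulin receptor")` gives `"terms"`, and `classify_query("insulin AND receptor")` gives `"customized"`.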


Query Syntax Rules

A query is composed of terms and operators.


There are two types of terms, which are single terms and phrases. A single term consists of a single word such as "Miguel" or "Arregui". A phrase is a group of terms within double quotes such as "Miguel Arregui".

There are some characters and words that _cannot_ be used directly in a term because they have a special meaning. They are:

PMID AbstractText ArticleTitle AuthorList MeshHeadingList DateCreated DateCompleted DateRevised PubDate Language

TO AND OR NOT + - && || ! ( ) [ ] { } ^ ~ : \

If you need to use them, you have to escape them: put a leading "\" before each operator character, and put double quotes around a field name. For example, if you want to search for "(1+1):2" you have to write "\(1\+1\)\:2".
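
If you build query strings from a program, the escaping rule can be sketched as follows. The helper name `escape_term` is hypothetical, and treating every single "&" and "|" as special (rather than only the pairs "&&" and "||") is a simplifying assumption.

```python
# Operator characters from the list above. "&" and "|" are escaped one by one,
# which is a simplification of the "&&" and "||" operators.
SPECIAL_CHARS = set("+-!()[]{}^~:\\&|")


def escape_term(term: str) -> str:
    """Put a backslash in front of every operator character (hypothetical helper)."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in term)


print(escape_term("(1+1):2"))  # prints \(1\+1\)\:2
```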


Queries are always performed on fields. Each document in Medline has been broken down into the following fields:

PMID AbstractText ArticleTitle AuthorList MeshHeadingList DateCreated DateCompleted DateRevised PubDate Language

and then a comprehensive index has been produced based on these fields. When you write a query you can specify which fields in the index are to be looked up. To do so you provide the field name followed by a ":" and then your single term or phrase.

If you do not specify a field name, the default one depends on the type of query you write. For queries by list of PMIDs (the query string has at least one PMID) the PMID field is looked up. For queries by list of terms the fields AbstractText, ArticleTitle, AuthorList and MeshHeadingList are looked up, so that at least one of them has ALL of the terms you provide. For customized queries you provide the fields to be looked up. Be careful with the way you type in customized queries: a field name applies only to the term that immediately follows it, and any term without a field name is looked up in AbstractText. So the query "AuthorList: Miguel Arregui" will be translated into "AuthorList: Miguel AbstractText: Arregui". To avoid this, you can use parentheses to group terms: "AuthorList: (Miguel Arregui)".

Note: The default operation is AND, so the latter query will only return those Abstracts that have "Miguel" AND "Arregui" in the AuthorList field. You can alter this behavior by explicitly using the OR operator: "AuthorList: (Miguel OR Arregui)".

There are several ways to modify your terms:

Wildcard Searches

You can use the "*" character to vary the term by zero or more characters at the position where you place it. So if you want to search for "House", "Hose" or "Horse", you could type in "Ho*se". You can use the "?" character to vary the term by exactly one character, so, going back to our example, if you type in "Ho?se" you would get "Horse" or "House", but you would not get "Hose". You may use "*" or "?" in the middle or at the end of the term but _never_ at the beginning.
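
The wildcard behavior can be tried out with Python's standard `fnmatch` module, whose "*" and "?" have the same meaning as described here. This is only a stand-in for the real index lookup: the word list is made up, matching is case sensitive in this sketch, and `fnmatch` also gives "[" a special meaning that this page does not describe.

```python
from fnmatch import fnmatchcase


def wildcard_match(pattern: str, word: str) -> bool:
    """Match one word against a wildcard term (hypothetical helper)."""
    if pattern[:1] in ("*", "?"):
        raise ValueError("a wildcard is never allowed at the beginning of a term")
    return fnmatchcase(word, pattern)


words = ["House", "Hose", "Horse", "Home"]
star = [w for w in words if wildcard_match("Ho*se", w)]
question = [w for w in words if wildcard_match("Ho?se", w)]
print(star)      # ['House', 'Hose', 'Horse']
print(question)  # ['House', 'Horse']
```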

Fuzzy Searches

Single term searches can be enhanced to benefit from the wonders of fuzziness. You achieve that by appending the "~" character to the end of the term. The query will then also give you back terms similar to the one you provide, so if you query for "long~" you might end up getting "bong". You can give the fuzzy search a numeric parameter between 0.0 and 1.0 (it defaults to 0.5), as in "long~0.8". This query is expanded only into terms that are at least 80% similar to "long".
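
To get a feeling for what the numeric parameter does, here is a minimal sketch that measures similarity as one minus the edit distance divided by the length of the longer term. This page does not say which similarity measure the system really uses, so the formula, the "greater than or equal" comparison and the function names are all assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed measure: 1.0 for identical terms, lower as they differ more."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def fuzzy_match(term: str, candidate: str, minimum: float = 0.5) -> bool:
    return similarity(term, candidate) >= minimum


print(similarity("long", "bong"))        # 0.75
print(fuzzy_match("long", "bong"))       # True, 0.75 >= 0.5
print(fuzzy_match("long", "bong", 0.8))  # False, 0.75 < 0.8
```

Under these assumptions "long~" would pick up "bong", while "long~0.8" would not.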

Proximity Searches

They look a lot like fuzzy searches, but now you provide a phrase, not a single term, followed by "~" and an integer, not a floating point number. The search matches the Abstracts in which the terms of the phrase appear in the field at most that many terms apart. For example: "Miguel Arregui"~2 will retrieve Abstracts that have "Miguel" AND "Arregui" at most 2 terms apart in the AbstractText field.
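
A proximity check can be pictured as follows. The sketch counts the words that stand between the two terms and ignores their order; the tokenization by whitespace, the way the distance is counted and the function name are assumptions for illustration and may differ from what the system does.

```python
def within_distance(text: str, first: str, second: str, max_distance: int) -> bool:
    """True if both terms occur with at most max_distance words between them."""
    tokens = text.lower().split()
    first_positions = [i for i, t in enumerate(tokens) if t == first.lower()]
    second_positions = [i for i, t in enumerate(tokens) if t == second.lower()]
    return any(abs(i - j) - 1 <= max_distance
               for i in first_positions
               for j in second_positions)


sentence = "Miguel and Carlos Arregui wrote this"
print(within_distance(sentence, "Miguel", "Arregui", 2))  # True, two words in between
print(within_distance(sentence, "Miguel", "Arregui", 1))  # False
```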

Range Searches

Range searches are allowed. The ranges can be between numeric field values, as is the case for dates (DateCreated, DateCompleted, DateRevised, PubDate), and they are expressed as:

"fieldName: [number1 TO number2]"

where number1 and number2 follow the "YYYYMMDD" convention: four digits for the year, two for the month and two for the day. For example: "PubDate: [20040101 TO 20050505]" will return the documents that actually have a value in the "PubDate" field and whose value falls between the 1st of January 2004 and the 5th of May 2005. Ranges can also be between lexicographical field values, for which the syntax is similar to that of the numerical ranges, but using {} instead of []. So, for example: "AuthorList: {Arregui TO Miguel}" will retrieve the Abstracts that have authors named "Arregui", "Miguel", "Carlos", ...
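
The date convention is easy to check in code. The sketch below parses "YYYYMMDD" values and tests whether a date falls inside a range; treating both ends of the range as included is an assumption, and the function names are hypothetical.

```python
from datetime import date, datetime


def parse_yyyymmdd(value: str) -> date:
    """Turn a string such as '20040101' into a date."""
    return datetime.strptime(value, "%Y%m%d").date()


def in_date_range(value: str, start: str, end: str) -> bool:
    """True if value lies between start and end (both ends assumed included)."""
    return parse_yyyymmdd(start) <= parse_yyyymmdd(value) <= parse_yyyymmdd(end)


print(in_date_range("20040615", "20040101", "20050505"))  # True
print(in_date_range("20050506", "20040101", "20050505"))  # False
```

Because the year comes first and every part has a fixed width, comparing the "YYYYMMDD" strings directly gives the same order as comparing the dates.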

Boosting terms

Yep, you can boost a term to give it more weight. You boost the term by appending a "^" and the boost factor as an integer. For example: "Miguel^4" or "Miguel Arregui"^4. Boosting a term makes it more important than the other terms in the query.


The operators let you combine terms to make more precise queries. The operators are:
  • OR: The disjunction operator links two terms and finds documents having either one (or both). It is like a Union of sets. The symbol "||" can be used instead.

  • AND: The intersection operator, used by default if you do not provide any, links two terms and finds documents having both of them. The symbol "&&" can be used instead.

  • NOT: Excludes Abstracts having the terms preceded by this operator. The symbol "!" can be used instead. This operator cannot be used with just one term, because it will give no results (its function is to remove items from a result set).

  • +: The term preceded by this operator must be present in at least one field of the retrieved Abstract.

  • -: Abstracts having the term preceded by this operator in any field are excluded.

  • (): Parentheses eliminate confusion and group certain terms and operations. For example: "(Miguel AND Arregui) OR Nothing" is not the same as "Miguel AND (Arregui OR Nothing)".
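
Since the operators behave like set operations, the difference that the parentheses make can be reproduced with Python sets. The four tiny "documents" below are made up for the illustration; only the operator semantics come from the descriptions above.

```python
# Made-up documents: document id -> words it contains.
docs = {
    1: {"miguel", "arregui"},
    2: {"miguel", "nothing"},
    3: {"nothing"},
    4: {"arregui"},
}


def having(term: str) -> set:
    """Ids of the documents that contain the term."""
    return {doc_id for doc_id, words in docs.items() if term in words}


miguel, arregui, nothing = having("miguel"), having("arregui"), having("nothing")

left = (miguel & arregui) | nothing    # (Miguel AND Arregui) OR Nothing
right = miguel & (arregui | nothing)   # Miguel AND (Arregui OR Nothing)
excluded = miguel - nothing            # Miguel NOT Nothing

print(left)      # {1, 2, 3}
print(right)     # {1, 2}
print(excluded)  # {1}
```

The last line also shows why NOT needs a second term: it only removes items from a result set that something else has produced.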


HitPair Table

The HitPair table is the table that displays the HitPairs. So, what is a HitPair?

A HitPair is the co-occurrence of two different Protein/Gene names in a sentence. Different means that the entities are recognized with their own distinctive id in a public domain database, UniProt in this case, so you will never find a HitPair composed of the same entity repeated. To give an example, the proteins "MAPK" and "BMP" form a HitPair if there is a sentence that contains them both in any Abstract retrieved by your query. As you can imagine, a HitPair can be present more than once in an Abstract (several sentences might have it), and it can also be found in several Abstracts all over Medline or, to be more precise, all over the set of Abstracts retrieved by your search criteria. Finally, the entities in a HitPair are not ordered, so "MAPK ~ BMP" and "BMP ~ MAPK" are the same HitPair. The HitPairs in Protein Corral are extracted following three different methods: Natural Language Processing (precision based), Tri Co-Occurrence (mid step) and simple Co-Occurrence (recall based).
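
The definition of a HitPair maps naturally onto an unordered pair of ids. The sketch below counts simple co-occurrences per sentence; it assumes that entity recognition has already been done and that each sentence arrives as a list of entity ids (protein names stand in for UniProt ids here). The function name and input format are hypothetical.

```python
from collections import Counter
from itertools import combinations


def hit_pairs(sentences):
    """Count, per unordered pair of entities, the sentences that contain both."""
    counts = Counter()
    for entity_ids in sentences:
        # set() drops repeated mentions, so a pair never holds the same entity twice.
        for a, b in combinations(sorted(set(entity_ids)), 2):
            counts[frozenset((a, b))] += 1
    return counts


sentences = [["MAPK", "BMP"], ["BMP", "MAPK", "MAPK"], ["MAPK"]]
counts = hit_pairs(sentences)
print(counts[frozenset(("MAPK", "BMP"))])  # 2
print(counts[frozenset(("BMP", "MAPK"))])  # 2, the order does not matter
```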

The HitPair table has several rows, each representing a set of HitPairs, and several columns. The first column represents one half of a HitPair that is to be found in conjunction with the other halves represented in the second column. The next columns show the number of Abstracts/Sentences in which the particular HitPair is found for the method listed in the header of the table. The last column gives a compilation of the verbs that were found as the evidence of the association between the two halves that compose the HitPair.

The HitPair table is ordered by relevance. The rows are sorted by the number of Abstracts/Sentences in the first method, then in the second and finally in the third. The columns are sorted following the same criteria.
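
The row ordering amounts to a sort on three keys. In the sketch below the column order Ppi, Co3, Co is an assumption (highest confidence first), and the pair names "X" and "Y" and all counts are invented for the illustration.

```python
# Invented rows: one HitPair with its sentence counts per method.
rows = [
    {"pair": ("MAPK", "BMP"), "ppi": 1, "co3": 2, "co": 4},
    {"pair": ("BMP", "X"), "ppi": 0, "co3": 5, "co": 9},
    {"pair": ("MAPK", "Y"), "ppi": 1, "co3": 3, "co": 3},
]

# Sort by the first method, then the second, then the third, highest counts first.
ranked = sorted(rows, key=lambda r: (r["ppi"], r["co3"], r["co"]), reverse=True)

for row in ranked:
    print(row["pair"], row["ppi"], row["co3"], row["co"])
# ('MAPK', 'Y') 1 3 3
# ('MAPK', 'BMP') 1 2 4
# ('BMP', 'X') 0 5 9
```

Note that the pair with the most co-occurrences ends up last, because the counts of the higher-confidence methods are compared first.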