Literature Services - Internships: Tasks

The list of tasks below is neither exhaustive nor immutable. If there is interest, or another good reason, we can modify some of these tasks by extending or limiting their scope or by changing their other requirements. We can also add tasks not listed here, and we are open to ideas and proposals from students, their academic supervisors or other interested parties. Some changes may also become necessary in the course of the internships.

Task A. Extracting metadata from HTML-formatted academic papers.
Task B. Extracting the citation context
Task C. Identifying and extracting the structural elements of a paper
Task D. Extracting the images, the image captions and their contexts.
Task E. Adapting the LibX Firefox plugin to connect citation information in HTML pages to Google Scholar and the citeXplore database
Task F. Reconstructing the text flow of academic papers which have been converted from PDF format into HTML

Task A. Extracting metadata from HTML-formatted academic papers.

Currently we extract a minimal set of metadata, such as the family name of the first author and the title, from the HTML-formatted academic papers. Other useful information would be:

Task B. Extracting the citation context

Every work that appears in the bibliography of a paper is cited within a specific context. An example is given below in Fig. 1:

Fig. 1 Citation context

In this case (11-13) refers to the following papers listed in the bibliography:

Fig. 2 Bibliography of a paper

The sentence 'Studies in Xenopus egg extracts have demonstrated that an analogous XIORC is required for initiation of replication' is thus the citation context of all the papers listed in 11-13.

Extracting the citation context of a paper helps us understand what the paper is about. It is another type of paper topic metadata, similar to the title, the keywords, the abstract and the conclusion. Its advantage over these types of metadata is that it reflects the informed opinion of the scientific community about the value and relevance of the cited paper, whereas the other types of topic data are given by the authors themselves, which makes them less objective. If citeXplore users are given all the citation contexts for a paper, they can quickly determine its topic and its relevance to their area of research, which saves them time and increases their efficiency.

Context extraction is in essence about creating a link between the papers in the bibliography and the sentences and paragraphs where they are referred to. However, delimiting the beginning and the end of the context is not trivial. At a minimum, the context is the sentence where the citation identifier(s) (in this case 11-13) are mentioned. But sometimes the context stretches into the next sentence or even further. Identifying the limit of the citation context is thus more complicated than extracting the sentence where the paper has been cited. Different strategies can be applied: for example, the citation context can be taken to end before the sentence where another paper is cited, or a linguistic technique can be used to determine topic continuity (e.g. sentences starting with 'In addition..', 'An example of this is..', etc. suggest topic continuity).

In general, this is a potentially quite interesting task which allows going beyond what is taught in the standard CS/SE curriculum, so it might be of special interest to a student in computational linguistics or natural language processing. But a rough approach, such as extracting only the sentence where a paper has been cited, might be an acceptable solution as well, since it corresponds to the majority of real-life cases.

Task C. Identifying and extracting the structural elements of a paper

Every paper has a distinct structure. Common structural elements are: abstract, introduction, main thesis, methods, experimental validation, discussion of the relevant literature, conclusion, proofs and other addenda/exhibits. Identifying the structural elements would enable us to create a skeleton of a paper containing the section and subsection headings. Creating such a skeleton would allow us to index the paper finely (e.g. by giving a different weight to terms based on the section where they are mentioned; a term in the relevant literature part might be less relevant than a term mentioned in the conclusion). In addition, linking the corresponding text blocks to this skeleton would enable users to browse within the paper, offering them greater flexibility. Quite often users are not interested in reading the whole paper (or don’t have the time for it) but would rather skip to a part which is of greater interest to them (e.g. “Experiments”).

This task is in our opinion relatively straightforward (identifying the subsections using layout cues) and can thus be done with relatively simple means. However, the variety of layouts encountered may again necessitate the use of machine learning (ML) methods.

Task D. Extracting the images, the image captions and their contexts.
A lot of the information presented in a paper is contained in its images (X-ray/crystallographic examinations, schemes of protein interactions, 3D models of molecules and many others). Currently this data is hidden within the papers. We want to extract it and implement a distinct image search functionality within citeXplore, similar to “Google Images Search”. For this we need to extract not only the images themselves but also their captions and their citation contexts. An image is usually pointed to within the text of the paper, e.g. “The interaction of the protein hT1 with protein yG proceeds in three stages (fig 1.)”. In this case we can infer that fig. 1 contains information about the proteins hT1 and yG and use this information to index the image.

This task is in essence a combination of tasks B and C. It is thus potentially suitable for a larger project such as a master thesis or final undergraduate project. But a rough extraction using simple textual cues (“extract the sentence where the pattern [fig.] occurs”) may yield a satisfactory solution as well.

Task E. Adapting the LibX Firefox plugin to connect citation information in HTML pages to Google Scholar and the citeXplore database

The LibX Firefox plugin, developed at Virginia Tech, detects citation information in HTML pages and creates a clickable link out of it using the GreaseMonkey framework. When clicked, the link submits a query to a college library database. The idea is to shorten the path for the user from any web page to their college library repositories. The plugin has been adopted by a large number of US universities (MIT and Stanford, to mention some of the best known) and is installed by default in the Firefox browsers available on the campus computers at these universities.

While creating a citeXplore LibX edition should be trivial (formatting the generated link to point to citeXplore), we think we can do more than just that.
An interesting application would be to automatically submit queries to Google Scholar and thus inform the user directly about the number of papers citing the paper mentioned on the page. We would prefer to analyze the results of the Google Scholar query, possibly extending the citation network of citeXplore. This should be straightforward, as a citation extraction algorithm is contained within the LibX plugin, but client-based processing and the communication with citeXplore may require some performance fine-tuning. The idea is that this processing and communication should not affect the performance of the browser or the user experience.

It is unclear as of now how difficult this task may be, so to be on the safe side we suggest that the task be undertaken as part of a larger project (e.g. a final undergraduate project or a master thesis). We prefer that this project be done on-site. We think this project is of high practical relevance for a future IT professional, as browser-based applications seem to be an increasing trend, reflecting the need of independent software developers to achieve greater portability and avoid the limitations of a specific operating system.

Task F. Reconstructing the text flow of academic papers which have been converted from PDF format into HTML

As part of our project we are converting academic papers in PDF format into HTML. PDF is a layout-oriented binary format which in its initial versions did not contain any semantic information (e.g. paragraph, title, etc.). Converting a PDF document back into a text-based format therefore requires "guessing" the semantic information based on the layout. An example would be to infer that the title of a document is the text string on its first page which has the largest font size. Such inference is needed because the HTML document produced by the PDF conversion does not have <title> tags. It only contains layout information such as the coordinates of the text string, the font name, style (e.g. italics, bold) and size. This layout information, however, does not encode the text flow of the document (no <p> tags are available). The text flow has to be reconstructed using the white space between columns and paragraphs. Most academic papers follow a simple two-column layout like the one below.

Fig. 3 Regular text flow of an academic paper (T.M. Breuel “High Performance Document Layout Analysis” SDIUT 2003)

But sometimes the text flow is a bit more complicated (fig. 4).

Fig. 4 Irregular text flow of an academic paper (T.M. Breuel “High Performance Document Layout Analysis” SDIUT 2003)

We need algorithms which use the white space between the paragraphs/columns and/or textual cues (looking for sentence continuation) to determine the correct text flow. Students with a background or interest in computer vision, pattern recognition or natural language processing/computational linguistics are likely to find this project interesting and possibly suitable as a master thesis or final undergraduate project.
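
To make the white-space idea in Task F more concrete, here is a minimal sketch in Python of how the reading order of a regular two-column page might be recovered. It is an illustration only, not the method used in the project: the `Fragment` type, its fields and the `min_gap` threshold are hypothetical, and we assume the converter gives each text string a left edge, a top edge and a width.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Fragment:
    """A positioned text string (hypothetical shape of the converter output)."""
    x: float      # left edge
    y: float      # top edge, growing downwards
    width: float
    text: str


def find_gutter(fragments: List[Fragment], min_gap: float = 10.0) -> Optional[float]:
    """Return the x-centre of the widest vertical white-space band, or None."""
    if not fragments:
        return None
    # Sort the horizontal extents and sweep left to right, tracking how far
    # the page is covered by text so far.
    intervals = sorted((f.x, f.x + f.width) for f in fragments)
    best_gap, best_mid = 0.0, None
    covered_until = intervals[0][1]
    for left, right in intervals[1:]:
        gap = left - covered_until
        if gap > best_gap:
            best_gap, best_mid = gap, covered_until + gap / 2
        covered_until = max(covered_until, right)
    # min_gap is an assumed threshold; a real value depends on the page units.
    return best_mid if best_gap >= min_gap else None


def reading_order(fragments: List[Fragment]) -> List[Fragment]:
    """Order fragments column by column, top to bottom within each column."""
    gutter = find_gutter(fragments)
    if gutter is None:
        # No column gap found: treat the page as a single column.
        return sorted(fragments, key=lambda f: (f.y, f.x))
    # False sorts before True, so the left column comes first.
    return sorted(fragments, key=lambda f: (f.x >= gutter, f.y, f.x))


page = [
    Fragment(300, 100, 200, "right column, line 1"),
    Fragment(50, 100, 200, "left column, line 1"),
    Fragment(50, 120, 200, "left column, line 2"),
    Fragment(300, 120, 200, "right column, line 2"),
]
for fragment in reading_order(page):
    print(fragment.text)
```

This sketch only covers the regular layout of Fig. 3. A full-width title or figure spanning both columns would hide the gutter, and the irregular flow of Fig. 4 would need the page to be split into blocks first, possibly combined with the sentence-continuation cues mentioned above.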