EMBL-EBI at Google’s Summer of Code – EMBL-EBI Events



Summary

EMBL-EBI is a global leader in biological data. We develop and maintain open data resources and open-source software that support life science research worldwide. Our teams work at the intersection of biology, data science, and software engineering, building tools used daily by researchers across academia, healthcare, and industry.

Through Google Summer of Code (GSoC), EMBL-EBI mentors contributors to work on real-world open-source projects, helping them develop technical skills, domain knowledge, and experience contributing to widely used scientific software. The project ideas listed below reflect the breadth of work across EMBL-EBI and are designed to support contributors at different experience levels.


How to apply

Google Summer of Code contributors apply directly through the GSoC platform, but we strongly encourage you to engage with EMBL-EBI before submitting your application.

Step 1: Explore the project ideas
Review the project ideas listed below and identify one (or more) that match your interests and skills. Each project includes information about expected outcomes, required skills, and difficulty level.

Step 2: Do some background reading
You are not expected to be an expert in the domain, but spending some time familiarising yourself with the relevant technologies, data resources, or scientific context will help you prepare a stronger application.

Step 3: Get in touch with us early
Once you have a project in mind, contact our GSoC helpdesk (helpdesk@ensembl.org) with the subject line “GSoC”. Please include:

  • a short CV or link to relevant experience
  • a brief explanation of your interest in the project
  • any specific questions you may have

If you are interested in proposing your own project idea, include a short description and the technologies you expect to use so we can assess mentor availability.

Step 4: Draft your application
Prepare your application well ahead of the deadline. A strong proposal clearly explains:

Mentors can provide guidance and feedback, but they will not write the application for you.

Step 5: Incorporate feedback and submit
Use any feedback provided to refine your proposal, then submit your final application via the official GSoC website before the deadline.

For more detailed advice, please see our Contributor Guide, which outlines what we look for in successful applications and how contributors engage with EMBL-EBI teams during the programme.


EMBL-EBI projects – overview

  1. Expose a subset of ENA REST Services as MCP
  2. Creating a knowledge graph from a subset of ENA and BioSamples data
  3. Annotation metrics reporting & analysis modules for the Ensembl Assembly/Annotation tracking app
  4. Development of a refinement tool to identify selenoproteins in Ensembl genesets
  5. Expanding a pipeline for small non-coding RNA (sncRNA) identification in Ensembl Genomes
  6. Expand genome metadata in Ensembl with AI tools
  7. BUSCO-Missing Investigator (BMI): a reproducible pipeline to explain “missing/fragmented BUSCOs”
  8. Ask VEPai: trained chatbot interface for Ensembl VEP web
  9. nf-core/vep: Extending and standardising Ensembl VEP workflow for nf-core
  10. Building a perturbation-aware LLM for multimodal in-silico perturbation modelling
  11. Designing an open access Ensembl GraphQL Workshop
  12. Standardised evaluation for microbiome dataset classifiers
  13. Sequence similarity networks for the visualisation and exploration of MGnify Proteins
  14. A genomic feature database in the browser
  15. Design and API-aware UI generation using MCP servers and Figma APIs
  16. Expose BioSamples Submission and Search Capabilities as MCP Tools for AI-Assisted Metadata Interaction

EMBL-EBI projects – details

1. Expose a Subset of ENA REST Services as MCP

Brief Explanation

The European Nucleotide Archive (ENA) provides a rich set of REST APIs that allow users to query genomic metadata, sequence records, and submission information. While these APIs are powerful, they are not directly accessible to modern AI agents or LLM-based tools that rely on standardized interaction protocols.

This project aims to expose a carefully selected subset of ENA REST services through the Model Context Protocol (MCP), enabling AI agents to interact with ENA programmatically in a safe, structured, and reproducible way. MCP acts as a bridge between large language models and external tools by defining explicit schemas, inputs, and outputs, preventing hallucinations and ensuring reliable access to authoritative data sources.

The student will design and implement an MCP server that wraps ENA REST endpoints (e.g. study metadata lookup, run/sample queries, accession search) and exposes them as well-defined MCP tools. The project focuses on correctness, usability, and extensibility rather than deep bioinformatics analysis.

This project is intentionally scoped to be beginner-friendly, with limited bioinformatics background required, and emphasizes software engineering, API design, and AI-tool integration.
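As an illustration of the tool-wrapping involved, the sketch below builds a deterministic query URL for the ENA Portal API search endpoint. The endpoint path and parameter names follow ENA's public documentation but should be verified against the current API, and the MCP-specific wiring (tool registration, schemas) is deliberately left out.

```python
from urllib.parse import urlencode

ENA_PORTAL_SEARCH = "https://www.ebi.ac.uk/ena/portal/api/search"

def build_ena_search_url(result: str, query: str, fields: list[str],
                         fmt: str = "json", limit: int = 10) -> str:
    """Construct a deterministic ENA Portal API search URL.

    An MCP tool wrapping this endpoint would expose `result`, `query`,
    `fields` and `limit` as its typed input schema, and return the
    parsed JSON response as its output.
    """
    params = {
        "result": result,           # record type, e.g. "read_run" or "study"
        "query": query,             # ENA query string
        "fields": ",".join(fields), # columns to return
        "format": fmt,
        "limit": str(limit),
    }
    return f"{ENA_PORTAL_SEARCH}?{urlencode(params)}"

url = build_ena_search_url("read_run", "tax_eq(9606)",
                           ["run_accession", "sample_accession"])
```

Because the URL is constructed from explicit, validated parameters rather than free-form LLM output, the agent's requests stay reproducible and auditable.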

Expected results

Required knowledge

Desirable knowledge

Difficulty

Medium

Length

Medium – 175h

Mentors

Senthilnathan Vijayaraja

2. Creating a knowledge graph from a subset of ENA and BioSamples data

Brief explanation

The European Nucleotide Archive (ENA), one of the three major nucleotide databases in the world, hosts over 70 PB of genomics data. LLMs are well developed for parsing unstructured data, but far less so for structured data.

This project will create a prototype knowledge graph (KG) to make the database directly accessible to AI tools. A graph engine will be integrated with the existing structured data store to avoid duplicating data into a graph database. An AI-friendly Graph Query Language (GQL) will be used to interact with the KG, which is backed dynamically by the relational data model via the graph engine. High-profile LLMs will be evaluated for generating GQL statements. The final output will be one or more AI agents supporting Graph RAG over a subset of the structured genomics data in ENA, with the following characteristics:

  1. A working prototype capable of querying a small subset of ENA data (e.g. pathogen, AMR or data deposition analytics).
  2. A clear path to scale up the prototype to expand to all structured data in ENA.
  3. AI agent(s) free of hallucination, with every answer grounded in the KG.
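One way to meet the no-hallucination requirement is to constrain query generation: rather than letting an LLM emit free-form GQL, the agent fills schema-validated templates and rejects anything outside the known schema. A minimal Python sketch (the vertex labels and templates here are invented for illustration, not the actual ENA graph schema):

```python
# Hypothetical ENA-derived vertex labels; a real deployment would load
# these from the graph engine's schema at startup.
KNOWN_VERTEX_LABELS = {"study", "sample", "run"}

def nl_to_gremlin(intent: str, label: str, prop: str, value: str) -> str:
    """Translate one narrow, recognised intent into a Gremlin query.

    Anything outside the schema or the supported intents is rejected
    rather than guessed, which keeps hallucination out of the agent's
    answers.
    """
    if label not in KNOWN_VERTEX_LABELS:
        raise ValueError(f"unknown vertex label: {label}")
    if intent == "count":
        return f"g.V().hasLabel('{label}').has('{prop}', '{value}').count()"
    if intent == "list":
        return f"g.V().hasLabel('{label}').has('{prop}', '{value}').valueMap()"
    raise ValueError(f"unsupported intent: {intent}")

q = nl_to_gremlin("count", "sample", "pathogen", "true")
```

In the full project an LLM would propose the intent, label, and property values, with this validation layer sitting between the model and the graph engine.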

Expected results

  1. Students would learn how AI components are used to construct agent workflows.
  2. Students would gain firsthand experience of creating working prototypes beyond “hello-world” toys.
  3. Students would be able to create standalone AI agents capable of interacting with ENA data with minimal dependencies.
  4. Students would be able to apply the knowledge gained during the programme to create Graph RAGs on any structured data.

Required knowledge

  1. AI-friendly GQL (e.g. Gremlin)
  2. Graph engine (e.g. PuppyGraph)
  3. Python and libraries for AI-agent construction (e.g. LangChain)
  4. Methodology for benchmarking Graph RAG and GQL

Desirable knowledge

  1. ENA schema and tagging mechanism
  2. Kubernetes and its scaling
  3. Scalable local deployment of LLM models (e.g. Ollama)

Difficulty: High

Length: 350h

Mentors: David Yuan

3. Annotation metrics reporting & analysis modules for the Ensembl Assembly/Annotation tracking app

Brief explanation

Ensembl maintains an internal web application to track genome assembly status (e.g. candidates for annotation), Ensembl annotation status, and associated quality/completeness metrics. While the app stores important annotation completeness scores and other quantitative measures, it currently lacks richer reporting and comparative views that help users quickly interpret genome annotation quality across many species.

This project focuses on designing and implementing Python-based analysis modules for genome annotation metrics and comparative analysis, with a strong emphasis on clean workflows, test coverage, and maintainability. These modules will generate per-genome reports and perform comparative analyses across taxonomic groupings, enabling annotators and production teams to identify unusual annotations, trends, and priorities in a reproducible and testable way.

There is an opportunity to integrate the resulting modules into the existing tracking web application, but web/UI integration is not essential to the core project. The primary goal is to produce robust, well-tested backend analysis components that can later be surfaced via the app or reused in other contexts (e.g. batch reporting, pipelines).
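As a sketch of what the comparative analysis module could look like, the following flags genomes whose completeness score deviates strongly from their taxonomic group. The function and field names are hypothetical, not the tracking app's actual schema:

```python
from statistics import mean, stdev

def comparative_report(scores: dict[str, float],
                       groups: dict[str, str],
                       z_threshold: float = 2.0) -> dict[str, list[str]]:
    """Flag genomes whose completeness score is unusual within their
    taxonomic group (absolute z-score above `z_threshold`).

    `scores` maps genome -> completeness score; `groups` maps genome ->
    taxonomic group.
    """
    by_group: dict[str, list[str]] = {}
    for genome, group in groups.items():
        by_group.setdefault(group, []).append(genome)

    outliers: dict[str, list[str]] = {}
    for group, genomes in by_group.items():
        vals = [scores[g] for g in genomes]
        if len(vals) < 2:
            continue  # cannot estimate spread from a single genome
        mu, sd = mean(vals), stdev(vals)
        if sd == 0:
            continue
        flagged = [g for g in genomes
                   if abs(scores[g] - mu) / sd > z_threshold]
        if flagged:
            outliers[group] = flagged
    return outliers

scores = {"g1": 95.0, "g2": 95.0, "g3": 95.0,
          "g4": 95.0, "g5": 95.0, "g6": 60.0}
groups = {g: "rodents" for g in scores}
flagged = comparative_report(scores, groups)
```

A per-genome report module would feed the same `scores` dictionary, making the two deliverables naturally composable.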

Expected results

Deliverable 1: Per-genome annotation metrics report module

Deliverable 2: Taxonomy grouping + comparative analysis module

Final project output

Required knowledge

Desirable knowledge

Difficulty: Beginner

Length: 175h

Mentors: Anna Lazar, Simarpreet Kaur Bhurji, Leanne Haggerty

4. Development of a refinement tool to identify selenoproteins in Ensembl genesets

Brief explanation

Selenocysteine-containing proteins (selenoproteins) play crucial biological roles, but their annotation remains challenging due to the unique incorporation of selenocysteine (Sec, U) at UGA codons. Currently, Ensembl uses Exonerate to align known selenoproteins to genomes and manually verifies models based on sequence identity and coverage. However, the existing approach is inefficient and outdated, requiring a more scalable and automated solution.

This project will develop a Nextflow pipeline to efficiently annotate selenoproteins that can be applied to Ensembl gene sets by:

  1. Optimising the search for selenoprotein homologs
    • Aligning known selenoproteins against the genome using more efficient tools like MMseqs2, DIAMOND, or TBLASTN.
    • Filtering candidate regions based on sequence similarity, focusing on high-identity and high-coverage matches.
  2. Improving selenocysteine validation
    • Detecting UGA codons in aligned models and verifying the presence of SECIS elements (selenocysteine insertion sequences) in downstream regions.
    • Ensuring selenocysteine positions match the reference protein sequences.
  3. Automated filtering and quality control
    • Retaining only models with expected coverage and sequence identity to known selenoproteins.
    • Benchmarking against accurate but computationally intensive dedicated selenoprotein annotation tools.
    • Removing false positives by integrating BUSCO-like completeness scoring with clade-specific selenoprotein sets.
    • Generating quality assessment reports.
  4. Deployability and scalability
    • Implementing the pipeline in Nextflow to improve reproducibility and scalability across multiple genomes.
    • Providing Docker/Singularity containers for easy deployment in HPC and cloud environments.

The final pipeline will be integrated within Ensembl’s genome annotation pipeline to streamline selenoprotein identification, thus improving accuracy, efficiency, and automation.
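At its core, the selenocysteine-validation step (point 2 above) reduces to locating in-frame UGA codons in a candidate CDS and checking that they coincide with the selenocysteine (U) positions of the reference protein. A minimal sketch of that check, with SECIS detection and alignment handling omitted:

```python
def in_frame_uga_positions(cds: str) -> list[int]:
    """Return 0-based codon indices where the CDS has an in-frame TGA.

    In a selenoprotein candidate these should align with selenocysteine
    (U) positions in the reference protein rather than marking a
    premature stop.
    """
    cds = cds.upper()
    return [i // 3 for i in range(0, len(cds) - 2, 3)
            if cds[i:i + 3] == "TGA"]

def sec_positions_match(cds: str, reference_protein: str) -> bool:
    """Check that every in-frame TGA corresponds to a 'U' in the
    reference protein at the same codon index (terminal stop excluded)."""
    n_codons = len(cds) // 3
    for codon_idx in in_frame_uga_positions(cds):
        if codon_idx == n_codons - 1:
            continue  # terminal stop codon, not a Sec candidate
        if (codon_idx >= len(reference_protein)
                or reference_protein[codon_idx] != "U"):
            return False
    return True
```

In the Nextflow pipeline this logic would run downstream of the MMseqs2/DIAMOND/TBLASTN alignment step, on the candidate regions it produces.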

Expected results

Required knowledge

Desirable knowledge

Difficulty: Medium

Length: 175h

Mentors: Jack Tierney

5. Expanding a pipeline for small non-coding RNA (sncRNA) identification in Ensembl Genomes

Brief explanation

This project aims to enhance an existing pipeline for identifying small non-coding RNAs (sncRNAs) in Ensembl genomes. Building on the current MirMachine modules, the pipeline will be expanded to incorporate additional analyses using RFAM and miRBase databases.

Further improvements will include running sequence similarity searches with NCBI-BLAST and generating structural models using the Infernal software suite. The final pipeline will be optimised for flexibility, supporting various input sources, and containerised using Docker/Singularity to ensure reproducibility and shareability.

Expected results

Required knowledge

Desirable knowledge

Difficulty: Medium

Length: 175h

Mentors: Jose Perez-Silva, Vianey Paola Barrera Enriquez

6. Expand genome metadata in Ensembl with AI tools

Brief explanation

Ensembl Plants and Ensembl Metazoa import publicly available genome assemblies and their annotations from community contributors. Whilst assemblies are submitted to the INSDC sequence archives, these submissions are often missing key information that can usually be found in the publication corresponding to that assembly (most frequently because those metadata fields are not available in the submission process). This metadata may not be directly useful for our users, but Ensembl can benefit from it, e.g. polyploid genomes require different processing parameters/tools than diploid genomes when it comes to comparative genomics. Current AI tools make fetching such metadata from research papers much easier, so we would like to build a standalone module that performs this task, with the ultimate goal of incorporating it into our genome loading pipeline.

Expected results

Required knowledge

Desirable knowledge

Difficulty: Medium

Length: 175h

Mentors: Jorge Alvarez, Disha Lodha

7. BUSCO-Missing Investigator (BMI): a reproducible pipeline to explain “missing/fragmented BUSCOs”

Brief description

BUSCO is widely used to measure assembly completeness, but after seeing “Missing” (and often “Fragmented”) BUSCOs, users still need to answer: why are these genes missing and what should I do next?
This project proposes a reproducible, best-practice pipeline/tool that takes BUSCO outputs (and optionally assemblies/annotations/reads) and automatically gathers evidence to generate interpretable, ranked explanations per BUSCO along with a clean, actionable report.
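The pipeline's first step would be to read BUSCO's own output. Assuming the tab-separated `full_table.tsv` layout used by recent BUSCO releases (comment lines starting with `#`, BUSCO id in column 1, status in column 2), a minimal parser might look like:

```python
from collections import Counter

def tally_busco_statuses(full_table_text: str) -> Counter:
    """Tally BUSCO statuses (Complete/Duplicated/Fragmented/Missing)
    from the text of a full_table.tsv."""
    tally: Counter = Counter()
    for line in full_table_text.splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        tally[fields[1]] += 1
    return tally

def ids_with_status(full_table_text: str, status: str) -> list[str]:
    """List BUSCO ids carrying a given status, e.g. the 'Missing' set
    that the pipeline would go on to investigate."""
    return [line.split("\t")[0]
            for line in full_table_text.splitlines()
            if line and not line.startswith("#")
            and line.split("\t")[1] == status]
```

The evidence-gathering stages (re-searching the assembly for each missing gene, checking coverage, etc.) would then iterate over the `Missing` and `Fragmented` id sets.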

Expected results

Required skills

Desirable skills

Difficulty: Medium

Length: 175 hours

Mentors: Swati Sinha, Jitender Jit Singh Cheema

8. Ask VEPai: trained chatbot interface for Ensembl VEP web

Brief explanation

Ensembl VEP is a widely used tool (10+ million Docker Hub pulls alone) for annotating and prioritising genetic variants, used extensively in academic research and clinical assessments.

This project would prototype an AI chatbot configuration interface for the version of Ensembl VEP run from the new Ensembl website. The current selection of options for running the web version of Ensembl VEP is extensive, requiring users to be experienced or willing to read a lot of tooltips and help documentation. A better option would be for users to describe their data and what they are trying to achieve, then receive a set of suggested options with justifications, which they could apply to the configuration with a click before Ensembl runs.

Each Ensembl VEP option would be assessed, labelled, and weighted appropriately. We would then identify an appropriate base chatbot model and assemble a corpus of training data from a mixture of our responses to users and specifically constructed examples. These would be divided into training and test sets, first for training the model and then for assessing its responses.

If this is completed, an optional extension of the project would be to produce a simple API wrapper for IO.

Expected results

Required knowledge

Learning outcomes

Gain experience with data annotation and agent model training and testing, supporting a globally utilised genetic variant annotation tool.

Difficulty: Medium

Length: 175 hours

Mentors: Likhitha Surapaneni

9. nf-core/vep: Extending and standardising Ensembl VEP workflow for nf-core

Brief explanation

The goal of this project is to design, develop, and document an nf-core pipeline for the Ensembl Variant Effect Predictor (VEP) that follows nf-core best practices, fully modularises the existing Nextflow VEP workflow from the Ensembl repository, and provides the required testing and continuous integration. This will bring the Nextflow VEP workflow in line with nf-core standards, providing greater usability for the community.

Ensembl VEP is a widely used variant annotation tool capable of producing rich functional annotations for genomic variants, and it features in many bioinformatics workflows. A Nextflow workflow already exists that leverages Nextflow's parallel-processing capabilities (e.g. splitting VCFs, parallelising per-chromosome analysis, and merging results), but it is not packaged as an nf-core pipeline with the community standards around modularity, container support, automated testing, documentation, and configuration profiles.

Expected results

Required knowledge

Desirable knowledge

Learning outcomes

Enhanced understanding of the structure and workflows required for production pipelines. Appreciation of community standards implementation and the generation of reliable, repeatable and reusable workflows.

Difficulty: Medium

Length: 175 hours

Mentors: Syed Hossain

10. Building a perturbation-aware LLM for multimodal in-silico perturbation modelling

Brief explanation

Recent advances in single-cell foundation models and perturbation-driven datasets are bringing the concept of a “virtual cell” closer to reality. However, most current models remain siloed by modality (CRISPR screens, MAVE, scPerturb-seq) and lack a unifying layer that can integrate causal perturbation knowledge across data types.

In this project, the student will build a prototype perturbation-aware large language model (LLM) by fine-tuning an existing open-source model on curated perturbation datasets from the Perturbation Catalogue. The goal is not to train a foundation model from scratch, but to explore how LLMs can act as a knowledge-integration layer that connects genetic perturbations, variants, and single-cell responses.

The project directly supports the emerging “lab-in-the-loop” and scPerturb-seq Atlas concepts, where models guide experimental design and hypothesis generation by predicting cellular responses to unseen perturbations. The student will prototype workflows for:

This will position the Perturbation Catalogue as a core resource for next-generation in silico perturbation modelling and virtual cell development.

Expected results

By the end of the project, the student will deliver:

Required knowledge

Desirable knowledge

Difficulty: High

Length: 350h

Mentors: Alexey Sokolov, Kirill Tsukanov, Aleksandr Zakirov

11. Designing an open access Ensembl GraphQL Workshop

Brief explanation

The Ensembl GraphQL service can be used to access information about genes, transcripts, assemblies, and associated metadata held by Ensembl. This project will be conducted in collaboration with the Ensembl Outreach and Platform teams to develop a freely available, hands-on workshop teaching participants how to query Ensembl data using GraphQL. The workshop will include modules covering an introduction to GraphQL, schema exploration, query building, and techniques for error handling and debugging. As part of the project, the participant will create documentation with example prompts that can be used with AI assistants (e.g. Gemini) to help generate valid GraphQL queries or assist in debugging scripts. The workshop will be designed to be reproducible and easily extendable, enabling integration of future Ensembl GraphQL resources.

This experience will provide a mentored learning pathway focusing on practical software and data science skills, preparing the contributor for future open-source work.

Learning objectives

Aims

Expected results

Required knowledge

Desirable skills

Difficulty: Medium

Length: 175 hours

Mentors: Aleena Mushtaq, Bilal El Houdaigui

12. Standardised evaluation for microbiome dataset classifiers

Brief explanation

Accurate metadata is essential for interpreting and comparing microbiome datasets. Despite its importance, it often remains incomplete or inconsistent in life-science public repositories. Trapiche is a metadata classification tool for microbiome datasets that combines microbial composition (taxonomic) profiles with free text from project and sample descriptions. The base models can be repurposed for other classification tasks, but users currently lack a simple, standardised way to evaluate model quality and interpret results.

This project will develop an evaluation and reporting toolkit for Trapiche that automatically produces standardised metrics and human-readable reports. A key focus will be to monitor and compare the contribution of both input components: the taxonomic profiles and the text features. This will allow users to understand not only how well models perform, but also how each input type influences the predictions.

The resulting module will shorten development cycles for new microbiome classification tasks and support more reliable, comparable, and reusable life-science datasets.
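At its core, the toolkit needs standardised per-class metrics that can be computed identically for taxonomy-only, text-only, and combined models, so that component contributions are directly comparable. A dependency-free sketch of that core:

```python
def per_class_metrics(y_true: list[str],
                      y_pred: list[str]) -> dict[str, dict[str, float]]:
    """Per-class precision and recall for a set of predictions.

    Running this on predictions from taxonomy-only, text-only, and
    combined models, then comparing the tables, measures how much each
    input component contributes.
    """
    classes = sorted(set(y_true) | set(y_pred))
    report: dict[str, dict[str, float]] = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        report[c] = {"precision": precision, "recall": recall}
    return report
```

The human-readable report would render these tables alongside the ablation comparison, in one consistent format for every new classification task.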

Expected results

Required knowledge

Desirable knowledge

Difficulty: Medium

Length: 175 hours

Mentors: Santiago Fragoso, Mahfouz Shehu

13. Sequence similarity networks for the visualisation and exploration of MGnify Proteins

Brief explanation

In this project, we aim to develop a prototype method for generating sequence similarity networks (SSNs) for the MGnify Proteins database, enabling graph-based analyses of its sequence space. Using tools like MMseqs2 to compute pairwise sequence similarities, and Python graph libraries like NetworkX, a collection of representative SSNs will be generated for a small subset (~10 million proteins) of the database. The nodes of these networks will then be annotated with relevant MGnify metadata, starting with biome of origin. Finally, we will generate visualisations of these annotated SSNs to be displayed on the MGnify Proteins website using modern graph rendering tools like Cytoscape and Cosmograph.

The latest release of the MGnify Proteins Database contains over 2.4 billion non-redundant protein records including relevant metagenomics metadata. The visualisation of sets of protein sequences using SSNs is a common approach for extracting novel insights about protein-protein relationships, including functional, structural, and evolutionary hypotheses. Facilitating the generation of SSNs for the MGnify Proteins database would therefore be a significant contribution to open metagenomics science.
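In the prototype, MMseqs2 would supply the pairwise similarities and NetworkX the graph structure; the dependency-free sketch below shows the core thresholding and biome-annotation logic on plain dictionaries:

```python
def build_ssn(pairwise: dict[tuple[str, str], float],
              biomes: dict[str, str],
              threshold: float) -> dict[str, dict]:
    """Build a sequence similarity network as an adjacency mapping.

    `pairwise` holds MMseqs2-style similarity scores for protein pairs;
    edges are kept only at or above `threshold`, and each node is
    annotated with its biome of origin. NetworkX would replace this
    hand-rolled structure in the real prototype.
    """
    nodes: dict[str, dict] = {}
    for (a, b), score in pairwise.items():
        if score < threshold:
            continue
        for n in (a, b):
            nodes.setdefault(n, {"biome": biomes.get(n, "unknown"),
                                 "neighbours": set()})
        nodes[a]["neighbours"].add(b)
        nodes[b]["neighbours"].add(a)
    return nodes
```

The threshold choice directly controls network density, which is one of the parameters the project would need to explore for visualisation at scale.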

Expected results

Required knowledge

Desirable knowledge

Difficulty: Beginner

Length: 350 hours

Mentors: Christian Atallah

14. A genomic feature database in the browser

Brief explanation

Interactive data science web applications often need to support efficient search over large structured datasets, while keeping latency low and avoiding heavy server-side infrastructure. At large scale, this can be done in several ways, for example: 1) precompute an index file to accompany the dataset; 2) load records into a server-side indexed database behind a REST API; or 3) index and query data in the browser (e.g. using IndexedDB).

In bioinformatics, annotating (meta)genomes involves tagging regions of genomic sequences with feature details (like the location of a gene and its function). Computational pipelines produce these annotations and output standardised formats like GFF (General Feature Format) – effectively a TSV file for genomics. There are various ways to interrogate and visualise these annotations, including genome browsers like JBrowse. A frequent use case is to search the annotations by a query such as a function category label, and then browse to the matching locations in the sequences. Like any database, this becomes challenging for large datasets – in particular the metagenomes we analyse in MGnify become very large.

The objective of this project is to try a mixed approach: convert GFF (and other) files into a SQLite database using gffutils, creating extra database indexes at the same time. We would like to distribute this feature database to the browser and query it using SQLite's WASM in-browser capabilities, both to display a feature search interface and to pass data to JBrowse (perhaps via a new plugin).
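Because `sqlite3` ships with Python, the server-side half of this approach can be sketched without gffutils. The schema and index below are illustrative, but the same database file, once generated, could be queried unchanged by SQLite's WASM build in the browser:

```python
import sqlite3

def load_features(rows):
    """Load GFF-style feature rows into an in-memory SQLite database
    with an index supporting function-category search.

    In the project, gffutils would produce the database from real GFF
    files; this illustrative schema just captures location plus a
    function label.
    """
    db = sqlite3.connect(":memory:")
    db.execute("""CREATE TABLE features (
        seqid TEXT, start INTEGER, end INTEGER,
        feature_type TEXT, function_label TEXT)""")
    db.execute("CREATE INDEX idx_function ON features(function_label)")
    db.executemany("INSERT INTO features VALUES (?, ?, ?, ?, ?)", rows)
    return db

def search_by_function(db, label):
    """Return (seqid, start, end) locations for a function category --
    the data a JBrowse plugin would navigate to."""
    cur = db.execute(
        "SELECT seqid, start, end FROM features WHERE function_label = ? "
        "ORDER BY seqid, start", (label,))
    return cur.fetchall()
```

Precomputing the index at build time keeps the in-browser query path a plain indexed lookup, which is what makes the approach viable for large metagenome annotation sets.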

Expected results

Required knowledge

Desirable knowledge

Difficulty: Medium

Length: 175 hours

Mentor: Vikas Gupta

15. Design and API-aware UI generation using MCP servers and Figma APIs

Brief explanation

MGnify’s web interfaces are built on a large and evolving API surface, with complex data relationships and established frontend patterns. Translating Figma designs into production-ready UI code currently requires significant manual effort, particularly when wiring components to backend endpoints and maintaining consistency across the application.

MGnify already has a prototype Model Context Protocol (MCP) server that exposes tools backed by existing API endpoints. However, coverage is currently partial and focused on selected workflows.

This project proposes extending and integrating an existing MCP server prototype with the Figma API, enabling a design- and API-aware pipeline that assists developers in generating frontend components that are:

The system will act as developer-assist infrastructure, reducing repetitive boilerplate work and accelerating the design-to-implementation cycle while preserving full human control.

Expected results

By the end of the project, the student will deliver:

End-to-end proof of concept

Learning outcomes

Through this project, the student will gain experience in designing and implementing production-grade developer tooling that integrates design systems, APIs, and modern web frameworks. Specifically, the student will learn to:

  1. Work with large, real-world APIs
  2. Understand and extend an existing MCP server exposing a complex, evolving API surface
  3. Reason about API schemas, relationships, pagination, and error handling
  4. Design abstractions that remain stable as backend APIs evolve
  5. Learn to integrate with external APIs
  6. Responsibly build AI-assisted developer tools

Required knowledge

Desirable knowledge

Difficulty: Medium 

The project involves real-world system integration and design decisions, but is well-scoped and suitable for a student with solid web development fundamentals.

Length: 175 hours

Mentors: Mahfouz Shehu

16. Expose BioSamples submission and search capabilities as MCP tools for AI-assisted metadata interaction

Brief explanation

The BioSamples database at EMBL-EBI provides a central repository for the storage, validation, and retrieval of biological sample metadata across a wide range of life science domains. BioSamples plays a critical role in ensuring that sample descriptions are structured, standards-compliant, and reusable across downstream archives such as ENA, ArrayExpress, and others.

Despite the availability of REST APIs for sample submission, validation, and search, these interfaces are not directly accessible to modern AI agents or large language model (LLM)–based systems, which require explicit schemas, deterministic interactions, and well-defined tool boundaries. As a result, the use of AI for assisting users in preparing high-quality BioSamples submissions or performing structured sample discovery remains limited.

This project aims to design and implement a BioSamples MCP server that exposes a carefully selected subset of BioSamples submission and search functionality through the Model Context Protocol (MCP). The system will enable AI agents to interact with BioSamples in a safe, structured, and reproducible manner, reducing metadata errors while improving usability for submitters and data consumers.

The project focuses on two complementary capabilities:

  1. AI-assisted sample submission from plain-text descriptions, with interactive clarification and validation against BioSamples checklists.
  2. Natural-language-driven sample search, converting free-text queries into structured BioSamples search requests.

By leveraging MCP’s explicit tool schemas, input/output contracts, and error semantics, the project prevents hallucinations, enforces metadata correctness, and ensures that all responses are traceable to authoritative BioSamples data sources.

The project is intentionally scoped to be beginner-friendly, requiring limited domain-specific bioinformatics knowledge, and emphasizes software engineering, API design, schema validation, and AI-tool integration rather than biological interpretation.

Project objectives

The primary objectives of the project are to:

Scope and functionality

1. AI-Assisted BioSamples Submission

The system will allow users to describe a biological sample using plain natural language, for example:

“Human liver biopsy collected in London in 2023 from a patient with cirrhosis.”

The MCP server, in combination with an LLM-based agent, will:

The final output will be a validated, submission-ready BioSamples sample object, with all validation decisions and user interactions explicitly traceable.

2. Natural-Language BioSamples Search

The project will also support plain-text search queries, such as:

“Human blood samples collected in Europe after 2020 related to diabetes.”

The system will:

This enables AI agents to act as structured discovery interfaces while preserving the determinism and correctness of the underlying BioSamples queries.
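A minimal sketch of the query-translation step, using invented keyword rules and filter names (the real MCP tool would validate an LLM's proposed filters against the actual BioSamples filter schema rather than rely on hand-written rules):

```python
import re

# Illustrative organism vocabulary; a deployed tool would resolve
# organisms against a taxonomy service instead.
ORGANISMS = {"human": "Homo sapiens", "mouse": "Mus musculus"}

def parse_sample_query(text: str) -> dict:
    """Convert a free-text sample query into structured search filters.

    The structured dict (not the raw text) is what gets sent to the
    BioSamples API, keeping the search deterministic and traceable.
    The filter names used here are hypothetical.
    """
    text_l = text.lower()
    filters: dict = {}
    for word, taxon in ORGANISMS.items():
        if word in text_l:
            filters["organism"] = taxon
            break
    year = re.search(r"\bafter (\d{4})\b", text_l)
    if year:
        filters["collection_date_from"] = f"{int(year.group(1)) + 1}-01-01"
    return filters

q = parse_sample_query("Human blood samples collected after 2020")
```

Because every filter is explicit, any result can be traced back to the exact structured query that produced it, which is the traceability property the project requires.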

Expected results

By the end of the project, the following deliverables are expected:

Advanced capabilities (optional / stretch goals)

Depending on time and interest, the project may additionally explore:

Documentation requirements

The project will include comprehensive documentation covering:

Required knowledge

Desirable knowledge

Difficulty: Medium

Length: 175 hours

Mentor: Dipayan Gupta


Contributor Guide

Properties of a successful application

To be successful with your application, it is important to demonstrate the following:

1. An understanding of the major aims of the project

We do not expect contributors to have expert domain knowledge at the outset. However, some light background reading on the proposed technologies and underlying science will help you better understand the project context and goals.

2. An ability to build on the project idea

We provide a set of project ideas as starting points. Strong applications go beyond simply restating the description and instead bring new ideas, questions, or alternative approaches that build on the initial outline.

3. Clear and appropriate communication with mentors

Engaging with potential mentors ahead of submitting an application is key to success. Mentors are available to answer questions and provide guidance, but they will not write your application for you.

If you need clarification or additional background, communicate this clearly and in good time. Be concise and specific in your questions. Last-minute requests for substantial feedback are generally a sign of poor planning.

4. A realistic and well-structured timeline

Although GSoC timelines are flexible and can sometimes be extended, the programme is still relatively short. A good application includes:

We value sustainable working practices and do not expect contributors to work excessive hours. Availability and constraints should be clearly stated and discussed with mentors.

5. Genuine enthusiasm and engagement

Demonstrated interest in the project, the technologies involved, and working with EMBL-EBI teams goes a long way. Enthusiastic and engaged contributors tend to have more productive mentor relationships and more successful projects.


Steps to building an application

The steps below provide a general guide to submitting a strong application. While we publish a list of suggested projects, contributor-proposed ideas are also welcome.

  1. Review the project ideas
    Review our GSoC project ideas page to explore available projects and their associated technologies.
  2. Select a project of interest
    Read the description carefully and do some light background research if needed.
  3. Get in touch with us
    Contact our GSoC helpdesk (helpdesk@ensembl.org) with the subject line “GSoC”. Please include:
    • a short CV or link to relevant experience
    • a brief explanation of your interest in the project
    • any specific questions you may have
  4. If you are proposing your own project idea, include a short description and the technologies you expect to use so we can assess mentor availability.
  5. Draft your application early
    Prepare a first draft well ahead of the deadline and share it with your mentor(s) or via the helpdesk for feedback.
  6. Incorporate feedback and finalise
    Use the feedback provided to refine your proposal, then submit the final version once it has been reviewed.

Community and collaboration

GSoC contributors at EMBL-EBI are treated as members of their project teams. Contributors typically engage through:

Where time zones permit, contributors may also attend team or section-wide meetings and may be invited to present their work during the programme.


Final notes

Good luck with your application. GSoC has consistently been a rewarding experience for both contributors and mentors at EMBL-EBI, and we look forward to supporting contributors in developing skills, gaining domain knowledge, and contributing to open scientific software.


Helpful links

GSoC Resources
https://google.github.io/gsocguides/student/ 


EMBL-EBI resources and services
We develop and maintain a wide range of open biological data resources, including:

Code repositories

Date: 16 - 31 Mar 2026


Location: Virtual

Venue: Online