Maintaining the central EMBL-EBI websites and ServiceNow platform, and providing web design, development and user experience support.
Last week we noticed a problem after deploying a new version of the EMBL-EBI homepage. The links to the upcoming events were taking a long time to load. A really, really long time. The page wasn’t slow to render; rather, the initial delivery of the HTML page was taking over 10 seconds whenever it wasn’t served from the cache. That was obviously much longer than acceptable! Time to investigate this performance problem.

First we validated that this wasn’t a one-off. Using the Elasticsearch, Logstash, Kibana (ELK) stack that runs over the EMBL-EBI web logs provided by the Web Production team, we were able to confirm that several pages were taking an incredibly long time to load, specifically events and course pages. This gave us a clue as to where the issue was.
Our public information site, i.e. the informational areas of the site rather than the scientific service applications, uses Drupal. It was the initial HTML page response that was slow, suggesting something in Drupal was the culprit.
This gave us an opportunity to apply a couple of immediate ‘band-aid’ mitigations: firstly increasing the caching time, and secondly warming the cache for the next 6 events linked from the homepage. This reduced the impact of the issue, but it had a knock-on effect for those producing content, who would now wait longer before changes to pages became public, as the caches cleared over hours. We let the content editors know about these temporary measures, and then started to investigate the root cause.

We create these event pages by feeding information from our Content Database (a headless Drupal instance) into our public-facing Drupal instance using an XML feed. The public instance then uses a series of views to render the page based on these XML files. At the time of implementation we’d chosen to use a single feed containing all events, a choice we acknowledged then was not going to scale forever. Over time, and at the rate EMBL-EBI holds events, this XML file had grown to contain many hundreds of events. One of our intrepid developers walked the code and confirmed that we were loading an 18MB XML file 16 times for every page load! Suspect identified. After running a couple of tests locally using smaller XML files, we’d found our root cause.
The first change we made was to alter the rendering of these pages to reduce the number of Drupal blocks involved in these views. That reduced the number of times each page load had to read these files. Ideally that number should be once, but that will take some more rearchitecting and will need to wait until we’re out of the woods for this issue.
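The "read it once" goal can be illustrated with a small sketch. This is not our Drupal code, which is PHP; it is a minimal Python illustration of the idea, assuming a hypothetical feed layout of `<events><event id="..."><title>...</title></event></events>`. The names `load_feed` and `event_title` are made up for the example.

```python
import xml.etree.ElementTree as ET
from functools import lru_cache


@lru_cache(maxsize=8)
def load_feed(path: str) -> ET.Element:
    """Parse the feed file once per process and reuse the parsed tree.

    Assumption: the cache key is just the path, so a changed file is not
    picked up until the cache is cleared with load_feed.cache_clear().
    """
    return ET.parse(path).getroot()


def event_title(path: str, event_id: str):
    """Look up one event's title; repeated calls share a single parse."""
    root = load_feed(path)
    # Hypothetical element and attribute names, not the real feed schema.
    for event in root.findall("event"):
        if event.get("id") == event_id:
            return event.findtext("title")
    return None
```

However many blocks on a page call `event_title`, the file is only read and parsed on the first call. The trade-off is staleness: something has to clear the cache when the feed is regenerated.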
The second change was to rearchitect this system to use one small XML file per event page, rather than one giant file used for all events. This change made a massive reduction to page load time in development. Some more testing and then this change was deployed and monitored.
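To make the one-file-per-event idea concrete, here is a rough Python sketch of splitting a combined feed into small per-event files. It is an illustration only: the real feed is produced by our Content Database, and the element name `event`, the `id` attribute, the `event-<id>.xml` naming and the function `split_feed` are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def split_feed(feed_path: str, out_dir: str) -> list:
    """Write each <event> child of the combined feed to its own XML file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    root = ET.parse(feed_path).getroot()
    # Hypothetical schema: events are direct children of the root element.
    for event in root.findall("event"):
        event_id = event.get("id")
        if not event_id:
            continue  # skip entries we cannot name
        # Assumption: ids are safe to use in file names as-is.
        target = out / f"event-{event_id}.xml"
        ET.ElementTree(event).write(
            str(target), encoding="utf-8", xml_declaration=True
        )
        written.append(target)
    return written
```

A page for a single event then only needs to read its own small file, so the cost of rendering it no longer grows with the total number of events.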

We’ve learnt more about our Drupal system and the diagnostic tools available to us in our environments, and we’ve identified an architectural pattern to avoid going forward because it doesn’t scale. We’re continuing to evolve our Content Database, and this is useful input to future development.
We’ve also started to continually track and monitor the page loading times for key pages. This will let us see whether changes we make, or changes in the volume of content, are having an impact on these key metrics. We’re now capturing these load times in a Graphite/Grafana instance, and we get alerted if they creep over a threshold.
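As a rough idea of what capturing a load time can look like, the sketch below times a page request and formats the result for Graphite's plaintext protocol, which takes one `path value timestamp` line per data point (port 2003 by default). It is a hedged example rather than our actual monitoring code: the metric name, host and function names are placeholders.

```python
import socket
import time
import urllib.request


def time_request(url: str, timeout: float = 30.0) -> float:
    """Return the seconds taken to fetch the full response body for url."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()
    return time.perf_counter() - start


def graphite_line(metric: str, seconds: float, timestamp: float) -> str:
    """Format one data point as a Graphite plaintext protocol line."""
    return f"{metric} {seconds:.3f} {int(timestamp)}\n"


def send_to_graphite(line: str, host: str, port: int = 2003) -> None:
    """Send one plaintext line to a Carbon listener over TCP."""
    with socket.create_connection((host, port), timeout=5.0) as sock:
        sock.sendall(line.encode("ascii"))


# Example usage (hypothetical metric name and host):
# elapsed = time_request("https://www.ebi.ac.uk/")
# line = graphite_line("web.homepage.load_seconds", elapsed, time.time())
# send_to_graphite(line, "graphite.example.org")
```

Run on a schedule, something like this produces a time series per key page, and a threshold alert can then be defined on it in the dashboarding layer.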

The act of measuring this metric made us focus more on performance. We’d resolved an acute issue in the initial investigation, but we were still unsure why some of these relatively simple pages were taking so long for Drupal to generate, especially as we run a number of Drupal sites and could see much faster response times for other sites of similar complexity.
Using the XHProf performance profiling tool for PHP, we investigated where time was being spent and identified a Drupal module (one we were no longer using) that was very inefficient and was being called on every page load. Removing this module made a significant improvement to all page response times.
