An eLife filled with possibility thanks to great metadata
eLife recently won a Crossref Metadata Award for the completeness of its metadata, showing itself as the clear leader among our medium-sized members. In this post, the eLife team answers our questions about how and why they produce such high-quality open metadata. For eLife, the work of creating and sharing excellent metadata aligns with their mission to foster open science and supports their preprint-centred publication model, but it also lays the groundwork for all kinds of exciting potential uses.
Having complete and rich metadata puts you in the best position to fulfil future, as-yet-undetermined requirements.
– Fred Atherden, eLife
eLife is a mission-driven organisation tasked by its founders to help scientists accelerate discovery and encourage responsible behaviours in science. As such, we're passionate about open science and metadata, and we're vocal advocates of the benefits these provide to academic communities and beyond.
Given Crossref's position as a hub at the centre of scholarly communication, providing Crossref with complete metadata furthers our mission. It facilitates the discovery and reuse of research and enables linkage to key but often overlooked outputs such as datasets and software. As signatories of DORA and supporters of the Barcelona Declaration, we are keenly aware of the wider context: these efforts enable research assessment and policy decisions to be derived from open and transparent information, moving beyond the closed systems that have allowed the damaging use of anachronistic metrics to proliferate.
There are plenty of existing guidelines that provide a great skeleton to follow. For example, we follow the FAIR data and FORCE11 software citation principles, which ensure the capture of metadata for supporting datasets and software packages. There's no single element we've prioritised; rather, we're keen to follow best practices while also exploring the bleeding edge.
We've collaborated with and relied on the advice of many organisations over the years, including (but not limited to) Crossref, Research Organization Registry (ROR), JATS4R, FORCE11, Software Heritage, openRxiv, and our production vendors Exeter Premedia.
We've developed our own open-source Crossref metadata generation library. Keeping this process in-house has proven really fruitful: it allows us to improve the metadata we provide quickly and continuously.
We also have a data team that has built a centralised data hub, an authoritative resource that can be queried directly rather than piecing information together from disparate systems.
At submission, we collect ROR IDs for (a subset of) affiliations, and structured data for funding, datasets, and other information. Our publication model is centred around preprints, so it's necessary to capture related information such as the preprint DOI, the preprint posted date, the version that pertains to each specific revision, and so on. Without this information, we could not post public reviews to the correct preprint version on the preprint server, or indeed ensure the article we publish is the correct iteration of that work.
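To make that concrete, the version-linkage information described above amounts to a small structured record per revision. Below is a minimal sketch in Python; the field names are hypothetical and the DOIs are placeholders, not eLife's actual data model:

```python
# A minimal sketch of the preprint-linkage record described above.
# Field names are hypothetical and all values are placeholders.
revision = {
    "article_doi": "10.7554/eLife.000000",         # placeholder journal DOI
    "preprint_doi": "10.1101/2024.01.01.000000",   # placeholder preprint DOI
    "preprint_posted_date": "2024-01-01",
    "preprint_version": 2,  # the preprint iteration this revision corresponds to
}

# With a record like this we know exactly which preprint version should
# receive the public reviews, and which iteration of the work is published.
```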
The systems that enable the publication of eLife Reviewed preprints are dependent on DocMaps, a framework for a machine-readable representation of the processes involved in the creation of a document. These are provided by our Data Hub and enable us to capture structured information about the peer review process and accompanying metadata for each article.
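For readers unfamiliar with DocMaps, the sketch below gives a feel for their general shape, rendered here as a Python dict. It is heavily simplified from the public DocMaps vocabulary; real DocMaps, including eLife's, carry far more detail:

```python
# A simplified sketch of a DocMap, based on the public DocMaps vocabulary.
# URLs and identifiers are placeholders; real DocMaps are much richer.
docmap = {
    "@context": "https://w3id.org/docmaps/context.jsonld",
    "type": "docmap",
    "id": "https://example.org/docmaps/1",  # hypothetical DocMap URL
    "publisher": {"name": "eLife", "url": "https://elifesciences.org/"},
    "first-step": "_:b0",
    "steps": {
        "_:b0": {
            # The preprint under review is an input to this step.
            "inputs": [{"doi": "10.1101/2024.01.01.000000"}],
            "actions": [
                {
                    "participants": [
                        {"actor": {"type": "person", "name": "Anonymous"},
                         "role": "peer-reviewer"}
                    ],
                    "outputs": [{"type": "review-article"}],
                }
            ],
            # Simplified; real assertions also identify the item they apply to.
            "assertions": [{"status": "peer-reviewed"}],
        }
    },
}
```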
Our proofing system for journal articles only permits login via ORCID authentication, and we don't capture unauthenticated ORCID IDs that have been copied or keyed (see "What's So Special About Signing In?"). It also makes use of both the Crossref API and the PubMed Central API to ensure we have persistent identifiers for references where possible. We have an in-house content validator, which uses ROR's API to ensure we have ROR IDs for affiliations and funders where possible. We use Software Heritage to archive author-generated code, and include their persistent ID (SWHID) in software references.
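As a rough illustration of the kind of ROR check a validator can perform, the snippet below fetches a registry record and makes a crude comparison against an author-supplied affiliation string. It uses ROR's public v1 organizations endpoint; this is a sketch, not eLife's actual validator:

```python
import json
import urllib.request

def ror_record(ror_id: str) -> dict:
    """Fetch the ROR registry record for an ID like 'https://ror.org/xxxxxxxxx'."""
    with urllib.request.urlopen(f"https://api.ror.org/organizations/{ror_id}") as resp:
        return json.load(resp)

def affiliation_mentions_org(ror_id: str, affiliation: str) -> bool:
    """Crude check: does the affiliation string contain the organisation's
    registered name? A real validator would also consider aliases, acronyms,
    and labels in other languages."""
    record = ror_record(ror_id)  # v1 records carry the primary name in 'name'
    return record["name"].lower() in affiliation.lower()

# Usage (substitute a real ROR ID):
# affiliation_mentions_org("https://ror.org/xxxxxxxxx", "Some University, Somewhere")
```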
All our published content is captured as JATS XML (the industry standard format for journal articles), which our metadata generation library uses as its input.
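To give a flavour of what generating deposit metadata from JATS involves, the toy example below pulls two Crossref-relevant fields out of a JATS document using only the Python standard library. Real JATS processing (namespaces, inline markup in titles, contributors, references) is considerably more involved, and this is not eLife's actual library:

```python
import xml.etree.ElementTree as ET

# Toy JATS fragment with placeholder values.
jats = """
<article>
  <front>
    <article-meta>
      <article-id pub-id-type="doi">10.7554/eLife.000000</article-id>
      <title-group><article-title>A placeholder title</article-title></title-group>
    </article-meta>
  </front>
</article>
"""

# Extract the DOI and article title, two fields any Crossref deposit needs.
meta = ET.fromstring(jats).find("front/article-meta")
doi = meta.findtext("article-id[@pub-id-type='doi']")
title = meta.findtext("title-group/article-title")
print(doi, "-", title)
```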
Persistent identifiers are very useful for reporting. Creating a report that, for example, includes publication volumes from a particular institution is trivial when content is enriched with persistent identifiers. It's more complex when all you have are messy author-supplied strings of text. They're also useful for content validation. For example, when we have a persistent ID and a method to retrieve the related metadata, we can confirm that the information we've been provided is complete and correct.
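Returning to the reporting example: with ROR IDs in the metadata, a per-institution publication count reduces to a single API call. The sketch below uses OpenAlex's works endpoint (one of the open APIs mentioned later in this post) with its documented `institutions.ror` filter and `group_by` parameter; treat it as illustrative:

```python
import json
import urllib.request

# Count works per year for one institution, identified by its ROR ID.
ror_id = "https://ror.org/xxxxxxxxx"  # placeholder: substitute a real ROR ID
url = ("https://api.openalex.org/works"
       f"?filter=institutions.ror:{ror_id}&group_by=publication_year")

with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

# OpenAlex returns grouped counts under the 'group_by' key.
for bucket in data["group_by"]:
    print(bucket["key"], bucket["count"])
```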
There are, of course, many other benefits, some of which are “unknown unknowns.” Having complete and rich metadata puts you in the best position to fulfil future, as-yet-undetermined requirements.
In 2024, we started introducing persistent grant IDs for our content. While we updated our submission system to collect these from authors, it's apparent that many authors aren't aware whether these have been registered by their funders, and they still provide us with the (internal) grant numbers instead.
Our workaround was to pull grant data from Crossref and then replace the grant numbers with the persistent IDs when we're confident of a match. Since the grant number registered at Crossref might not exactly match the grant number the authors have given us, potential matches are confirmed by a team member or our production vendors. Many organisations do a great job of creating informative landing pages (for example, EuropePMC for Wellcome funding), which makes this feasible, but we're investigating ways we can make this less manual while remaining careful that we don't introduce false positives.
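A hedged sketch of that matching step: compare each author-supplied grant number against award numbers from grant records pulled from Crossref, auto-accepting only near-certain matches and queueing everything else for human confirmation. The record structure below is illustrative, not Crossref's schema:

```python
from difflib import SequenceMatcher

def best_grant_match(author_number: str, registered: list[dict]) -> tuple[dict | None, float]:
    """Find the registered grant whose award number most resembles the
    author-supplied string. `registered` items are illustrative dicts
    like {"doi": ..., "award_number": ...} built from Crossref grant data."""
    def score(grant: dict) -> float:
        return SequenceMatcher(None, author_number.lower(),
                               grant["award_number"].lower()).ratio()
    if not registered:
        return None, 0.0
    best = max(registered, key=score)
    return best, score(best)

# Placeholder records standing in for grants pulled from Crossref.
grants = [
    {"doi": "10.35802/000001", "award_number": "000001/Z/01/Z"},
    {"doi": "10.35802/000002", "award_number": "000002/A/02/B"},
]

match, confidence = best_grant_match("000001", grants)
if confidence > 0.9:
    print("auto-link", match["doi"])
else:
    # Partial matches (like this one) go to a person for confirmation,
    # mirroring the manual check described above.
    print("queue for manual confirmation:", match["doi"], round(confidence, 2))
```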
Yes, I think this is something that is becoming increasingly visible. Authors are very mindful of the benefits that good metadata can bring for discoverability and promotion. And much is lost without the increased interoperability it brings, both for publishers themselves and for the wider ecosystem. For example, we've had some great feedback from numerous organisations that appreciate that the outputs we publish directly link to the preprints they are based on.
In recent years, there's been an increased focus on research integrity, and this is likely to remain the case. Metadata has an obvious and key role in providing trust and transparency, whether that's through the presence of trust markers like ORCID IDs or through the inclusion of complete post-publication metadata such as correction, retraction, or withdrawal information.
Several years ago, we introduced a "publish, review, curate" model of publishing, where we publish "Reviewed preprints" following each stage of review. We don't collect the same level of structured information from authors at submission for these as we do for Versions of Record. This presents a challenge for retrieving and disseminating complete metadata for Reviewed preprints. We aim to start moving this forward so that comprehensive metadata is available at earlier stages of the publication process. For example, we recently started depositing (some) funding metadata for these.
We're also keen to explore ways to make our eLife Assessments more discoverable. Our Editors use a common vocabulary to describe the significance of the findings and the strength of evidence in a paper. Other publishers moving beyond accept/reject publication models use different rubrics and taxonomies, so a single restrictive field in a schema for the entire corpus of research won't cut it. Nevertheless, making these terms more discoverable and interoperable would be preferable.
We've found the integration of public APIs and data (such as ROR's, Crossref's, PubMed's, and OpenAlex's) within our systems to be really helpful in validating the correctness and completeness of content and metadata. The effort of adding these integrations will pay dividends in the future.
Time to enjoy Fred's acceptance video.
Metadata Awards video - eLife