
Research Software Story: DOME Registry

The Problem

Addressing the reproducibility crisis in life science machine learning caused by inconsistent and opaque reporting practices

In the life sciences, the reuse of supervised machine learning (ML) research outputs is hampered by inconsistent, opaque, and non-reproducible methods, a consequence of poor reporting practices. Despite the wide availability of high-quality FAIR data and advanced algorithms for building these ML methods, individual researchers document their ML outputs (covering areas such as data, optimisation, models, and evaluation) in ad hoc, free-text formats, often omitting critical biological context and technical details.

This fragmentation in reporting makes it nearly impossible to reliably validate, compare, or build upon published ML models, creating a fundamental reproducibility crisis. Without a standardised, domain-aware reporting framework, even well-intentioned life science ML studies remain isolated, difficult to trust, and slow to translate into cumulative scientific progress. In response to these challenges, the DOME Registry was set up as a service to support the creation of standardised ML method disclosures that supplement publication text. It is now an operational solution that supports both researchers and journal publishers in producing high-quality, transparent ML methods.

The Community

A collaborative ecosystem connecting biological researchers, ML practitioners, and infrastructure providers

The intended user community of the DOME Registry is multifaceted. It is primarily composed of three interconnected groups with different DOME Registry usage needs:

  • Biological Researchers & Data Producers: These are experimentalists and bench scientists who generate or work with complex biological datasets (e.g., genomic, imaging, clinical) and wish to apply ML. They use the registry to discover validated ML approaches relevant to their domain, report their own work transparently, and ensure their models are reusable, thereby increasing the impact and credibility of their computational findings.
  • ML Practitioners & Bioinformaticians: This core user group includes data scientists, computational biologists, and ML engineers who develop and implement algorithms. They rely on the registry to structure and standardise their model reporting, benchmark their methods against community standards (DOME), and share their pipelines in a findable, citable way. For them, it is a tool for professional rigour and recognition.
  • Consumers & Integrators: This group includes journal editors, reviewers, funders, and infrastructure providers (like public databases). They use the registry as a curated validation aid to assess the reusability and reproducibility of submitted ML studies, to encourage or mandate standards compliance, and to integrate high-quality ML models into larger biological data resources and workflows.

Ultimately, the DOME Registry serves the broader ELIXIR and global life science community by fostering a culture of transparency. Its success depends on engaging all user groups—from the biologist seeking a trustworthy model to the developer sharing one—in a collaborative ecosystem dedicated to improving the reliability of ML-driven discovery.

Technical Aspects

The DOME Registry is implemented as a full-stack web application, specifically designed as research software infrastructure. Under the EVERSE categorisation levels, it is a production-grade, community-serving web platform rather than a prototype or analysis tool.

Technically, it is built with TypeScript across the stack, using the Angular framework for its dynamic single-page frontend and NestJS for its modular, service-based backend, with data stored in a MongoDB database. The codebase is actively maintained by the ELIXIR community, specifically the core maintainers at the University of Padua. Deployment is designed for standard web hosting: it requires a Node.js environment and a MongoDB instance, with no dependence on high-performance computing. Its primary infrastructural need is robust database storage for the growing corpus of curated ML metadata. In essence, the software is a database-backed API service with a rich client interface that systematically ingests, validates, and disseminates structured descriptions of biological machine learning workflows.
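To make the "ingests, validates" step more concrete, here is a minimal TypeScript sketch of a completeness check on a structured entry. It is not the registry's actual schema or code: the four section names follow the DOME areas (data, optimisation, model, evaluation), while the field names, the REQUIRED_FIELDS table, and findMissingFields are hypothetical and chosen only for illustration.

```typescript
// The four DOME areas named in the text; everything below them is assumed.
type DomeSectionName = "data" | "optimisation" | "model" | "evaluation";

// A section is modelled here as free-text answers keyed by field name.
type DomeSection = Record<string, string>;

// A draft may omit whole sections while it is being written.
type DomeEntryDraft = Partial<Record<DomeSectionName, DomeSection>>;

// Fields this sketch treats as required (hypothetical, not the real schema).
const REQUIRED_FIELDS: Record<DomeSectionName, string[]> = {
  data: ["provenance", "splits"],
  optimisation: ["algorithm"],
  model: ["availability"],
  evaluation: ["metrics"],
};

// Returns "section.field" paths that are missing or blank in the draft.
function findMissingFields(entry: DomeEntryDraft): string[] {
  const missing: string[] = [];
  for (const section of Object.keys(REQUIRED_FIELDS) as DomeSectionName[]) {
    const values: DomeSection = entry[section] ?? {};
    for (const field of REQUIRED_FIELDS[section]) {
      const value = values[field];
      if (value === undefined || value.trim() === "") {
        missing.push(`${section}.${field}`);
      }
    }
  }
  return missing;
}

// Illustrative draft: one blank field and one absent section.
const exampleDraft: DomeEntryDraft = {
  data: { provenance: "Public protein dataset", splits: "80/10/10" },
  optimisation: { algorithm: "Random forest" },
  model: { availability: "" },
};

console.log(findMissingFields(exampleDraft));
// prints [ 'model.availability', 'evaluation.metrics' ]
```

Returning a list of missing paths, rather than a single pass/fail flag, would let a submission form point the author at each gap in turn.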

Libraries and Systems

The software relies on a specific stack of modern web technologies and domain integrations.

Software Practices

Balancing established engineering standards with the challenges of rotational academic development

The DOME Registry is developed using established software engineering practices to ensure quality, transparency, and long-term maintainability. It uses Git for version control (hosted on GitHub) with a clear contribution policy, adopts open licensing (CC-BY 4.0), and follows a structured workflow and semantic versioning for releases.

  • Version Control: All source code is managed in a Git repository on GitHub (BioComputingUP/dome-registry), with feature development isolated in branches. The workflow resembles GitHub flow, although the deletion and deprecation of stale branches still needs improvement. The main branch reflects stable, production-ready code, and changes are integrated via pull requests by team members, with integrations and deployment managed by the lead developer.
  • Code Review & Testing: Code review is mandatory for all pull requests, ensuring adherence to standards and catching issues early. The lead developer ensures this step is completed and advises more junior developers on future contributions to build technical competencies across the team. There are currently no formalised testing practices, but this will be considered in the future as team bandwidth allows.
  • Decision Making: Significant technical or directional decisions are made through discussion within the ELIXIR AI Ecosystem Focus Group, often informed by user needs and community feedback. One major focus is breaking silos to interface the DOME Registry with new standards and other ELIXIR Europe services (e.g. Europe PMC). A Scientific Advisory Board (SAB) meets annually to review and inform the development plan, with an annual roadmap logged and output via Zenodo (first release planned for December 2025).
  • Communication: Formal changes are not yet tracked consistently through GitHub Issues and pull requests; work is in progress to formalise this. Real-time coordination occurs via team chat tools (Slack) and regular development meetings (Zoom).
  • Informal Practices: Informal yet crucial practices include ad-hoc troubleshooting chats, shared documentation of “tribal knowledge” in Google Drive workflow files, and a culture of mentorship and cross-training to sustain the project through contributor turnover. Trust and open communication are key to maintaining momentum.
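The semantic versioning mentioned above can be illustrated with a short TypeScript sketch that computes the next release number from the kind of change being released. This is not the project's release tooling: parseSemVer, nextVersion, and the ChangeKind labels are hypothetical names, and the sketch handles only plain MAJOR.MINOR.PATCH versions, without pre-release or build metadata.

```typescript
// A parsed MAJOR.MINOR.PATCH version.
interface SemVerParts {
  major: number;
  minor: number;
  patch: number;
}

// Parses "1.4.2" or "v1.4.2"; throws on anything else.
function parseSemVer(version: string): SemVerParts {
  const match = /^v?(\d+)\.(\d+)\.(\d+)$/.exec(version);
  if (match === null) {
    throw new Error(`Not a MAJOR.MINOR.PATCH version: ${version}`);
  }
  return {
    major: Number(match[1]),
    minor: Number(match[2]),
    patch: Number(match[3]),
  };
}

// Hypothetical labels for the kind of change in a release.
type ChangeKind = "breaking" | "feature" | "fix";

// Applies the semantic versioning bump rules:
// breaking -> MAJOR, feature -> MINOR, fix -> PATCH; lower parts reset to 0.
function nextVersion(current: string, change: ChangeKind): string {
  const { major, minor, patch } = parseSemVer(current);
  switch (change) {
    case "breaking":
      return `${major + 1}.0.0`;
    case "feature":
      return `${major}.${minor + 1}.0`;
    case "fix":
      return `${major}.${minor}.${patch + 1}`;
  }
}

console.log(nextVersion("1.4.2", "feature")); // prints 1.5.0
```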

Community

Streamlining the path for new developers and contributors through mentorship and guided tools

The DOME Registry is developed and maintained by a diverse, rotational team anchored at the University of Padua (UNIPD). The core development team consists of academic staff at various career stages, including two PhD researchers who provide multi-year stability and IT- and engineering-focused interns who contribute targeted technical work. Because of this rotational nature, the project relies on a strong “handover” culture in which knowledge is actively transferred to new contributors to maintain continuity. While UNIPD serves as the primary development hub, the infrastructure is supported by a mirror deployment at CERTH (ELIXIR Greece), which provides operational resilience without requiring a separate development team. Beyond the core staff, the community expands through international collaborations in ELIXIR projects and EOSC frameworks, such as the AI4EOSC project.

How DOME Registry onboards new developers:

  • Workflow: New developers, often interns or project partners, start by cloning the GitHub repository. Work is commonly organised via GitHub Issues, and GitHub Projects is being considered to bring code, strategy and management into one location.
  • Mentorship: Due to the “handover” nature of the project, new staff typically pair with senior PhD researchers or the lab technician to learn the codebase and deployment nuances.
  • Obstacles: The primary hurdle is the local environment setup. While documentation exists, the lack of full Docker containerisation means initial configuration of Node.js and database connections can be time-consuming. Work is underway to provide Docker support for both local development and full deployment to reduce this burden.

How DOME Registry handles new content contributions:

  • Submission Page: New contributors start at the Submission Page, which guides them to the DOME Wizard content generation platform.
  • The DOME Wizard: Users start with the integrated DOME Data Stewardship Wizard (DSW) instance, which guides them through reporting ML methods in a step-by-step process.
  • Training: Users rely on the DOME Training Course and specific video tutorials to understand the content generation and submission workflow.

Tools

Utilising a modern web development toolchain while working to bridge gaps in automation

The project employs a well-integrated toolchain to ensure high software quality and streamlined development.

  • Linting & Formatting: ESLint is used for code quality and Prettier for consistent code styling.
  • Version Control & Tracking: GitHub serves as the central hub for version control, issue tracking, and documentation.
  • Package Management: npm (the Node.js package manager) handles dependencies.
  • Productivity: GitHub Copilot is used to assist developers and accelerate progress.
  • Containerisation: Full containerisation with Docker is planned but not yet implemented. Currently, the environment is managed via detailed setup instructions and version-locked dependencies (package.json) to ensure consistency across local setups.
  • Future Plans: GitHub Actions, automated dependency updates via Dependabot (to maintain security with minimal effort), and AI code generation are being considered to accelerate development, though risks to codebase stability and knowledge retention are acknowledged.
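As a sketch of what checking for version-locked dependencies could look like, the TypeScript below flags entries in a dependency map whose version is a range rather than an exact MAJOR.MINOR.PATCH value. It is an illustration under assumptions, not part of the project's tooling: findUnpinnedDependencies is a hypothetical helper, and the package versions in the example are made up. Note also that npm records exact resolved versions in package-lock.json, so a real check might inspect the lockfile instead.

```typescript
// Shape of the "dependencies" object in a package.json file.
type DependencyMap = Record<string, string>;

// Matches an exact MAJOR.MINOR.PATCH version with no range operator.
const EXACT_VERSION = /^\d+\.\d+\.\d+$/;

// Returns the names of packages whose version is a range
// (for example "^1.2.3" or "~1.2.3"), sorted alphabetically.
function findUnpinnedDependencies(dependencies: DependencyMap): string[] {
  return Object.keys(dependencies)
    .filter((packageName) => {
      const range = dependencies[packageName];
      return range === undefined || !EXACT_VERSION.test(range);
    })
    .sort();
}

// Illustrative fragment: real package names, invented version numbers.
const exampleDependencies: DependencyMap = {
  "@angular/core": "^17.0.0",
  "@nestjs/common": "10.0.0",
  mongodb: "~6.0.0",
};

console.log(findUnpinnedDependencies(exampleDependencies));
// prints [ '@angular/core', 'mongodb' ]
```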

FAIR & Open

Adhering to FAIR principles for research software to ensure findability, accessibility, interoperability, and reusability

The DOME Registry aspires to fully adhere to the FAIR principles for research software, ensuring it is a reliable component of the Open Science ecosystem.

  • Findable: The software is indexed in major research meta-discovery platforms (e.g., FAIRsharing) and uses Bioschemas markup to make entry metadata harvestable. The project has been published in high-profile journals (Nature Methods, GigaScience) with persistent DOIs. Cross-linking to Zenodo is planned for archiving specific versions, and Europe PMC indexing is in progress.
  • Accessible: The registry is free to use via a web interface. All source code is available on GitHub (BioComputingUP/dome-registry) under an open license (CC-BY 4.0), allowing inspection of the underlying logic. A public REST API allows programmatic access to all public registry data without authentication barriers. Identifiers.org is used to ensure accessible entry resolution.
  • Interoperable: The platform uses standard data formats (JSON) and adheres to the DOME recommendations for ML reporting. It integrates seamlessly with the Data Stewardship Wizard (DSW). The separation of frontend (Angular) and backend (NestJS) allows for independent integration with other tools.
  • Reusable: The mirror deployment at CERTH demonstrates that the software is not tied to a single institution’s infrastructure and can be redeployed elsewhere. Documentation and permissive licensing ensure that the code and content can be reused, modified, and built upon by the wider scientific community.
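Harvestable entry metadata of the kind described under Findable is typically embedded in a page as schema.org JSON-LD. The TypeScript sketch below shows the general idea only. The RegistryEntrySummary fields, the toJsonLd helper, the choice of the schema.org CreativeWork type, the example.org base URL, and the entry identifier are all assumptions; the Bioschemas profile and properties the registry actually uses may differ.

```typescript
// Hypothetical summary of a registry entry; field names are assumed.
interface RegistryEntrySummary {
  identifier: string;
  title: string;
  publicationDoi: string;
}

// Builds a schema.org JSON-LD string for one entry.
// "CreativeWork" is a generic schema.org type picked for this sketch.
function toJsonLd(entry: RegistryEntrySummary, entriesBaseUrl: string): string {
  const markup = {
    "@context": "https://schema.org",
    "@type": "CreativeWork",
    identifier: entry.identifier,
    name: entry.title,
    url: `${entriesBaseUrl}/${encodeURIComponent(entry.identifier)}`,
    citation: `https://doi.org/${entry.publicationDoi}`,
  };
  return JSON.stringify(markup, null, 2);
}

// Placeholder identifier and base URL; the DOI is the registry paper's own.
const exampleEntry: RegistryEntrySummary = {
  identifier: "example-entry-1",
  title: "Example ML method annotation",
  publicationDoi: "10.1093/gigascience/giae094",
};

console.log(toJsonLd(exampleEntry, "https://registry.example.org/entries"));
```

On a web page, a string like this would normally be placed inside a script element of type application/ld+json so that harvesters can read it.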

Documentation

Empowering users and developers with comprehensive guides, video tutorials, and API references

Documentation is distributed across GitHub and the web platform to support both technical contributors and end-users.

Sustainability

Securing long-term viability through multi-institutional mirror deployments and dedicated core funding

The DOME Registry operates as a resilient multi-node service delivered jointly by UNIPD (ELIXIR Italy) and CERTH (ELIXIR Greece). This dual-institutional approach ensures operational redundancy, while a 10-year service guarantee provides long-term stability beyond typical grant cycles.

  • Financial Sustainability: The project is supported by a mix of permanent core funding and competitive grants. A key pillar is the ELIXIR NextGenIT grant (Jan 2024 - Dec 2028), which includes a signed guarantee from the University of Padua to provision the DOME Registry for the next 10 years. Other funding sources include ELIXIR ML SIS (Machine Learning Strategic Implementation Study), EVERSE EC, and STEERS EC.
  • Operational Maintenance: The core infrastructure is maintained by the Biocomputing Lab at UNIPD with a dedicated full-time technician. Services are partially mirrored (front end) at CERTH to ensure high availability and fault tolerance.
  • Content Strategy: Current curation is manual, but scaling plans involve implementing LLM-based triage models. Active integration with publishers (e.g., GigaScience, Gigabyte) ensures a continuous stream of new content.
  • Risk Mitigation: Technical failure is mitigated by the mirror setup and regular backups. Staffing gaps are addressed by permanent technician roles and overlapping PhD cycles. In the unlikely event both labs cease operations, the open GitHub repository allows the community to fork and redeploy.

References

The project is shaped by community standards, key tools, and academic publications.

  • Code & Registry: GitHub repository (BioComputingUP/dome-registry)
  • Primary Publication: Attafi, O. A., et al. (2024). DOME Registry: implementing community-wide recommendations for reporting machine learning in biology. GigaScience. DOI: 10.1093/gigascience/giae094
  • Community Standards: Walsh, I., et al. (2021). DOME: recommendations for machine learning validation in biology. Nature Methods. DOI: 10.1038/s41592-021-01205-4
  • Integrated Tool: Pergl, R., et al. (2019). “Data Stewardship Wizard”: A Tool Bringing Together Researchers, Data Stewards, and Data Experts around Data Management Planning. Data Science Journal. DOI: 10.5334/dsj-2019-059

Tools and resources on this page

  • Dependabot: Generates automated pull requests that update project dependencies.
  • Docker: A tool for creating isolated environments, called containers, so that software runs consistently across platforms. Docker allows developers to build, share, run and verify applications easily. DockerHub is a repository for sharing and managing container images.
  • Git: A distributed version control system designed to handle everything from small to very large projects with speed and efficiency.
  • GitHub: A platform that allows developers to create, store, manage, and share their code. It uses Git to provide distributed version control, and offers access control, bug tracking, software feature requests, task management, continuous integration, and wikis for every project.
  • GitHub Actions: GitHub's infrastructure for continuous integration, deployment and delivery.
  • GitHub Copilot: A code completion and automatic programming tool developed by GitHub and OpenAI that assists users of Visual Studio Code, Visual Studio, Neovim, and JetBrains integrated development environments (IDEs) by autocompleting code.
  • Prettier: A code formatter that enforces a consistent style with its own rules for different languages, including JavaScript, TypeScript, Flow, JSX, JSON, CSS, SCSS, Less, HTML, Vue, Angular, GraphQL, Markdown, and YAML.
  • TypeScript: A strongly typed superset of JavaScript that adds static typing and advanced tooling capabilities, enhancing code quality and developer productivity. Designed for building scalable applications, TypeScript compiles to plain JavaScript, ensuring compatibility with existing JavaScript environments.
  • Zenodo: A general-purpose open repository developed under the European OpenAIRE program and operated by CERN. It allows researchers to deposit research papers, data sets, research software, reports, and other research-related digital artefacts.