German Medical Text Corpus
Many texts are produced in the course of routine patient care, including admission notes and diagnostic reports. These contain valuable information on a patient’s case history, disease progression and treatment. However, because this documentation is unstructured and not machine-readable, its full potential cannot currently be exploited. AI-based natural language processing (NLP) methods offer a way to structure the information contained in clinical documentation automatically, supporting doctors and researchers in their work.
Training such NLP models requires the use of anonymised data, very little of which is currently available in German. This is the gap that the GeMTeX methodology platform is designed to fill. Its goal is to prepare clinical documentation created during patient care in a machine-readable format, thus making it available for use in medical research projects. Using clinical documentation from the information systems of six university hospitals, GeMTeX aims to create the largest anonymised medical text training corpus in the German language, enriched with annotations of entities and relations. Strong governance will provide a stable legal framework that will allow the corpus to be used in accordance with the regulations of the Medical Informatics Initiative (MII). GeMTeX will employ state-of-the-art NLP methods to build, pre-annotate and annotate the corpus and to train language models.
Automated indexing of medical texts for research
GeMTeX’s core mission is to collect a balanced mix of different types of clinical documentation from different medical disciplines and to make these accessible for NLP. De-identification and annotation of these texts are carried out by trained teams, who also add meta-information. The GeMTeX project also includes the creation of annotation guidelines to guarantee the anonymity of the training texts and the quality of the annotations.
Tools and methods for automated information extraction will be created alongside the text corpus, and cutting-edge deep-learning models will be trained and validated for specific application scenarios.
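To make the idea of de-identification followed by entity annotation more concrete, the following is a minimal sketch in Python. It is not the GeMTeX pipeline: the regular expressions, the placeholder labels (`DATE`, `NAME`), the lexicon, the entity labels and the example note are all hypothetical and chosen purely for illustration. Real clinical de-identification relies on trained annotators and far more robust methods.

```python
import re
from dataclasses import dataclass


@dataclass
class Entity:
    """A character-offset annotation on a text (hypothetical schema)."""
    start: int
    end: int
    label: str
    text: str


# Hypothetical, deliberately simplistic patterns for identifying information.
PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}\.\d{1,2}\.\d{4}\b"),
    "NAME": re.compile(r"\b(?:Herr|Frau|Dr\.)\s+[A-ZÄÖÜ][a-zäöüß]+"),
}


def deidentify(text: str) -> str:
    """Replace every pattern match with a bracketed placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


def annotate(text: str, lexicon: dict[str, str]) -> list[Entity]:
    """Mark every lexicon term found in the text, ordered by position."""
    entities = []
    for term, label in lexicon.items():
        for match in re.finditer(re.escape(term), text):
            entities.append(
                Entity(match.start(), match.end(), label, match.group())
            )
    return sorted(entities, key=lambda e: e.start)


# Invented example sentence, not taken from any real record.
note = (
    "Frau Müller wurde am 03.05.2021 mit Pneumonie aufgenommen "
    "und erhielt Amoxicillin."
)
clean = deidentify(note)
entities = annotate(
    clean, {"Pneumonie": "DIAGNOSIS", "Amoxicillin": "MEDICATION"}
)

print(clean)
for entity in entities:
    print(entity.label, entity.text, entity.start, entity.end)
```

Storing annotations as character offsets alongside the de-identified text, rather than inside it, keeps the text itself unchanged and lets several annotation layers coexist.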
ZB MED’s role in the project
ZB MED is involved in several aspects of the GeMTeX project, including the development of annotation guidelines, the creation of quality control mechanisms for annotations, and the specification of the conditions under which access will be granted to the corresponding corpora. ZB MED’s goal is to establish standards for the discoverability and sustainable reusability of medical training corpora. To further improve its pipeline for medical information extraction, ZB MED will also use the training corpus created within GeMTeX to train and evaluate various NLP models, including large language models (LLMs).
Duration
1 January 2024 to 31 December 2025
Funding body
German Federal Ministry of Education and Research
Partners
- Charité - Universitätsmedizin Berlin
- ID GmbH & Co. KGaA
- Technische Universität Darmstadt
- Technische Universität Dresden
- Universitätsklinikum Erlangen
- Universitätsmedizin Essen
- Averbis GmbH
- Medizinische Hochschule Hannover
- Universitätsklinikum Heidelberg
- Universität Leipzig
- Ludwig-Maximilians-Universität München
- Technische Universität München
- Universität Münster
- Hasso-Plattner-Institut für Digital Engineering gGmbH
- Universitätsklinikum Tübingen
- Medizinische Universität Graz (associated partner)