GeMTeX

German Medical Text Corpus

Routine patient care produces many texts, including admission notes and diagnostic reports. These contain valuable information on a patient’s case history, disease progression and treatment. However, because this documentation is unstructured and not machine-readable, its full potential cannot be exploited. AI-based natural language processing (NLP) methods offer a way to structure the information in clinical documentation automatically, supporting doctors and researchers in their work.

Training such NLP models requires the use of anonymised data, very little of which is currently available in German. This is the gap that the GeMTeX methodology platform is designed to fill. Its goal is to prepare clinical documentation created during patient care in a machine-readable format, thus making it available for use in medical research projects. Using clinical documentation from the information systems of six university hospitals, GeMTeX aims to create the largest anonymised medical text training corpus in the German language, enriched with annotations of entities and relations. Strong governance will provide a stable legal framework that will allow the corpus to be used in accordance with the regulations of the Medical Informatics Initiative (MII). GeMTeX will employ state-of-the-art NLP methods to build, pre-annotate and annotate the corpus and to train language models.

Automated indexing of medical texts for research

GeMTeX’s core mission is to collect a balanced mix of different types of clinical documentation from different medical disciplines and to make these accessible for NLP. De-identification and annotation of these texts are carried out by trained teams, who also add meta-information. The GeMTeX project also includes the creation of annotation guidelines to guarantee the anonymity of the training texts and the quality of the annotations.

Tools and methods for automated information extraction will be created alongside the text corpus, and cutting-edge deep-learning models will be trained and validated for specific application scenarios.

ZB MED’s role in the project

ZB MED is involved in several aspects of the GeMTeX project, including the development of annotation guidelines, the creation of quality control mechanisms for annotations, and the specification of the conditions under which access to the resulting corpora will be granted. ZB MED’s goal is to establish standards for the discoverability and sustainable reusability of medical training corpora. To further improve its pipeline for medical information extraction, ZB MED will also use the training corpus created within GeMTeX to train and evaluate various NLP models, including large language models (LLMs).

Duration

1 January 2024 to 31 December 2027

Funding body

German Federal Ministry of Education and Research
