The development phase
The development of the basic concept and the curriculum was monitored by the Research Data Working Group of the U Bremen Research Alliance (UBRA), in which UBRA member institutions, consortia of the National Research Data Infrastructure (NFDI) as well as Research Data Management (RDM) and Data Science (DS) initiatives are represented. After an intensive review of existing training concepts and curricula (e. g. the position paper on DS learning contents of the Society of Computer Science (Society of Computer Science, 2019), the FAIRsFAIR teaching and training handbook for higher education institutions (Engelhardt et al., 2022) and the EDISON DS competence framework (Demchenko et al., 2018)), a qualitative survey assessing the status quo and training demands in participating institutions and initiatives was carried out in 2020. We conducted more than 70 guided interviews to assess the competence demands in RDM and DS of different status groups and sectors (management, infrastructure, and research). The focus was on (senior) scientists from all scientific disciplines of UBRA, representatives of the research data repositories located in Bremen (PANGAEA, QUALISERVICE) and coordinators of graduate programmes.
Based on this qualitative survey, a quantitative survey was developed to complement the more narrative information from the interviews with data on the extent and range of the required teaching components. For this purpose, more than 6,000 people, including researchers (doctoral researchers, postdocs and professors) and infrastructure personnel, were approached, and more than 400 questionnaires were completed.
The Data Train model
Requirements (R): The training should provide researchers with basic knowledge in both competency areas, RDM and DS (R1). It is offered in addition to structured doctoral programmes on a voluntary basis. Therefore, the limited time resources of doctoral researchers (R2), their strongly varying thematic orientation (R3), and their different levels of prior knowledge (R4) need to be considered. The dynamic development in RDM and DS also requires high flexibility with respect to the courses offered and their content (R5). Moreover, early career researchers benefit from contacts with peers and senior researchers across disciplines and institutes as well as with partners such as companies or NFDI consortia (R6), and from a written confirmation of participation (R7). In addition, specific support as well as training on demand are essential for a successful implementation of novel RDM concepts and DS methods by the participants in their daily research workflows (R8).
Operative implementation: In terms of content, the training covers the entire data-value chain, i. e., all steps of the data life cycle, including DS methods and the required technical infrastructure, as shown by Stodden (2020) (R1). In this way, the synergy effects and innovation potential of both fields, RDM and DS, can be exploited. Training modules follow a thematic structure according to the data life cycle. However, each module is a stand-alone component (R5). This makes the curriculum easy to modify and allows doctoral researchers to select courses according to their individual needs (“course picking principle”) (R2, R3). In addition, modules are structured according to level of difficulty and topic, RDM or DS (R3, R4). Participation in modules is confirmed after completion of the respective course series by a ‘Certificate of Participation’ (R7).
Various training formats offer a platform for networking and socialising for participants, lecturers and invited speakers (R6). These formats (e. g. self-learning, informal meetings for individual support, summer schools, inspiring talks by partners with networking opportunities) also make it possible to offer individual support as well as training on demand (R8).
Currently, the programme is composed of:
- The curriculum consisting of three course series, the so-called tracks (Starter Track and two Operator Tracks),
- Inspiring talks from the private sector and academia, accompanied by networking opportunities (Data Stories).
In addition, we would like to establish further components such as:
- Additional modules for flexibly serving current topics and demands (Highlight Modules),
- A compilation of existing digital training materials for self-learning (Digital Toolkit),
- Informal meetings to offer a platform for topic-specific consulting by experts and to foster networking and exchange across disciplines (Hacky Hours),
- A summer school on up-to-date data handling concepts and analysis methods for specific data types (Data Factory).
The curriculum is the main driver of the training programme. There are three tracks consisting of individual courses that doctoral researchers can follow: a track for beginners (Starter Track, level 1: “understand”) with overview lectures on generic topics of both fields, RDM and DS, and two tracks (Operator Tracks, level 2: “apply”) with hands-on workshops, thematically divided into the Data Steward Track (mainly RDM competences) and the Data Scientist Track (mainly DS competences) (cf. Figure above).
Each iteration of the curriculum starts with a kick-off event that introduces the programme (its scope, its concept, and its connection to the NFDI) and organisational aspects to the new cohort of participants. Moreover, introductory talks on DS applications and on the importance and benefits of RDM are intended to give first insights and strengthen the participants' interest in data competences.
The Starter Track aims at introducing doctoral researchers to the generic topics of both fields of competence (RDM and DS) and at conveying basic knowledge in an overview-like manner. Moreover, it is intended for anyone who wants to get started in RDM and DS; its online or hybrid format allows high capacities, so it is open to everyone. This track serves as a guide for the later selection of the more time-consuming workshops of the Operator Tracks and therefore takes place at the beginning of each run. Basics of the following topics are taught during the Starter Track: Data Science, Big Data, Statistics, Computer Science (programming languages, cryptography, privacy and system security), RDM (FAIR principles, data protection and copyright, management of sensitive data (i. e. industrial data and personal data), management of qualitative data).
The Operator Tracks are tailored to the specific needs of Data Stewards and Data Scientists.
The tasks of a Data Steward differ depending on the discipline, the area of application in science and the local conditions. There is agreement that Data Stewards should implement the FAIR principles; according to Scholten et al. (2019), a FAIR Data Culture needs to be established in different sectors of science: (1) infrastructure, (2) management (at a higher policy level), (3) research. Following this approach, the Data Steward Track focuses on the research sector and the essential competencies that researchers need for the FAIR handling of data: programming skills (R, Python, MATLAB), workflows in RDM (data management plans, reproducibility, data preparation/processing), and innovative software and IT applications (database skills, data provision and extraction from online platforms), especially for collaboration (Git/GitHub). This track does not replace professional training for discipline-specific Data Stewards, who act in an advisory, supportive and coordinating role in scientific institutions (cf. Figure below).
Data Scientists typically focus on how data are prepared, processed and analysed, and how results are transformed into decisions (Society of Computer Science, 2019). They need competencies in statistics, mathematics and computer science, as well as the respective domain expertise (cf. Figure below). The contents of the Data Scientist Track are: quantitative analysis methods, machine learning (ML) methods, deep learning methods, methods to evaluate techniques, algorithms and results, data visualisation and visual analytics.
In addition, Critical Thinking is deemed necessary to make appropriate and informed decisions about the processing, sharing, and use of data. Moreover, across disciplines, a common language should be established that is aware of potential limitations. This leads to better acceptance and greater openness towards novel working paths, research approaches and technologies. The ethical and legal frameworks of both RDM and Data Science should be discussed and critically reflected. Therefore, Critical Thinking is an important component of the entire curriculum and is addressed by all three tracks (e. g. by courses on philosophical reflections on data science, meaningfulness of data, digital ethics, asking the right research questions in data science, evaluating AI/ML algorithms, and philosophical issues related to data visualisation).
In the Data Stories, participants listen to inspiring stories about data handling, data management, and data science applied in the private sector or in academia. The speakers shed light on the importance of data competences in their individual working fields and discuss current challenges. These stories raise awareness, provide insights into different working fields, techniques, and applications, and foster the development of new ideas that the participants may use for their research questions. Each story ends with an opportunity for discussion and a get-together so that participants can build a cross-disciplinary and cross-institutional network, even beyond science.