The development phase
The development of the basic concept and the curriculum is accompanied by the Research Data Working Group of the U Bremen Research Alliance, in which all member institutions are represented. After an intensive review of existing external training concepts and curricula, a qualitative survey of the status quo and training demands in the U Bremen Research Alliance was carried out in 2020. In more than 70 guided interviews, competence needs in Research Data Management and Data Science were assessed among members of different status groups and sectors (management, infrastructure and research). The focus was on (senior) scientists from all scientific disciplines, representatives of the research data repositories located in Bremen (PANGAEA, Qualiservice) and coordinators of internal and external graduate programs, who are able to derive qualification needs from their daily work.
The qualitative approach was complemented by a quantitative survey. The information from the surveys forms the basis for the development of the target group-specific Data Train concept and the curriculum. It is also the basis for future topic-related activities (e.g., support services and cooperation projects) in the U Bremen Research Alliance.
The Data Train model
Data Train is currently composed of two training components:
Courses (curriculum, arranged in "Tracks")
Invited talks from the private sector and academia ("Data Stories")
The Data Train training program is offered to doctoral researchers across institutes and disciplines, parallel to doctoral studies and structured doctoral programs. Because doctoral researchers have limited time, differing levels of prior knowledge, and strongly varying thematic orientations, the curriculum should:
Focus on basic, discipline-agnostic competencies in Research Data Management (RDM) and Data Science (DS)
Be flexible and offered on a voluntary basis
In terms of content, the curriculum covers the entire data value chain, i.e., all the steps of the data life cycle, including data science applications, as shown in Stodden (2020). The skill set concentrates on the competences researchers need. The individual courses are structurally oriented to the data value chain, but can also stand alone. This makes it easy to modify the curriculum and lets PhD students select individual courses ("course picking principle"). In addition, the courses are divided according to their level of difficulty and field of competence.
Accordingly, there are three training paths consisting of individual courses (referred to below as "tracks") that doctoral students can follow: a path for beginners ("Starter Track", level 1: "understand") with overview lectures on superordinate topics of both fields of competence, and two tracks ("Operator Tracks", level 2: "apply") with hands-on workshops, thematically divided into the "Data Steward Track" and the "Data Scientist Track" (see figure below).
The Starter Track is intended to introduce doctoral researchers to the overarching topics of both fields of competence (RDM and DS) and to convey basic knowledge in overview form (lectures of 2-3 hours). It serves as a guide for selecting the more time-consuming workshops of the Operator Tracks (average duration: 2-3 days) and therefore takes place at the beginning of each run of the curriculum.
For the conceptual design of the Operator Tracks, it is necessary to identify the core competencies of both fields along the data value chain. The Operator Tracks therefore take place one after the other.
Selection of the trained competences
The tasks of a data steward differ depending on the discipline, the area of application within science and local conditions. There is agreement that data stewards should implement the FAIR principles. According to Scholten et al. (2019), the FAIR "data culture" needs to be implemented in different sectors of science: A) Infrastructure, B) Management (higher-level policy), C) Research.
Following this approach, the Data Train training focuses on the research sector and, in a target group-specific manner, on the essential competencies that researchers need for the FAIR handling of data: programming skills (R, Python, MATLAB), workflows in RDM (data management plans, reproducibility, data preparation/processing), and innovative software and IT applications (database skills, data provision or extraction from online platforms), especially for collaborative work (Git/GitHub). The training provided as part of the Data Steward Track is not equivalent to training for professional data stewards. The training for professional, subject-specific data stewards, who act in an advisory and coordinating capacity in scientific institutions, is not (entirely) covered by the Data Train program.
Data scientists need competencies in statistics, mathematics and computer science, as well as the respective domain expertise (see figure below). The thematic selection of these competencies is based on the status quo and demand assessment conducted in the U Bremen Research Alliance as well as on existing curricula and position papers (e.g. position paper of the Society of Computer Science, FAIRsFAIR and EDISON competence framework). Because the stated needs were incorporated into the selection of curricular content, the selection is to some extent specific to the research and requirement spectrum present in Bremen (quantitative analysis methods, machine learning (ML) methods, deep learning methods, methods to evaluate techniques, algorithms and results, data visualization and visual analytics).
In addition, the curriculum fosters "critical thinking", which is deemed necessary for appropriate and responsible decision-making about the processing, sharing, and use of data. It is therefore an important component of the entire curriculum and is addressed by all three tracks (e.g. philosophical reflections on data science, meaningfulness of data, digital ethics, asking the right research questions in data science, evaluating AI/ML algorithms, philosophical issues related to data visualization).
Across disciplines, a common language should be established that is aware of potential limitations and difficulties. This leads to more informed acceptance of, and openness to, novel working paths, research approaches and technologies. The ethical and legal frameworks of both RDM and Data Science should be conveyed.