The DataHub Architecture
This section gives an overview of the DataHub: its structure and services, how they interact, and what purposes they serve. If you are new to the DataHub or have no experience working with git or the HPC environment, please contact the DataHub Team or your local Data Stewards before you start using it.
The DataHub builds on its own storage and compute resources and cooperates with the University Computer Center in Marburg. It enables affiliated researchers to work in a common infrastructure with similar workflows and thus allows efficient collaboration.
Compute Infrastructure
The DataHub compute resources are integrated into and managed by a larger high-performance computing (HPC) cluster, the Marburg Compute Cluster (MaRC3). Depending on your needs and experience, there are different ways to access and use the compute resources, some of which are shown in Figure 1.

Figure 1: Ways to use the compute infrastructure: 1) Access the JupyterHub via web browser (HTTPS) to create an interactive JupyterLab session on the cluster, 2) access the cluster directly via command line interface and SSH to run jobs via the management system SLURM.
For users who like to work interactively with their data, the JupyterHub might be a good choice. It can be accessed at https://jupyter-hpc.uni-marburg.de and allows you to use JupyterLab / Jupyter Notebooks for comprehensive, multi-language data science. For more information, see the JupyterHub section.
Users who need more computational power can establish an SSH connection and work in the Linux shell of MaRC3. You can access MaRC3 from the command line using:
ssh -p 223 <username>@marc3a.hrz.uni-marburg.de
Compute jobs can be submitted directly to the workload management system (SLURM), which lets you specify precisely the kind and amount of computing power you need. This approach also allows you to scale your processing, especially by parallelizing it. For more information, see the HPC section.
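As a rough illustration of what submitting a job to SLURM can look like, the sketch below writes a small batch script and submits it if `sbatch` is available. The file name `example_job.sh` and all resource values are placeholders chosen for this example; partition or account options are left out because the MaRC3-specific names are not given here (see the HPC section).

```shell
#!/bin/bash
# Write a minimal SLURM batch script. All resource values are illustrative
# assumptions, not MaRC3 recommendations.
cat > example_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=example        # name shown in the queue
#SBATCH --ntasks=1                # one task (process)
#SBATCH --cpus-per-task=4         # CPU cores for that task
#SBATCH --mem=8G                  # memory per node
#SBATCH --time=01:00:00           # wall-clock limit (HH:MM:SS)
#SBATCH --output=example_%j.out   # %j expands to the job ID

echo "Running on $(hostname) with ${SLURM_CPUS_PER_TASK:-1} CPU cores"
EOF

# Check the script for shell syntax errors before submitting.
bash -n example_job.sh

# Submit only where SLURM is actually installed (e.g. on a cluster login node).
if command -v sbatch >/dev/null 2>&1; then
    sbatch example_job.sh
else
    echo "sbatch not found; run this on a MaRC3 login node"
fi
```

Once a job is submitted, `squeue -u "$USER"` lists your pending and running jobs. For workloads that split into many independent pieces, SLURM job arrays (`--array`) are one common way to parallelize.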
Storage Infrastructure
Important Note:
The TAM GitLab™ for use within the DataHub is now available (Marburg University network / VPN required). We encourage all DataHub users to store and manage their research data and code on the TAM GitLab™! For publication and external findability of research output, the TAM DataHub Repository (DSpace) is currently being set up, and test operation will start shortly.
These services replace our previous local GIN platform, because further software development of the GIN project and its sustainable operation were not foreseeable.
Data in the TAM GitLab™ are stored in a secured high-performance computing infrastructure, the Marburg Storage Cluster (MaSC). The storage cluster contains the TAM storage resources, which currently amount to about 700 TB of total usable capacity. The MaSC is directly connected to the MaRC3 HPC cluster, which allows fast data transfer for computation. Figure 2 illustrates ways to access the common storage.

Figure 2: Routes to access the storage infrastructure.
Users may interact with the TAM GitLab™ via web browser or command line interface. Interaction with GitLab™ is also possible, and potentially faster, when working on MaRC3, either directly via SSH or through the JupyterHub. Git LFS is used for efficient file storage in a git-based workflow. Working on MaRC3 also gives access to additional storage pools such as /home or /masc_shared (see the HPC section for details). Client-side software may be used to conveniently browse and interact with the MaRC3 file systems (including MaSC) via SSH. In special cases, mounting /masc_shared via SMB is also possible but not recommended.
Versioning and Provenance Tracking
To avoid getting lost in your own or other people's data, we believe that version control and tracking of data provenance are the most important foundations. At the DataHub we decided to use git for this purpose and have set up a GitLab™ instance for TAM, which, in conjunction with Git LFS, enables efficient workflows with git. While plain git is optimized for working collaboratively on code (or text in general), Git LFS extends it to handle large binary data as well (such as videos or other recordings).
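To make the interplay of git and Git LFS more concrete, here is a minimal sketch that sets up a throwaway repository and tells Git LFS to manage video files. The repository name `lfs-demo` and the `*.mp4` pattern are assumptions made for this example; which file types to track depends on your project.

```shell
#!/bin/bash
# Create a throwaway repository (the name "lfs-demo" is a placeholder).
git init -q lfs-demo

# Git LFS is a separate extension; only use it if it is installed.
if git lfs version >/dev/null 2>&1; then
    # Enable the LFS hooks and filters for this repository only.
    git -C lfs-demo lfs install --local
    # Store all .mp4 files through LFS; this writes a rule to .gitattributes.
    git -C lfs-demo lfs track "*.mp4"
    # The rule file itself is versioned like any other file.
    git -C lfs-demo add .gitattributes
    cat lfs-demo/.gitattributes
else
    echo "git-lfs is not installed; skipping the LFS setup"
fi
```

After this setup, matching files are added and committed with the usual `git add` and `git commit`; the repository history then holds small pointer files, while the large contents are stored separately by LFS.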
Git helps you to track, reproduce and integrate different versions of a dataset both in a sequential way ("history") and also in a parallel fashion ("branches"). This means you can try different variants of an analysis in parallel, work with them, and keep a reusable record of how these variants evolved (provenance tracking).
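The idea of sequential history and parallel branches can be sketched with a few git commands. The repository name `analysis-demo`, the file `params.txt`, its contents, and the branch name `variant-strict` are all made up for this illustration.

```shell
#!/bin/bash
# Create a throwaway repository with a local demo identity for commits.
git init -q analysis-demo
git -C analysis-demo config user.name "Demo User"
git -C analysis-demo config user.email "demo@example.org"

# First commit: a baseline version of some analysis parameters.
echo "threshold=0.5" > analysis-demo/params.txt
git -C analysis-demo add params.txt
git -C analysis-demo commit -q -m "Add baseline analysis parameters"

# Try a variant on its own branch without touching the baseline.
git -C analysis-demo checkout -q -b variant-strict
echo "threshold=0.8" > analysis-demo/params.txt
git -C analysis-demo commit -q -am "Try a stricter threshold"

# Show how both lines of work evolved (the provenance record).
git -C analysis-demo log --oneline --graph --all

# Switch back to the previous branch; the baseline file is restored.
git -C analysis-demo checkout -q -
cat analysis-demo/params.txt
```

Both variants remain available side by side, and the log records when and why each one diverged.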
Data Organization and Description
To make understanding and re-using data and code easier, we ask participating researchers to adhere to some common principles of data organization and description.
We ask affiliated researchers to use template folder structures that follow the philosophy of the so-called “TONIC” template for project repositories. For actual data repositories, we strongly recommend using the BIDS standard whenever possible and as early in the data processing chain as is reasonable. We also ask you to provide at least basic metadata for each dataset that is uploaded to the shared storage.
→ Please find details and templates in the Data and Code Management section.
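For orientation, the following sketch creates the bare skeleton of a BIDS-style dataset with minimal metadata. The directory name `bids-demo`, the dataset name, the version string, and the single subject are placeholders; this is not the DataHub template, which is described in the Data and Code Management section.

```shell
#!/bin/bash
# One subject with an anatomical imaging folder (names are placeholders).
mkdir -p bids-demo/sub-01/anat

# dataset_description.json: "Name" and "BIDSVersion" are the required fields.
cat > bids-demo/dataset_description.json <<'EOF'
{
  "Name": "Example dataset (placeholder)",
  "BIDSVersion": "1.8.0"
}
EOF

# participants.tsv: tab-separated, one row per subject, "n/a" for unknown values.
printf 'participant_id\tage\nsub-01\tn/a\n' > bids-demo/participants.tsv

# A README is the place for a free-text description of the dataset.
touch bids-demo/README

# Imaging files would follow the BIDS naming scheme,
# e.g. bids-demo/sub-01/anat/sub-01_T1w.nii.gz
find bids-demo | sort
```

The open-source BIDS Validator can be used to check whether a real dataset conforms to the standard.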
Because TAM is largely about collaborating and sharing data sustainably, both internally and with a larger community, common data organization and description are very important.