
DataHub workflow instructions and tutorials

The following pages offer practical instructions on how to use the DataHub services for preparing, uploading, computing, downloading, and sharing your data. Written instructions as well as tutorial videos are provided.

However, this manual is meant to be used after you have consulted the local Data Stewards and received an introduction to all services, e.g., in a workshop. The following pages can only give general instructions; the workflow might differ for your specific case. For comprehensive background information on all services, see the services pages.

You can get a first impression of the DataHub workflows from our demonstration videos.

Access to the DataHub

To use the DataHub services, you need a staff account at the University of Marburg and the permission to use the DataHub services. Please see the section Access for information on how to get an account and/or permissions.

Required Installations

To make the most of the GitLabTM service, we highly recommend also using Git as a local version control system. It makes it easy to upload and download your data and code from your local computer. Manual uploads to GitLabTM can be slow, depending on the size of the files, and are only possible file by file. With Git installed, you push only the changes in your repository to GitLabTM, which is faster and saves memory and storage space. Also, if you don't use Git, you always have to download the full project from GitLabTM to get the latest version of your GitLabTM project (e.g., when a collaborator has modified the project). With Git, you pull only the latest updates.
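
As a rough illustration of the push/pull cycle described above, the following sketch uses a local bare repository as a stand-in for a GitLab project, so it runs without a network connection. The directory names, the file name analysis.txt, and the commit messages are made up for this example; in practice you would clone your project from the TAM GitLabTM instead.

```shell
#!/bin/sh
# Sketch only: a local bare repository plays the role of the GitLab project.
work=$(mktemp -d)
git init -q --bare "$work/project.git"

# You: clone the (still empty) project, add a file, and push it.
git clone -q "$work/project.git" "$work/you"
cd "$work/you"
git config user.name "Example Name"
git config user.email example@example.com
echo "first version" > analysis.txt
git add analysis.txt
git commit -q -m "Add analysis"
git push -q origin HEAD

# Collaborator: clone the project once.
git clone -q "$work/project.git" "$work/collaborator"

# You: change the file and push only this change.
echo "second version" >> analysis.txt
git commit -q -am "Update analysis"
git push -q origin HEAD

# Collaborator: pull fetches only the new commit, not the whole project again.
cd "$work/collaborator"
git pull -q
cat analysis.txt
```

After the final pull, the collaborator's copy of analysis.txt contains both lines, although the project was cloned only once.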

Git should be installed on every platform you actively work with during your project. This can be your office computer, your laptop, the computer that acquires the data, or MaRC3a of the DataHub. The advantages of using MaRC3a are outlined here; you can access it with your UMR staff account.

Git Configuration

Please download and install Git on your local computer. MaRC3a already has a running Git installation.

After installation, you need to configure Git. Please type the following commands one after another in your command line / terminal (note: if you want to work with Git on your local computer as well as on MaRC3a, you have to apply these configurations on both machines):

git --version
→ this prints the version number of the installed Git, which confirms that the installation was successful.
git config --global user.name "Example Name"
→ quotation marks are needed.
git config --global user.email example@example.com
git config --global init.defaultBranch main
git config --global core.editor ExampleEditor
→ this sets the default text editor that Git opens, e.g., for commit messages. Replace ExampleEditor with the editor you want to use. Our recommendation is nano or vim. Both are command-line editors that are usually already on your machine, so you don't have to download anything. Because the Git commands also run in the command line, you can write commit messages in the same window (i.e., your terminal) and don't have to deal with a separate window opening and closing. Of course, you can use any text editor you want. For example, to make nano your global Git editor, type git config --global core.editor nano.

git config --global core.autocrlf input
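This configuration solves the problem of different end-of-line conventions between Mac/Linux (LF) and Windows (CR LF): with the value input, Git converts CR LF to LF when you commit and leaves line endings unchanged on checkout. This can be very helpful in mixed teams.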
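
To see what this setting does, here is a small sketch that commits a file with Windows line endings in a throwaway repository and then counts the carriage-return (CR) bytes. The repository, the file name notes.txt, and the repository-local configuration are assumptions for this demo; your real setup uses the --global setting above.

```shell
#!/bin/sh
# Sketch only: try core.autocrlf=input in a temporary repository.
demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.name "Example Name"
git config user.email example@example.com
git config core.autocrlf input   # repository-local, so the demo does not touch your global config

# Create a text file with CR LF line endings, as a Windows editor would.
printf 'line one\r\nline two\r\n' > notes.txt
git add notes.txt                # Git may warn that CRLF will be replaced by LF
git commit -q -m "Add notes"

# Count CR bytes in the committed version and in the working copy.
committed_cr=$(git show HEAD:notes.txt | tr -cd '\r' | wc -c | tr -d ' ')
working_cr=$(tr -cd '\r' < notes.txt | wc -c | tr -d ' ')
echo "CR bytes in committed version: $committed_cr"
echo "CR bytes in working copy:      $working_cr"
```

The committed version should contain no CR bytes, while the file in your working directory is left as you wrote it.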

git config --global pull.rebase false
git config --global pull.merge true
cat ~/.gitconfig
→ checks if your configuration was successful. It should give you output similar to this (with your own name, e-mail address, and editor):
[user]
  name = Example Name
  email = example@example.com
[init]
  defaultBranch = main
[core]
  editor = nano
  autocrlf = input
[pull]
  rebase = false
  merge = true

The nano and vim text editors

If you configured nano or vim as your default text editor for Git (vim is the default editor on MaRC3a/JupyterHub): these are command-line editors, which means they open directly in the terminal (no new window opens). As with any tool, you just need to know how to operate them:

nano:
- when the editor opens for the commit message, first press Enter once to make a new line at the top.
- when you're done writing your commit message, save and close it: press CTRL X, confirm that you want to save by typing a capital Y, then press Enter.

vim:
- when the editor opens you first need to type i to change from command mode to insert mode
- make a new line at the top by pressing Enter
- write your commit message, then press Esc, type :wq, and press Enter to save and exit

SSH Key Generation

To communicate with GitLabTM, you need an SSH key linked to your account. You can follow these instructions to create one (login to the TAM GitLabTM required). Then go to the TAM GitLabTM and save the public SSH key in your account preferences: Your Profile -> Preferences -> SSH Keys -> Add new key. Finally, check the connection:

ssh git@tam-gitlab.online.uni-marburg.de
→ if the key is set up correctly, the response is "Welcome to GitLab, your-username".

SSH Key Generation on MaRC3a

Login to the JupyterHub and open a terminal.

1. Generate an SSH key: ssh-keygen -t ed25519
2. Display your public key and copy it: cat ~/.ssh/id_ed25519.pub
3. Open your SSH config file and add the host manually: vi ~/.ssh/config → the in-terminal text editor opens. Press i to enable insert mode.

Your file currently contains this:

# Added by Warewulf
Host *
    IdentityFile ~/.ssh/cluster
    StrictHostKeyChecking=no

Please add the following lines manually below it (keep the formatting and indentation):

 Host tam-gitlab.online.uni-marburg.de
   IdentityFile ~/.ssh/id_ed25519 

To exit the text editor, press Esc, type :wq, and press Enter.

4. Go to the TAM GitLabTM and save the public SSH key from step 2 in your account preferences: Your Profile -> Preferences -> SSH Keys -> Add new key.
5. Check connection:

ssh git@tam-gitlab.online.uni-marburg.de
→ if the key is set up correctly, the response is "Welcome to GitLab, your-username".
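
If the connection check fails, it can help to see which key files OpenSSH would offer to the GitLab host. The sketch below uses ssh -G, which prints the effective configuration for a host without connecting. The helper name show_identity_files and its optional second argument are made up for this example.

```shell
#!/bin/sh
# Sketch only: list the identity files OpenSSH would try for a host.
# show_identity_files is a hypothetical helper, not part of OpenSSH.
show_identity_files() {
    host="$1"
    config="${2:-$HOME/.ssh/config}"   # config file to evaluate (default: your own)
    # ssh -G prints one "option value" pair per line; keep only the identityfile entries.
    ssh -G -F "$config" "$host" | awk '$1 == "identityfile" { print $2 }'
}

show_identity_files tam-gitlab.online.uni-marburg.de || echo "could not read the SSH configuration"
```

With the configuration from step 3, the list should include both ~/.ssh/cluster and ~/.ssh/id_ed25519, because the Host * entry also applies to the GitLab host.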

Enable Git LFS

Git LFS stands for Large File Storage and is an open-source extension to Git that allows it to handle large / binary files efficiently. As we use our TAM GitLabTM not only for code development but also as the central data storage during active development of a project, we ask everyone to use Git LFS for their project. You don't need to learn any new Git commands for this; you just need to run the following setup once:

1. Download git-lfs. On MaRC3 / JupyterHub, Git LFS has already been installed. JupyterHub servers have it activated by default. If you connect to MaRC3 via terminal (ssh), you can enable Git LFS via its module: module load git-lfs.
2. Check installation: git lfs --version
3. Initialize git-lfs once for your username: git lfs install. This also applies to first usage on MaRC3 / JupyterHub.

If you run cat ~/.gitconfig again, you should see the following added to your Git configuration:

[filter "lfs"]
    smudge = git-lfs smudge -- %f
    process = git-lfs filter-process
    required = true
    clean = git-lfs clean -- %f

To learn more about Git LFS, check our FAQs. In the next chapters you will learn how to use Git LFS in the DataHub workflow.
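
As a short preview of what Git LFS changes in a repository, the sketch below tracks a file pattern in a throwaway repository and prints the resulting .gitattributes file. The pattern *.bin and the temporary repository are assumptions for illustration only; which files to track in the DataHub workflow is covered in the next chapters.

```shell
#!/bin/sh
# Sketch only: see which rule Git LFS writes when a file pattern is tracked.
if git lfs version >/dev/null 2>&1; then
    repo=$(mktemp -d)
    cd "$repo"
    git init -q .
    git lfs install --local >/dev/null   # set up LFS for this repository only
    git lfs track "*.bin" >/dev/null     # hypothetical pattern for large binary files
    cat .gitattributes                   # shows the rule that routes *.bin through LFS
else
    echo "git-lfs is not installed; see step 1 above"
fi
```

Files that match a tracked pattern are stored through the lfs filter shown in your Git configuration, while all other files are handled by Git as usual.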

Working with virtual environments

We recommend first setting up an environment using the environment and package manager conda. Creating environments means that you separate the work on your computer into isolated environments. Each environment contains only the software you need for that particular work, in exactly the versions you need. This is very useful because software updates can sometimes affect your previous work. See below for resources and tutorials on virtual environments.

If you want to work in an environment on your local computer, please skip steps 1-4 and start with step 5. Steps 1-4 only apply if you want to work on the MaRC3a cluster.

By the nature of the MaRC3a cluster, you're provided only with a simple Linux environment at the beginning and you need to first activate miniconda before you can use it. You don't need to do this if you're working on your local machine.

1. Go to the JupyterHub of the University of Marburg and open a terminal. Alternatively, you can connect directly via SSH to the MaRC3 cluster.
2. Make sure to be in your home folder: pwd should give you the output /home/your-username
3. Activate the module Miniconda for environment activation: module load miniconda (In front of your username you should now see (base))
4. Activate environment: source $CONDA_ROOT/bin/activate
5. Create a new empty environment for your project: conda create --name my-project
6. Activate this new empty environment: conda activate my-project (In front of your username you should now see (my-project))
7. Install the packages you need in this environment, like so: conda install -c conda-forge pandas==2.2.0
hint: you can create a requirements.txt file that lists all the packages you need and then load this file instead of installing every single package one after another.
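
The hint above could look roughly like this. The sketch writes a requirements.txt file with the package from step 7 and prints the conda command that installs everything listed in it. The temporary directory is an assumption for this demo; in a real project, the file would live in your project folder. The command is only printed, not executed, so that nothing is installed by accident.

```shell
#!/bin/sh
# Sketch only: collect package specifications in a file, one per line.
workdir=$(mktemp -d)
cat > "$workdir/requirements.txt" <<'EOF'
pandas==2.2.0
EOF

# conda install accepts such a file via its --file option.
install_cmd="conda install -c conda-forge --file $workdir/requirements.txt"
echo "Run this inside your activated environment:"
echo "$install_cmd"
```

Add further packages to the file as additional lines; the same command then installs all of them in one go.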

!!NOTE!! Environment reload:
MaRC3a setup: Every time you restart JupyterLab / reconnect to MaRC3, you need to go through steps 3, 4, and 6 again.
Local setup: Every time you have closed the terminal and want to continue working in this environment, you have to execute step 6 again.

In our FAQs you can find instructions on how to automate the installations described above.

Resources and tutorials on virtual environments

Everything set up? Great! Let's prepare your data in the next step!