FAQ and Troubleshooting
If you do not find the answer to your question here, you are welcome to reach out to the DataHub Team.
For general information and FAQ on research data management, you may also visit the website of the Hessian federal state initiative HeFDI or browse their published materials on Zenodo.
Important note:
The TAM GitLabTM for use within the DataHub is now available (Marburg University network / VPN required). We encourage all DataHub users to use the TAM GitLabTM to manage their research data and code as this facilitates exchange within the project.
This GitLabTM has replaced our local GIN platform, which was discontinued because further software development and sustainable operation were not foreseeable.
This FAQ also contains information on using the deprecated GIN platform and the DataLad tool, which both make use of git-annex to handle large data. This information has lost most of its relevance and will be removed in future versions.
Getting access
High performance compute and Jupyter_HPC
The DataHub offers an onboarding workflow which provides you with the necessary accounts, permissions and information. Please see our access page and register for our services.
To use HPC and Jupyter_HPC, you have to:
- Get a Marburg University (UMR) staff account,
- Get permission for the high performance compute cluster MaRC3,
- First usage: Change your shell to bash using this webform,
- Remote users: Connect via VPN with two-factor authentication (see access page).
- Connect via SSH (terminal or other application):
ssh -p 223 USERNAME@marc3a.hrz.uni-marburg.de
or via web-browser: https://jupyter-hpc.uni-marburg.de.
TAM DataHub Repository
The TAM DataHub Repository, which is based on the DSpace publication platform, is currently being set up for TAM to enable publishing data and metadata with open or restricted access and obtaining persistent identifiers (DOIs).
The service will start test operation shortly. While it is not yet available, you are very welcome to contact your local Data Stewards to select suitable publication platforms (see also RDM section of this manual).
Git
What is git and how does it work?
This is an overview answer to a question that could fill books if answered in detail. It focusses on some basic principles, on why git is useful beyond software engineering and on what to consider when using it for managing large amounts of data. Note that we offer comprehensive materials and workshops on using git and GitLabTM on the NOWA Website or via the NOWA School.
Git is a tool for distributed version management. Distributed means that every collaborator has a copy of the repository and its history. A git repository is a directory with all its contents (files & subdirectories) which is managed by git, meaning there is a special hidden .git folder that essentially contains the "history" of the repository. Changes are integrated explicitly by committing a new version of the whole repository to this history, and differences between collaborators are also integrated explicitly ("merged"). A git repository is not only able to store and reproduce sequential versions ("history") but also enables parallel version tracks called "branches". In a branch you could, for instance, check the influence of a changed parameter. With these concepts you gain great flexibility in organizing your work while always being able to revisit and reuse earlier stages.
Put simply, to store files and their versions, git initially saves the files and then records only the line-by-line differences at each commit. Git can thus reconstruct any committed state of a repository and highlight the changes. This works nicely for text files like programming code, descriptive text (like .txt and .md) or tabular data (like .tsv and .csv). However, for binary files like photos or videos (.jpg / .mp4), git cannot efficiently track the changes and it takes considerable computational power to reconstruct file versions. Therefore, extensions like Git LFS are used, which identify files by hash sums and always store a full copy of the file if a change is detected. As a consequence, it is not advisable to modify binary content arbitrarily often, as the repo's size would grow rapidly.
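The commit and branch concepts can be tried out in a throwaway repository. The following is a minimal sketch, assuming git is installed; the file name params.txt, the branch name try-higher-threshold and the demo identity are made up for illustration.

```shell
# Work in a temporary directory so that no existing repository is touched.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Demo User"            # hypothetical identity for the demo
git config user.email "demo@example.org"

# First commit: the initial state of a (made-up) parameter file.
printf 'threshold = 0.5\n' > params.txt
git add params.txt
git commit -q -m "Add initial parameters"
base_branch="$(git rev-parse --abbrev-ref HEAD)"   # name of the default branch

# Parallel track: change the parameter on a separate branch.
git checkout -q -b try-higher-threshold
printf 'threshold = 0.8\n' > params.txt
git commit -q -a -m "Raise threshold to 0.8"

# Show the line-by-line difference between the two branches.
git --no-pager diff "$base_branch" try-higher-threshold -- params.txt

# Return to the earlier state; the file content follows the checked-out branch.
git checkout -q "$base_branch"
cat params.txt
```

Both versions of params.txt remain in the history, so either branch can be revisited or merged later.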
Issue: Unprotected private keys are ignored
Error:
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: UNPROTECTED PRIVATE KEY FILE! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Permissions 0755 for '/home/USER_NAME/.ssh/id_ed25519' are too open.
It is required that your private key files are NOT accessible by others.
This private key will be ignored.
Load key "/home/USER_NAME/.ssh/id_ed25519": bad permissions
git@gitlab.uni-marburg.de: Permission denied (publickey).
Solution:
Linux
- Open a terminal window and restrict access to the private key so that only your user can read and write it:
chmod 600 ~/.ssh/KEYNAME
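To see the effect of the command on Linux without touching a real key, the change can be reproduced on a temporary file. This is a small sketch, assuming GNU stat is available (as on most Linux systems); the variable names are arbitrary.

```shell
# Demonstration on a throwaway file, so that no real key is modified.
demo_key="$(mktemp)"
chmod 755 "$demo_key"                     # reproduce the "too open" state from the error
before="$(stat -c '%a' "$demo_key")"      # GNU stat prints the octal permissions
chmod 600 "$demo_key"                     # owner may read and write, nobody else
after="$(stat -c '%a' "$demo_key")"
echo "before: $before, after: $after"
rm -f "$demo_key"
```

For a real key, the same stat call on ~/.ssh/KEYNAME should report 600 after the fix.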
Windows (Win10)
- Navigate to private key (typically at C:\Users\USER_NAME\.ssh\KEYNAME)
- Open file properties, select security panel and click "Advanced" button to open the file permission panel.
- Disable inheritance and convert to explicit permissions.
- Remove all permissions except for the current user.
- Apply changes and quit.
Figure: Restricting permissions for private SSH key on Windows. Workflow adapted from superuser.com.
GitLabTM
There are multiple instances of GitLabTM. Which one should I use?
We ask all DataHub users to use the TAM GitLabTM instance to store and manage their research data.
The TAM GitLabTM builds on TAM-owned resources in the high performance infrastructure of MaSC and MaRC which allows us to offer a substantial amount of already paid storage in an environment with elevated security. As there are now multiple instances of GitLabTM, we particularly ask you to use this TAM GitLabTM but NOT THE MARBURG UNIVERSITY GITLABTM for research data that is in the scope of the DataHub. If you already have such data on the Marburg University GitLabTM, please move it to our TAM GitLabTM.
Can I use DataLad to manage a repository on GitLabTM?
That is possible. As many DataLad commands will work also on pure git repositories, you are free to use DataLad or git commands. However, please be aware that these repositories and their remote origin at GitLabTM CANNOT USE GIT-ANNEX!
Please use Git LFS if you store large files on GitLabTM.
Git LFS
What is Git LFS?
Git LFS stands for Large File Storage and is an open source extension to git which allows it to handle large / binary files efficiently. Without such an extension, repositories easily bloat and become impractical to use. Git LFS solves the problem by redirecting large files to a separate storage space and letting git handle only references to the LFS files, which keeps the repo small.
You can define individual files, filetypes or directories which shall be managed this way.
How to install Git LFS?
You can install the latest Git LFS client as described on git-lfs.com by downloading a current installer or via the PackageCloud repository.
Depending on your operating system, you may also use its native package manager. Please be aware that you may install older Git LFS versions this way!
- Mac OS:
brew install git-lfs
- Ubuntu / Debian:
sudo apt install git-lfs
On MaRC3 / JupyterHub, Git LFS has already been installed and enabled as a default module; JupyterHub servers have it activated by default.
Be aware that if you purge all modules, you might need to enable the Git LFS module again: module load git-lfs. Otherwise, your files will not be forwarded to the LFS store and may bloat your repo!
To check successful installation / activation, you can use git lfs --version.
Afterwards Git LFS has to be initialized once for your user: git lfs install. This also applies to first usage on MaRC3 / JupyterHub.
How to track large files with Git LFS?
For each git repository, you can now define which files shall be handled by Git LFS, for instance:
- files:
git lfs track "07_disseminations/myManuscript.pdf"
- directories:
git lfs track "03_data/001_myExperiment/*"
- file types:
git lfs track "*.pdf" "*.png"
To summarize these rules, you can run git lfs track without arguments. With git lfs ls-files you can list all LFS files of a repo. The tracking patterns are written to the .gitattributes file in the root of each git repo, and you can also inspect or edit this file directly.
Negative patterns like in .gitignore are not supported by Git LFS (docs).
It is important to think about the file types and paths before committing and pushing large files. Otherwise, they will be tracked by git and will require specialized tools to rewrite the commit history and migrate them to LFS.
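Before committing, you can ask git itself which paths a tracking pattern would cover. The sketch below assumes only git (not Git LFS) is installed, so it writes the .gitattributes line that git lfs track "*.pdf" produces by hand; the temporary repository and README.md are illustrative.

```shell
# Temporary repository for the demonstration.
repo="$(mktemp -d)"
cd "$repo"
git init -q

# This is the line that `git lfs track "*.pdf"` writes to .gitattributes.
printf '*.pdf filter=lfs diff=lfs merge=lfs -text\n' > .gitattributes

# Ask git which filter applies to some paths (they do not need to exist yet).
git check-attr filter -- 07_disseminations/myManuscript.pdf README.md
```

Paths reported with "filter: lfs" would go to the LFS store, while "filter: unspecified" means the file would be stored in the regular git history.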
Limitations
- Currently, Git LFS does not allow tracking files based on their size (e.g. >1MB).
- On Windows, git before 2.34.0 does not handle files in the working tree larger than 4 gigabytes. Newer versions of git, as well as Unix versions, are unaffected.
- If not all contributors have Git LFS installed, untouched objects might be shown as modified.
How to manage previously committed files with Git LFS?
You may have existing repos with binary files which are managed by git. For better performance and stability, you may want to use Git LFS not only for new binary files, but also for the old ones. Alternatively, you may have unintentionally committed files to the git repo which should have gone to the LFS store. This might happen, for instance, if you don't enable the git-lfs module on the MaRC3 cluster. You might also have already committed an additional copy of those files to the LFS store while one version of the files remains in the git history.
Git LFS comes with a migration tool which essentially rewrites your git history and enables you to migrate files between the git repo and the LFS store as if they had always been there. It also updates the Git LFS tracking pattern in the .gitattributes file if necessary.
Say you want to track all jpg and png files with Git LFS and replace the existing files with Git LFS pointers:
- Migrate locally (change the file pattern to your needs):
git lfs migrate import --include="*.jpg,*.png" --everything
- If you use a remote repository on GitLabTM, overwrite it with your local changes. Be aware that this may cause conflicts if collaborators have already pulled the commits that are being rewritten.
git push --all --force
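To decide which file patterns are worth migrating, it can help to list the largest files stored in the git history first. The following sketch uses plain git commands in a temporary demo repository; the file names, sizes and demo identity are made up, and the awk step assumes paths without spaces.

```shell
# Temporary repository with one "large" binary placeholder and one text file.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Demo User"            # hypothetical identity for the demo
git config user.email "demo@example.org"
head -c 2048 /dev/zero > figure.png         # 2048 zero bytes as a stand-in for an image
printf 'notes\n' > README.md
git add .
git commit -q -m "Add demo files"

# List blobs reachable from any ref, largest first: size in bytes, then path.
largest="$(git rev-list --objects --all \
  | git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' \
  | awk '$1 == "blob" { print $2, $3 }' \
  | sort -rn \
  | head -n 5)"
echo "$largest"
```

In a real repository, run only the pipeline at the end; file types that dominate the list are candidates for the --include pattern.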
How to use the Git LFS locking feature?
With a relatively new feature of Git LFS, you can prevent others from concurrently modifying files you intend to edit. The feature is not limited to LFS files.
- Defining files as lockable will remove write permissions from these files in the working directory: git lfs track "*.jpg" --lockable
- To edit those files, you need to lock them (for others): git lfs lock FILENAME.ext.
- To get a list of currently locked files and corresponding users, use git lfs locks.
- Other users will be blocked from locking an already locked file.
- If you have locked and modified a file, you must first commit changes before you can unlock it again. Use git lfs unlock for this.
Note that despite using this feature, you may still produce merge conflicts on lockable files. To prevent this:
- Pull changes before you lock files to edit them.
- If you have committed a change of a locked file, push it before you unlock the file again.
How to solve merge conflicts on LFS files?
Merge conflicts arise when conflicting versions of files from different sources need to be integrated. For binary files, these cannot be solved by a line-by-line integration but only by telling git which file you want to proceed with. It is thus advisable to avoid merge conflicts like the one shown below by communicating with collaborators (and using the previously described locking feature).
Creating a merge conflict
- Both 'a' and 'b' have a repo with an LFS file, a jpg image for instance.
- They each modify the file differently and commit their changes.
- 'b' pushes to origin and afterwards 'a' pulls from there, so that 'a' has to solve a conflict between the local and the remote commit from 'b'.
- In 'a's working directory, Git LFS replaces the file by an lfs pointer file in which the differences (SHA and file size) are shown.
Solving the conflict
- The pointer file is manually modified to point to one of the versions, for example to 'our' version (modification by 'a'), by deleting the SHA and file size of the remote version (modification from 'b') plus all git markers and saving the file. The merge commit is pushed to origin. GitLabTM recognizes the changed SHA reference and now shows a preview of the modification by 'a'.
- The working directory of 'a' still holds the pointer instead of the jpg image. With git lfs pull, the pointer is again replaced by the referenced file in the working directory.
- From the perspective of 'b', a simple git pull will update the working directory with the version from 'a' without showing the pointer files.
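The manual edit of the pointer file can be illustrated on a fabricated example. The sketch below is not a Git LFS feature; it only mimics the edit with awk on a made-up conflicted pointer. The oids are shortened placeholders, and the sizes and the label REMOTE_COMMIT are invented.

```shell
# Fabricated pointer file with conflict markers, as 'a' would see it after pulling.
workdir="$(mktemp -d)"
cat > "$workdir/image.jpg" <<'EOF'
version https://git-lfs.github.com/spec/v1
<<<<<<< HEAD
oid sha256:aaaa0000
size 48213
=======
oid sha256:bbbb1111
size 51877
>>>>>>> REMOTE_COMMIT
EOF

# Keep "our" side: drop the marker lines and everything between ======= and >>>>>>>.
awk '
  /^<<<<<<< / { next }
  /^=======$/ { skip = 1; next }
  /^>>>>>>> / { skip = 0; next }
  !skip       { print }
' "$workdir/image.jpg" > "$workdir/image.jpg.resolved"

cat "$workdir/image.jpg.resolved"
```

The resolved pointer keeps only the version line plus the oid and size of the modification by 'a'. In a real repository, the edited pointer replaces the conflicted file before the merge is committed.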
How are git-annex and Git LFS related?
Both tools serve the same general purpose in enabling git to manage large files. Git-annex is generally considered the more versatile but complex tool while Git LFS is often considered to be easier to use but lacks some features that git-annex has.
Git-annex handles specified large files by storing them as "annex-objects" and replacing them with symlinks in the working tree. If you need to modify them, they can be replaced with a copy of the actual files and be unlocked. Git-annex tracks the availability of copies of your files in different locations and allows you to "drop" local file content while preserving the surrogate symlink and the availability information. Git-annex is supported by DataLad and a few repositories like GIN.
Git LFS also uses a storage mechanism separate from the main git repo and replaces the files so that git will manage only small text pointers to the files instead of the actual files themselves. It lacks the ability to drop files like git-annex but has other features, like locking files for other users while they are modified locally, which prevents merge conflicts on binary files. It is supported by major git services like GitLabTM and GitHub.
The operational mechanisms of Git LFS and git-annex are distinct, and we are not aware of a generally applicable, easy way to migrate binary data between both data management solutions. So please choose carefully.
High Performance Computing (MaRC3, MaSC)
Is there a Manual for using the MaRC3?
Yes, see here. However, access to the documentation is restricted to the Marburg University network (incl. VPN) and requires login with staff account name & password (no 2FA).
The website of the Competence Center for High Performance Computing in Hessen (HKHLR) summarizes cluster introductions, classes and tutorials on HPC topics. For the MaRC3, there are biannual cluster introductions, held by René Sitt. Slides of a past presentation can be downloaded here (10.11.2022, login required).
How can I transfer files to and from MaRC3?
You have multiple options for file transfer including:
- A central GitLabTM repository using Git LFS. This is the generally recommended way, as it helps to keep data organized, version controlled and thus transparent.
- A JupyterLab session via JupyterHub. Jupyter gives you a graphical interface to upload individual files e.g. to your home directory or download individual files from there to your local computer.
- The scp command. Secure Copy is a simple command which establishes a secured connection to a remote location like MaRC3 and allows transferring files and directories in both directions:
  - General usage:
    scp -r -P PORT_NUMBER /SOURCE_DIRECTORY/ USER@HOST:/DESTINATION_DIRECTORY/
    Note: the -r flag is required to transfer directories instead of individual files.
  - MaRC3 examples:
    scp -r -P 223 SOURCE USER@marc3a.hrz.uni-marburg.de:DESTINATION
    scp -r -P 223 USER@marc3a.hrz.uni-marburg.de:SOURCE DESTINATION
    Replace SOURCE by your source files or directories, USER by your username (staff account) and DESTINATION by the desired remote / local destination (e.g. your home directory ~).
JupyterHub
General FAQ on JupyterLab can be found at the online documentation of the Jupyter project.
Which JupyterHub should I use?
There are multiple Jupyter services (instances of JupyterHub). The University of Marburg alone offers our Jupyter_HPC and Jupyter_UMR. Other service providers may offer additional instances with varying conditions, configurations and performance.
The DataHub can only provide specific support for the Jupyter_HPC service. In contrast to Jupyter_UMR, it offers far more powerful hardware configurations, most of the software available from the MaRC3 and interaction with file systems of the MaRC3 environment. On top of this, it is streamlined for our git workflows in the DataHub. For instance, Git and Git LFS are available and enabled by default. Jupyter_UMR, however, has a more general use case which also includes teaching.
Can I use Jupyter_HPC for sensitive data processing?
No, sadly not.
Neither Jupyter_HPC nor Jupyter_UMR qualify for processing of sensitive personal data. For Jupyter_HPC, the MaRC3 terms of use do not permit this as sensitive personal data must not be processed on shared nodes like the one used for Jupyter.
If you would like to use Jupyter with sensitive data, we currently have to recommend a local Jupyter installation on your workstation.
Workflow
How can I speed up the workflow?
There may be some repetitive elements in your workflow, like activation of certain modules, a specific software environment or unlocking credentials on each login (see security section). If you are using the bash shell (default for many Linux systems, MaRC3, and available via gitbash for other operating systems), you can edit the bash configuration file .bashrc. Restarting the CLI is required to apply the changes.
Example: Loading specific software packages after each login.
Requirements: You are working on MaRC3 with bash shell.
You may want to load the miniconda module and activate a certain virtual environment by default (replace "XYZ"). To do so, open the configuration file in a CLI editor like nano (nano ~/.bashrc) and insert the following:
# Load conda and a private environment
module load miniconda
source $CONDA_ROOT/bin/activate
conda activate XYZ
Example: Aliases in .bashrc
Requirements: You are using the bash shell (CLI).
Aliases are labels which can be used as a surrogate for a longer and more complex expression. You may want to define an alias to quickly connect to MaRC3: Add the following to your .bashrc and replace USERNAME with your staff-account:
alias ssh_hpc='ssh -p 223 USERNAME@marc3a.hrz.uni-marburg.de'
Security
Passphrases on ssh keys and how to manage them
It is good practice to use passphrases on ssh keys because it adds an extra layer of security by encrypting your private key file. If someone got hold of your private key, it would still have to be decrypted first. The downside is that you also need to provide the passphrase whenever you use the private key (e.g. git fetch), which is very inconvenient.
The tool ssh-agent allows you to unlock the private key just once per session and keep it in memory for further usage. To use the tool in the HPC context, it is most convenient to add these lines to your .bashrc file. If you have multiple keys in your ~/.ssh directory, you may want to add the path to the specific private key file to the command: ssh-add PATH/TO/KEY.
# Activate ssh-agent and add key
eval "$(ssh-agent)"
ssh-add
In the MaRC3 HPC environment you may want to add this to your .bashrc. On your local machine you can use for instance the KeePassXC password manager to unlock your private key via ssh-agent.
How to verify checksums
When you download software from the internet, you are often provided with a hash sum for the package / installer. With these checksums you can verify the file's identity and integrity. Although it provides no absolute protection, it is good practice to compare the checksum of the downloaded file against the provided checksum.
Depending on your operating system, you have different (command line) tools to calculate a checksum for any file:
- Windows:
Get-FileHash FILE_NAME
- Linux:
sha256sum FILE_NAME
- Mac:
shasum -a 256 FILE_NAME
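The comparison itself can be scripted instead of checking the long strings by eye. The following sketch for Linux uses an empty temporary file as a stand-in for a download, so that the expected value is known; in practice, the expected checksum is copied from the provider's website. The file name installer.bin is hypothetical.

```shell
# Empty stand-in for a downloaded file (illustration only).
workdir="$(mktemp -d)"
download="$workdir/installer.bin"
: > "$download"

# Checksum as it would be published by the provider. This value is the
# SHA-256 of empty input, which matches the stand-in file above.
expected="e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"

# Calculate the checksum of the file and compare it to the published one.
actual="$(sha256sum "$download" | cut -d ' ' -f 1)"
if [ "$actual" = "$expected" ]; then
  result="match"
else
  result="MISMATCH"
fi
echo "checksum $result"
```

If the result is a mismatch, the file should not be installed and the download should be repeated from the original source.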
Other Tools and Platforms
Environment virtualization with anaconda and miniconda and venv
Anaconda and its light-weight alternative Miniconda allow you to manage software packages and virtual environments. Depending on your working environment and its operating system you may need to install software without having administrator / root privileges. It also enables you to work in isolated software environments for different projects. A similar, purely python-based toolset is venv (virtual environment management) and pip (package installer for Python). More information on environment virtualization is provided on the NOWA School website.
If conda is not yet available on your computer, you will find the installers for all common systems plus a user guide at the conda website.
Hints:
- Activating an environment by default: see the .bashrc example in the Workflow section.
- Listing private environments:
conda info --envs
- Deactivating environments:
conda deactivate
- Removing unused environments:
conda remove -n ENV_NAME --all
DataLad
DataLad in the DataHub
Using DataLad was recommended for interacting with the DataHub's previous storage management and data sharing platform GIN. Since we have switched to GitLabTM, which uses a different mechanism to manage large files (Git LFS), DataLad has moved out of focus for the DataHub.
What is DataLad?
DataLad is a free and open source distributed data management system that:
- keeps track of your data,
- creates structure,
- ensures reproducibility,
- supports collaboration,
- and integrates with widely used data infrastructure.
DataLad is a Python-based tool that is compatible with all major operating systems. Using Git and git-annex, it allows you to version control arbitrarily large files in datasets, without the need for third party services.
A DataLad dataset is a directory with files, managed by DataLad. You can link other datasets, known as subdatasets, and perform commands recursively across an arbitrarily deep hierarchy of datasets (see figure below). This helps you to create structure while maintaining advanced provenance capture abilities, versioning, and actionable file retrieval.
Figure 1: Managing research data with linked datasets (taken from the DataLad Handbook).
DataLad lets you consume datasets provided by others, and collaborate with them. You can install existing datasets and update them from various sources, or create sibling datasets at GIN that you can push updates to and pull from.
Basic Usage
Although DataLad has extensive functionality, you can quickly start using it with a small set of essential commands like:
- create a new local dataset
- clone or install existing datasets from a remote storage
- create-sibling: create remote copies of datasets called "siblings"
- get content on-demand
- run compute jobs in a reproducible way
- drop content if no longer needed
- save changes to datasets
- push changes to a remote storage.
Summarized from datalad.org.
GIN
GIN was the DataHub's designated central platform for storage, distribution and publication of research data. As the development of the platform has been discontinued and sustainable operation could not be guaranteed, our local GIN service has been replaced by the TAM-internal GitLabTM.