1) First Class: Reproducibility and Git#

Today:

  1. Reproducibility in computational sciences

  2. Git

  3. Package environments and managers
    3.1. Julia
    3.2. Python

  4. IDEs or editors

1. Reproducibility in computational sciences#

What is reproducible code?#

“But it works on my machine!”

Code is reproducible if someone is able to easily re-run it and get the same results.

Why might code not be reproducible:

  • software or packages are not installed

  • software uses different versions

  • different configurations (environment variables, optimization flags)

  • differences between platforms: operating systems (Windows, Linux, MacOS), architectures (x86, ARM, GPUs)

Some of these are unavoidable (e.g. going from working locally on a Mac to the Linux cluster), but we can do our best to minimize them.

  • we’re not aiming for “bit-for-bit” reproducibility

How can we make our code reproducible?#

We need to supply both the code, and the “environment” used to run the code, and we need to keep track of it as we run experiments.
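One way to keep track of the code and environment behind each run is to write a small provenance file next to the results. The sketch below is only an illustration: the results directory and the run_info.txt file name are assumptions, not part of the course material, and what is worth recording depends on your project.

```shell
#!/usr/bin/env bash
# Sketch: record which code version and platform produced a set of results.
# "results" and "run_info.txt" are hypothetical names chosen for illustration.
set -eu

outdir="results"
mkdir -p "$outdir"

# Commit hash of the code, or "unknown" when git or a commit is not available.
commit="$(git rev-parse --verify -q HEAD 2>/dev/null || echo unknown)"

{
  echo "date:     $(date -u +%Y-%m-%dT%H:%M:%SZ)"
  echo "commit:   $commit"
  echo "platform: $(uname -s) $(uname -m)"
} > "$outdir/run_info.txt"

cat "$outdir/run_info.txt"
```

Keeping this file with the output makes it possible to check out the same commit later and compare environments when a result cannot be reproduced.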

2. Git#

Introduction to version control with git#

Version control software (VCS), also referred to as source code management (SCM) or revision control system (RCS) tools, tracks and manages changes to source files. Version control, also known as source control, is the practice itself; version control systems are the software tools that help teams of contributors manage changes to source code over time. There are several options: git, CVS, SVN, etc.

In this lecture, we will focus on git in particular.

Tip

First of all, if you want to install git on your machine, follow this GitHub guide.

Tip

If you are unfamiliar with git and version control, the following guides can be helpful:

We provide a brief guide here.

GitHub Account#

Tip

If you don’t have a GitHub account, follow these instructions from the SDSU Research & Cyberinfrastructure website and link it to your SDSUid.

  • Use a personal email account rather than the SDSU one, so that you won’t have trouble accessing your GitHub account in the future.

  • Choose your username wisely! Most likely you will use this again in professional settings in your career.

Identity#

First make sure git knows your name and email address:

$ git config --global user.name "Your full name"
$ git config --global user.email "your_email_address@example.com"

Community projects#

  • There are several open-source, community projects hosted on major platforms like GitHub, GitLab or BitBucket. These are collaborative projects, often led or maintained by volunteers or not-for-profit organizations.

  • Here are a few I contributed to:

  • You may want to explore a few of these online communities; if you see something to improve in their repositories, you can propose your own changes.

  • Before proposing any contributions, when checking out open-source, community projects, make sure to read about their coding conventions or style guides

  • If you recognize the need for a feature that you and others may use in your work, please consider contributing this feature yourself, starting by opening an Issue if one associated with this feature is not already open.

  • Readability and accessibility are very important for making a codebase inclusive. For instance, snake_case or CamelCase are more accessible than nocase for people who have vision problems.

Forks and branches#

Create your own fork of the desired repository and check out your copy. For example:

$ git clone https://github.com/<your_username>/spring25.git
$ cd spring25
$ git remote add upstream https://github.com/sdsu-comp605/spring25.git

What is the difference between using the https or the ssh protocol? With https, the primary downside for most people is that you must enter your credentials (a personal access token on GitHub) every time you push. Credentials can be cached, but by default the cache is not permanent (though this can be changed). With SSH keys, git uses the key file on disk every time, so you don’t have to enter a token at every push. Check this guide to generate SSH keys and add them to the ssh-agent.

Now you have two remote repositories: origin, which is your fork (hosted on your cloud account), and upstream, which is the main sdsu-comp605/spring25 repository.

Contributors who belong to the organization that owns an open-source project generally do not fork the repository; they clone it directly to their local machine. If you are not sure about the differences between forking and cloning, check this helpful article.

Create a branch for your feature; this will hold your contribution:

$ git checkout -b <branchname>

Basic steps: checking out a remote repository and get updates#

  • When you start working on a new feature branch, make sure you start from main by running: git checkout main.

  • When you create a new branch and check it out, as in git checkout -b <branchname>, a common convention is to make branchname something along the lines of <yourname>/<short-description>.

  • Every time you want to get updates from a remote repository, use git fetch or git pull.

  • Difference between git fetch and git pull: git fetch updates your remote-tracking branches. This operation is safe to run at any time since it never changes any of your local branches. git pull brings a local branch up-to-date with its remote version, while also updating your other remote-tracking branches. In the simplest terms, git pull does a git fetch followed by a git merge.

  • If you are working in a very large collaborative environment in which a lot of people are pushing their work every day using different branches, you may want to run git fetch with the --prune option (e.g., git fetch --prune) before pulling, so that remote-tracking branches whose remote counterpart has been deleted are removed from your local copy.

  • To download upstream changes and update the copy of the repo on your local machine, use git fetch followed by git merge, or git pull (which, again, combines git fetch and git merge in one single command, so it’s more convenient).

    • Be careful: git may refuse to merge, or the merge may produce conflicts, if you have uncommitted changes locally. Always commit (or stash) your work and have a clean git status before attempting merges.
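The difference between git fetch and git pull can be seen in a small local experiment. The sketch below is an illustration under stated assumptions: a bare repository in a temporary directory stands in for the remote on GitHub, and the clone names a and b, the file notes.txt, and the demo identity are all hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: fetch updates the remote-tracking branch; merge updates the local branch.
set -eu
cd "$(mktemp -d)"

# Placeholder identity so that commits work in a clean sandbox.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

# A bare repository plays the role of the remote.
git init -q --bare remote.git
git -C remote.git symbolic-ref HEAD refs/heads/main

# Clone "a" makes the first commit on main and pushes it.
git clone -q remote.git a 2>/dev/null   # silence the "empty repository" warning
git -C a symbolic-ref HEAD refs/heads/main
echo "v1" > a/notes.txt
git -C a add notes.txt
git -C a commit -q -m "Add notes"
git -C a push -q origin main

# Clone "b" starts from that state; then "a" pushes a second commit.
git clone -q remote.git b
echo "v2" >> a/notes.txt
git -C a commit -q -am "Update notes"
git -C a push -q origin main

# In "b": fetch moves origin/main, but local main stays where it was.
git -C b fetch -q origin
before="$(git -C b rev-parse main)"
remote_tip="$(git -C b rev-parse origin/main)"

# Merging origin/main is the second half of what git pull would do.
git -C b merge -q origin/main
after="$(git -C b rev-parse main)"

echo "main before merge: $before"
echo "origin/main:       $remote_tip"
echo "main after merge:  $after"
```

After the fetch, main and origin/main differ; after the merge they point to the same commit, which is the state a single git pull would have produced.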

Commit and contribute your own changes#

  • Once you have made any change to any file in the repository (e.g., you might have written some code or documentation, or updated files such as License, Contributing, etc.) inspect these changes by running git status.

  • Commit all changes to tracked files with git commit -a, or first stage selected files via

    • git add <file1> <file2> (or more briefly, git add -u to stage all updated tracked files at once) followed by git commit.

  • New (untracked) files are not picked up by git commit -a or git add -u; stage them explicitly with git add <file> before committing.

  • Push feature branch to the remote for review: git push origin <branchname>

  • Use descriptive, useful commit messages (you can either do this via the text editor or directly at the command line via git commit -m).
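The staging and commit steps above can be tried safely in a throwaway repository. This is a minimal sketch, assuming a temporary directory; the file names analysis.py and notes.tmp and the demo identity are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: inspect, stage, and commit changes in a scratch repository.
set -eu
cd "$(mktemp -d)"
git init -q
git config user.name "Demo User"          # local placeholder identity
git config user.email "demo@example.com"

echo "print('hello')" > analysis.py       # a new, untracked file
git status --short                        # lists analysis.py as untracked (??)
git add analysis.py                       # new files must be staged explicitly
git commit -q -m "Add analysis script"

echo "print('done')" >> analysis.py       # modify a tracked file
echo "scratch" > notes.tmp                # another untracked file
git add -u                                # stages analysis.py only
git commit -q -m "Print completion message"

git log --oneline                         # two commits
git status --short                        # notes.tmp is still untracked
```

The second commit contains only the change to analysis.py, because git add -u considers files that git already tracks.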

Useful tips#

When cleaning up your local branches, some of the following commands might be useful:

  • Show local and remote-tracking branches: git branch -a.

  • Show available remotes: git remote -v.

  • Show all branches available on remote: git ls-remote. Use git remote show origin for a complete summary.

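  • Delete a local branch: git branch -d <branchname> refuses to delete a branch that has not been merged; git branch -D <branchname> forces the deletion (use it only after the merge to main is complete).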

  • Delete remote branch: git push origin :<branchname> (mind the colon in front of the branch name).
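The caveat about deleting a local branch only after it has been merged can be checked with the lowercase -d option, which, unlike -D, refuses to delete unmerged work. The sketch below assumes a scratch repository; the branch name alice/feature and the file f.txt are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: git branch -d protects unmerged branches; after a merge it succeeds.
set -eu
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/main     # name the initial branch main
git config user.name "Demo User"          # local placeholder identity
git config user.email "demo@example.com"

echo "a" > f.txt
git add f.txt
git commit -q -m "Initial commit"

git checkout -q -b alice/feature
echo "b" >> f.txt
git commit -q -am "Work in progress"
git checkout -q main

# The branch is not merged into main yet, so -d refuses to delete it.
if git branch -d alice/feature 2>/dev/null; then
  echo "deleted"
else
  echo "refused: alice/feature is not merged yet"
fi

git merge -q alice/feature                # fast-forward main
git branch -d alice/feature               # now the deletion succeeds
```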

Additionally, when debugging or inspecting the code for some potentially problematic changes introduced, some of the following commands can be used:

  • Show logs: git log, or more advanced, git reflog (the latter includes any rebase and commit reshuffling you might have done).

  • Show logs for file or folder: git log <file>.

  • Show changes for each log: git log -p (add file or folder name if required).

  • Show diff with current working tree: git diff path/to/file.

  • Show diff with other commit: git diff <SHA1> path/to/file.

  • Compare version of file in two commits: git diff <SHA1> <SHA1> path/to/file.

  • Show changes that are in main, but not yet in my current branch: git log HEAD..main (or, for short, git log ..main).

  • Discard changes to a file which are not yet committed: git checkout <file>. (If the file was already staged via git add <file>, then use git restore --staged <file> first to unstage it, and then git checkout <file> to discard local changes).

  • Discard all changes to the current working tree: git checkout -f.
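Discarding a change that was already staged takes two steps: unstage it, then restore the file in the working tree. The sketch below shows this in a scratch repository; the file name data.txt is hypothetical, and git restore assumes a reasonably recent git (2.23 or later).

```shell
#!/usr/bin/env bash
# Sketch: undo a staged edit in two steps.
set -eu
cd "$(mktemp -d)"
git init -q
git config user.name "Demo User"          # local placeholder identity
git config user.email "demo@example.com"

echo "original" > data.txt
git add data.txt
git commit -q -m "Add data"

echo "experiment" > data.txt              # an edit we decide to throw away
git add data.txt                          # the edit is now staged
git restore --staged data.txt             # unstage; the edit stays in the working tree
git checkout -- data.txt                  # discard the edit in the working tree

cat data.txt
git status --short                        # prints nothing: the tree is clean
```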

Push changes to your own repository#

If you want your changes on your local machine to go onto your online repository, while sitting on the main branch, do

git push origin main

However, it is not a good practice to work off the main branch.

Contributing to remote repositories#

To propose your contribution to any community project you want to create a so-called Pull Request (PR) in GitHub lingo or a Merge Request (MR) in GitLab lingo. The repository maintainers will review your proposed changes, make comments if needed, request further changes and reject or approve your contributions. See this GitHub documentation page on creating Pull Requests.

Code reviews#

Code reviews are possibly the most under-rated service in open-source/community projects. The best way to get familiar with a code base is reviewing other contributors’ code.

Exercise 1.1

Initialize a git repository (you may call it test) via git init

Often, when we run a numerical experiment/example, there may be output files generated (plots, datasets, etc). We usually don’t want to store the output in the repository, so we can tell git to ignore them. There are a couple of ways we can specify this:

  • .gitignore: this is specific to this repository, and is checked into the repository

    • output or temporary files that are specific to this experiment

  • ~/.config/git/ignore: this is specific to your machine

    • temporary files used by editors (emacs, vim) or more sophisticated Integrated Development Environments (IDEs) like VSCode

    • .DS_Store files on Mac

  • .git/info/exclude: specific to this repository and your machine (not checked into git)

    • occasionally useful

Files are specified according to patterns, e.g.

# comments must be on their own line; ignore the settings directory created by VS Code
.vscode/
# a preceding bang negates a pattern: `test.nc` is not ignored even if an earlier pattern matches it
!test.nc

See Git manual: gitignore for more information
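To check how a set of patterns behaves before relying on it, you can ask git which untracked files it would still offer to add and which it ignores. The sketch below is an illustration only: the *.nc pattern and the file names run1.nc and settings.json are assumptions added for the demonstration.

```shell
#!/usr/bin/env bash
# Sketch: try out .gitignore patterns in a scratch repository.
set -eu
cd "$(mktemp -d)"
git init -q

cat > .gitignore <<'EOF'
# ignore all NetCDF output files
*.nc
# but keep this one reference file
!test.nc
# ignore the editor settings directory
.vscode/
EOF

touch run1.nc test.nc
mkdir .vscode
touch .vscode/settings.json

echo "Untracked files git would offer to add:"
git ls-files --others --exclude-standard

echo "Ignored files:"
git ls-files --others --ignored --exclude-standard
```

The order of the patterns matters: the negated pattern has to come after the pattern it overrides.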

Exercise 1.2

Set up your .gitignore file in this test/ directory

Don’t lose your output#

If you run many numerical experiments, you don’t want to overwrite your previous results (maybe that previous experiment was actually better than our most recent one!).

  • A good practice for managing the output of experiments and avoiding overwritten output is to concatenate the parameters as strings in the output folder name, so that each experiment creates a unique output folder. This also facilitates bookkeeping.

    • But what if you want to repeat the same experiment with exactly the same parameters? This will overwrite the output folder named after the concatenated parameter string as well. Some people prefer using a time stamp (so that you are sure it is unique), but then it can be hard to see and remember which experiment a given folder belongs to. Some other people use a combination of these two methods (time stamp + concatenated parameters string).

    • There is no right way of doing this, as long as you find what works best for your own bookkeeping and are able to easily fetch an old result when needed (if you do research, sometimes you may be asked to show results from a couple of years ago!)
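The combined approach (parameters plus time stamp) can be sketched in a few lines of shell. The parameter names nx, dt and scheme and the output/ base directory are hypothetical; adapt them to your own experiment.

```shell
#!/usr/bin/env bash
# Sketch: build a unique output folder name from parameters and a time stamp.
set -eu

# Hypothetical parameters of a numerical experiment.
nx=128
dt=0.01
scheme="rk4"

timestamp="$(date +%Y%m%d-%H%M%S)"
outdir="output/nx${nx}_dt${dt}_${scheme}_${timestamp}"

mkdir -p output
# Plain mkdir fails if the folder already exists, so earlier results
# are never silently overwritten.
mkdir "$outdir"

echo "writing results to $outdir"
```

The parameter part keeps the folder name readable, while the time stamp keeps repeated runs with identical parameters apart.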

Suggested readings and resources#

3. Package environments and managers#

3.1 Julia#

Tip

If you haven’t already, download and install Julia.

Julia is a relatively new programming language. Think of it as MATLAB done right, open source, and fast. It’s nominally general-purpose, but mostly for numerical/scientific/statistical computing. There are great learning resources. We’ll introduce concepts and language features as we go.

The Julia REPL#

Julia has an interactive command-line REPL (read-eval-print loop). See the dedicated REPL documentation page.

Different modes:

  • The Julia interpreter:

julia> 1+1
2
  • Help mode. Toggled by ?

julia> ? # upon typing ?, the prompt changes (in place) to: help?>

help?> 
  • The shell mode, toggled by ; and immediately followed by a shell command (if you are using Jupyter notebook; otherwise, you can go to a new line with the shell command)

julia> ; echo hello # upon typing ;, the prompt changes (in place) to: shell>

shell> echo hello
hello
  • The package manager mode, toggled by ]

Julia Pkg.jl#

The Julia environment is tracked with two files:

  • Project.toml: contains “user-editable” information:

    • dependencies you use directly (e.g. loaded by using or import)

    • dependency version constraints (optional)

    • package metadata (optional: only if the project is also a package)

  • Manifest.toml: contains the “derived” information necessary to recreate the environment exactly

    • all the recursive dependencies (i.e. the dependencies, their dependencies, etc.)

    • the exact version used (version number / branch name + git sha hash)

To be able to reproduce your code, you need to check both files into git.

To initiate a new environment, start julia from a directory of your choice with the --project=. option. If the current directory does not contain a Project.toml yet, julia will use it as a new, empty environment, and Project.toml and Manifest.toml are written there once you add the first package:

julia --project=.

If instead, you want to use a different path from your current working directory in which a Project.toml already exists, use

julia --project=<relative/path/to/Project.toml>

You can then recreate this environment on another machine by doing:

using Pkg
Pkg.instantiate()

or at the REPL:

]instantiate
  • if you have the Manifest.toml, this will create an environment with the exact versions used

  • if you don’t have the Manifest.toml, this will resolve it (see below)

Specifying dependencies:

  • ]add PackageName / Pkg.add("PackageName"): this will add the most recent registered version

  • ]add PackageName@version / Pkg.add(name="PackageName", version="version"): use a specific version

Other useful Pkg operations:

  • ]resolve / Pkg.resolve(): creates or updates the Manifest if it isn’t in sync with the Project.toml or its dependencies

  • ]up / Pkg.update(): update dependencies to their latest versions (subject to compatibility constraints)

    • This will modify the Manifest.toml file

  • To create a new Julia project, create a directory for it and then activate that directory to make it the “active project”, which package operations manipulate:

    ]activate MyProject
    
  • To use an existing environment local to a specific path rather than your default system environment, invoke julia with the --project flag, as in:

    julia --project=<path>
    
  • To run a script from the command line using the environment in the current directory you can run

    julia --project=. myscript.jl
    
  • Note that julia --project (without a value) searches the current directory and then its parent directories for a Project.toml, so when the current directory contains one, julia --project is equivalent to julia --project=.

3.2 Python#

Analogous reproducible environments in Python can be obtained via Conda. For example

conda env export > environment.yml

will give you an exact snapshot of the packages installed, which can be stored in version control. It can be reproduced with

conda env create -f environment.yml

Unlike Julia environments, exported Conda environments are not reproducible across platforms (e.g. macOS => Linux). Conda can also export only the packages you explicitly requested:

conda env export --from-history > environment.yml

but the environment may then resolve to different package versions on another machine. See Conda: manage environments: sharing an environment for more information.

pipenv (another Python environment manager) has Pipfile and Pipfile.lock, which are roughly analogous to Project.toml and Manifest.toml, but again the Pipfile.lock is tied to a specific platform.

Containers#

Containers are another way to create an isolated environment that can be easily shared. These include:

  • Docker: widely used

  • Singularity (now Apptainer): supported on several HPC systems.

4. IDEs or editors#

We mentioned Integrated Development Environments (IDEs) briefly before. These are very fancy “editors”. Strictly speaking, you can modify code with any text editor. You don’t necessarily need to use an IDE. But using an IDE might make your life a lot easier and can help boost your productivity.

An IDE normally consists of at least a source-code editor, build automation tools, and a debugger. But they can also have nice features like syntax highlighting, code completion, refactoring, code search, and language support, just to name a few.

One of the most popular IDEs at the moment is VSCode. VSCode supports many different “plug-ins” or extensions that can facilitate your work. One of the most useful ones, if you want to collaborate live with others, is the Live Share extension. Live Share enables you to quickly collaborate in real time with a friend, classmate, or professor on the same code without the need to sync code or to configure the same development tools, settings, or environment.