Downloading hundreds of git repos
I wanted to run some machine learning over a few hundred large git repos. I tried a few methods and eventually discovered the simplest answer was really the best.
Initially I thought I would write a quick node script that would use nodegit
and through2-concurrent
to download my repos. Something like this:
const nodegit = require('nodegit');
const through2Concurrent = require('through2-concurrent');

// repoNameStream (hypothetical) is a readable stream of "owner/repo" names, one per chunk
repoNameStream
  .pipe(through2Concurrent.obj({ maxConcurrency: 100 }, function (line, enc, done) {
    const repo_git_address = "git@github.com:" + line;
    const repo_name = line.toString().trim().split("/")[1]; // directory to clone into
    nodegit.Clone(repo_git_address, "./" + repo_name, {})
      .then(() => done(), done);
    // ...
  }));
I found two immediate problems here:
Memory consumption in node can grow quickly with this many concurrent clones.
nodegit was much slower than command-line git.
What I really needed for my research were commits, and git itself is great at downloading commits. So I created a bare repo, added thousands of remote addresses, and let git figure out how to download them all. Easy!
git init --bare all_repos
cd all_repos
git remote add repo1 repo1-address
git remote add repo2 repo2-address
git remote add repo3 repo3-address
# Etc....
git fetch --all
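Of course, nobody types out hundreds of git remote add lines by hand, so the remotes get scripted. Something like this works, assuming a repos.txt (a file name I'm making up here) with one name and address per line:

# repos.txt (hypothetical): one "<remote-name> <address>" pair per line
while read -r name address; do
  git remote add "$name" "$address"
done < repos.txt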
That fetch would download all my commits, but only one repo at a time. A full download would have taken far longer than I was willing to wait.
Well, it turns out a recent version of git added support for fetching from multiple remotes in parallel, using several jobs at once.
git fetch --all --multiple --jobs=100
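The same limit can also live in config instead of on the command line; if I'm reading the docs right, the fetch.parallel setting (added around git 2.24) controls it:

# Assumed equivalent: set the parallel-fetch limit once in config (git 2.24 or newer)
git config fetch.parallel 100
git fetch --all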
Previously I would have had to create hundreds of separate repositories and run a script to manage fetching each one. This is much easier.