The Problem With Packages
Modern software development now depends almost entirely on software packages, mostly for the code reuse they make convenient. Unfortunately, that same code reuse is a double-edged sword leading to “dependency hell”. Even packages created the right way introduce problems.
Releasing just one version of a package already requires a high level of forethought. Failing to tag versions correctly (think “latest master version”) means package consumers lose idempotent installs and are forced to pin commits or re-tag releases in some other, more complex way.
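As a rough illustration (the package and organisation names here are hypothetical), the difference shows up directly in an npm manifest: a dependency pinned to a published version resolves to the same code on every install, while one pointing at a moving branch resolves to whatever that branch contains on the day of the install:

```json
{
  "dependencies": {
    "stable-lib": "1.4.2",
    "risky-lib": "some-org/risky-lib#master"
  }
}
```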
Needless updating happens when unrelated classes are grouped together in the same package (for example, aws-sdk on npm is on version 2.630.0). Once unrelated classes live in one package, picking the right version becomes impossible when a project needs one part of the package at v1 and another part at v2.
Looking carefully at these problems, the answer seems clear: release more packages, each with a specific purpose. That creates a new problem: structuring many packages. It’s easy to introduce a cycle into the dependency graph, where one change becomes unpredictable and hard to debug. It’s also easy to arrange dependencies so that a small, unstable package cascades through every consuming package, requiring far more releases than one isolated change should.
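This is, incidentally, the direction the aws-sdk example above eventually took: v3 of the AWS SDK for JavaScript is published as many scoped, single-purpose packages instead of one monolith. A rough sketch of the difference (assuming both packages are installed):

```ts
// Monolithic: every service client ships and versions together, so any
// change anywhere means a new release of the whole package.
import AWS from "aws-sdk";
const s3Old = new AWS.S3();

// Modular: each client is its own package with its own release cadence,
// so upgrading the S3 client doesn't drag unrelated services along.
import { S3Client } from "@aws-sdk/client-s3";
const s3New = new S3Client({ region: "us-east-1" });
```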
Solving these problems means many small packages should be structured as a graph with no cycles, where the most stable packages are the most depended on and the least stable packages are depended on the least. This is a challenge because the most reused packages also need to be the most abstract (and therefore the most complicated) packages.
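One way to make “most depended on” and “least stable” concrete is the instability metric from the package-principles literature linked at the end of this post: I = Ce / (Ca + Ce), where Ce counts outgoing dependencies and Ca counts incoming dependents, and every dependency edge should point toward a package with equal or lower I. A minimal sketch, with hypothetical package names:

```ts
// Instability I = Ce / (Ca + Ce): Ce = outgoing dependencies,
// Ca = incoming dependents. All package names are hypothetical.
type Graph = Record<string, string[]>; // package -> packages it depends on

const deps: Graph = {
  app:    ["ui", "domain"],
  ui:     ["domain"],
  domain: [], // most depended on, no outgoing deps: the most stable
};

function instability(pkg: string, graph: Graph): number {
  const ce = graph[pkg].length;
  const ca = Object.values(graph).filter((d) => d.includes(pkg)).length;
  return ca + ce === 0 ? 0 : ce / (ca + ce);
}

// Stable Dependencies Principle: no edge should point at a package that is
// less stable (higher I) than the package depending on it.
for (const [pkg, targets] of Object.entries(deps)) {
  for (const target of targets) {
    if (instability(target, deps) > instability(pkg, deps)) {
      console.warn(`${pkg} depends on the less stable ${target}`);
    }
  }
}
```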
With many versions of one package released, consumers want easy upgrade paths: a way to isolate bug fixes, features, and breaking changes. Listing a range of supported versions of a dependency is the obvious way to get this, and it’s usually done with SemVer. The downside is that literally anything can be in those releases. A release labelled as a new feature might break backwards compatibility because the author didn’t anticipate the issue, or a one-line dependency change might force the project to upgrade an entire framework. Projects are at the mercy of every package maintainer to understand how their changes might break other projects, which is an extremely difficult task.
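For example, using the widely used semver package from npm, a caret range accepts any release the maintainer labelled as non-breaking, whether or not it actually is:

```ts
import semver from "semver"; // the "semver" package from npm

// A caret range trusts the maintainer's labelling of what is non-breaking:
console.log(semver.satisfies("2.3.0", "^2.3.0")); // true
console.log(semver.satisfies("2.9.1", "^2.3.0")); // true  -- picked up automatically,
                                                  //          even if 2.9.1 quietly broke something
console.log(semver.satisfies("3.0.0", "^2.3.0")); // false -- only an explicit upgrade reaches here
```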
Even assuming everything was done correctly, there are still issues. Say a project has many small dependencies, organized correctly, and uses SemVer intelligently to negotiate package versions. Assume every release was perfect, with no human error. As the number of packages and releases grows, so does the time it takes to compute a simple package update or add a new dependency.
I’ve seen situations where this can consume well over 6GiB of memory for a simple update. The problem is data availability and the cost of computing changes to the dependency graph. Each project-level dependency must first find all compatible versions of its dependencies, then explore the dependencies of those dependencies in the same way (much like an n+1 problem, but worse, because it’s recursively n+1). As installable candidates are queried and found, they must be filtered down to a globally compatible subset. A few systems like npm have an advantage here: dependencies are installed recursively (nested) instead of on one level (unless using --flat), which lets a large portion of the algorithm be skipped entirely. Caching release versions sounds like a good approach, but a new release anywhere in the package graph invalidates previous solutions that used it (and don’t forget, different versions of one package can have entirely different dependencies too). It seems the only way to mitigate this (without writing a new package manager) is to reduce the total number of dependencies, or accept the problems discussed earlier regarding package authoring.
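Here is a rough sketch of that recursive exploration, run against a tiny in-memory stand-in for a registry (the package names, versions, and registry shape are all hypothetical, and real resolvers add backtracking and global conflict resolution on top of this):

```ts
import semver from "semver";

// Each package version carries its own dependency ranges.
type Manifest = { dependencies: Record<string, string> }; // dep name -> semver range

const registry: Record<string, Record<string, Manifest>> = {
  app_dep: { "1.0.0": { dependencies: { util_a: "^1.0.0" } },
             "1.1.0": { dependencies: { util_a: "^2.0.0" } } },
  util_a:  { "1.0.0": { dependencies: {} },
             "2.0.0": { dependencies: {} } },
};

let lookups = 0;

// Explore every installable candidate reachable from (name, range).
function explore(name: string, range: string, seen: Set<string>): void {
  const versions = Object.keys(registry[name]).filter((v) => semver.satisfies(v, range));
  for (const version of versions) {     // each compatible version is a candidate...
    const key = `${name}@${version}`;
    if (seen.has(key)) continue;
    seen.add(key);
    lookups += 1;                       // ...and each one costs a registry round trip
    const { dependencies } = registry[name][version];
    for (const [dep, depRange] of Object.entries(dependencies)) {
      explore(dep, depRange, seen);     // recursive n+1: dependencies of dependencies
    }
  }
}

explore("app_dep", "^1.0.0", new Set());
console.log(lookups, "candidate manifests fetched");
```

Picking one globally compatible assignment out of all the candidates collected this way is where the real time and memory go.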
Even if a project avoids every problem discussed so far, it’s still possible to get different behaviour from the same software. For example, when requesting a file listing (a “file glob”) on Ubuntu Bionic and on Trusty, the filesystem call returns a predictable, consistent order on each, but that order differs between the two releases. The kernel itself appears to have caused the change, yet there’s no way to express that a project “depends” on this behaviour with modern tooling. There are many more corner cases like this that have nothing to do with the packages used.
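Directory listing order is a good illustration of the general class: the order readdir returns entries in is a property of the filesystem and platform, not a documented contract, so the defensive fix is to sort explicitly rather than rely on whatever comes back. A minimal sketch in Node:

```ts
import { readdirSync } from "fs";

// readdirSync's ordering depends on the underlying filesystem and platform;
// it is not guaranteed to be stable across machines or OS releases.
const entries = readdirSync("./src");      // platform-dependent order
const deterministic = [...entries].sort(); // the same order everywhere
console.log(deterministic);
```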
My goal in discussing these issues is not to discourage the use of packages. They’re still highly valuable and make development better than the alternative. My main focus here is to raise awareness of how small changes can cascade across an ecosystem, and hopefully to attract some interest in the next generation of package management. Remember: keep your packages small, keep your dependency graph stable, and release thoughtfully.
Right now there are some open problems in package management:
Computing dependency graph updates quickly
Tooling to check that a package remains tiny and has only one job (its classes are tightly related, and there is only one reason for it to change)
Organizing a stable package dependency graph so that changes don’t cascade into every consuming package.
Expressing a package API to negotiate compatibility (in theory, it could dictate the next SemVer release version)
Alternative ways to define self-negotiating package versions (e.g. SemVer)
Expressing dependency on behaviours and fulfilled contracts instead of specific packages and version ranges
How to require aspects of the runtime environment that are not part of the code.
If you find this topic interesting, here are a few links to dive deeper into:
Package Principles (or the book “Principles of Package Design” by Matthias Noback)