25 Packages to distribute code
Packages are a way of distributing software. They sit at an intermediate level of the stack, above infrastructure like operating systems and languages, and below end-user code (especially scripts and notebooks). Good examples are pandas and scipy. The ability of packages to build on each other accounts for much of the power of open source software. That said, packages are also used internally within organizations to share code.
We discussed packages as one of the elements in the material on Stack, Streak and Ecosystem (Chapter 11).
Packages enable us to build on the work of others: we install them and import them into our code. Mostly, then, packages are things we install and use.
Yet eventually we will have code that we want to distribute to others, whether that be classmates, colleagues within our organization, or perhaps the world at large.
There are a few ways that we can distribute code, all with pros and cons.
- Copy and paste.
We can send code to others by simply copying and pasting it, perhaps via Slack or even email. A slightly more advanced approach is to use a gist or pastebin such as https://gist.github.com/ or http://pastebin.com. Copying and pasting is quick, but it is limited: we then have no way of updating the code, or of knowing where it is used.
- Git and GitHub
We can, of course, share code via GitHub or GitLab. Here we publish a repository, either publicly or within our organization, and give people the URL. People can then clone the code into their individual workspaces. At least then a potential user can revisit the repository for updates, and knows where to report issues or even share improvements. Overall, though, GitHub on its own is good for coordinating development, not for distribution.
- Package distribution systems
To address the need for updates, dependency management, and distribution, the open source community developed package management systems. In the Python world, pip is the prime example. These offer much greater effectiveness than distribution through GitHub: the package publisher can provide updates, bugfixes, and improvements, and the user can be informed when updates are available and have them downloaded automatically.
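One thing that makes this possible: pip records metadata (name, version, dependencies) for every package it installs, and the standard library can read that metadata back. A minimal sketch, assuming only that the environment is pip-managed (we query pip itself, since it is present in any such environment):

```python
from importlib import metadata

# pip records each installed package's version as metadata;
# importlib.metadata reads it back without importing the package.
print(metadata.version("pip"))
```

Tools compare this recorded version against what the server offers, which is how "an update is available" notifications work.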
25.1 Advantages of Package Management Systems
Packages and package management offer a few other important advantages:
- Packages keep code cleaner by creating namespaces for functions that tend to go together.
- Packages can be used to set up and maintain virtual environments, so that users can have different versions of packages installed. These can be crucial to help teams know that they are working with the same underlying code.
- Analyses can be more reproducible.
- Packages enable organizations to share code effectively among their teams
- Teams can benefit from the work of others so that everyone can be more efficient
- Different groups can all use the same packages, making their solutions more consistent, so that analyses are comparable.
- Packages can provide an appropriate location for running tests
- Packages provide a way to scale up software. Data scientists often develop code in a notebook, which is very convenient for the analyst but not something that can be made into a web API that thousands of developers or products can use in real time. When we build packages, they can be deployed in the cloud and turned into microservices using orchestration tools like Kubernetes.
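The namespace point from the list above can be seen with a tiny example. Here two standard-library modules stand in for third-party packages, since both happen to define a function with the same name:

```python
import json
import pickle

# Both modules define a function called `dumps`, but each lives in
# its own namespace, so the two never clash.
text = json.dumps({"a": 1})    # text serialization
blob = pickle.dumps({"a": 1})  # binary serialization

print(text)        # → {"a": 1}
print(type(blob))  # → <class 'bytes'>
```

Because every package gets its own namespace, authors can pick natural function names without worrying about collisions with other packages.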
25.2 What is needed for package management? (four steps)
Packaging can be thought about in four steps:
- Encapsulating code in a chunk that can be moved around
- Metadata that describes the package, including authorship, licensing and, crucially, dependencies.
- An installation tool that can bring packages from the server into local environments, and check and manage dependencies.
- A server location to which packages can be published
Different languages and software ecosystems implement these steps in their own ways.
In Python, for example, encapsulation is done with directories containing files with the code ("regular" some_code_name.py files). Metadata is provided by additional files inside the directory. The tool pip handles both installation and dependency resolution. As for the server that hosts packages, there is one broad public server called PyPI (the Python Package Index), but it is also possible for individual organizations to run a version "behind the firewall" to manage packages internally.
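To make the first two steps concrete, here is a sketch that lays out a minimal package directory by hand. The package name `mytools`, its module, and the pyproject.toml contents are all hypothetical; real projects would use a build tool rather than writing these files in a script:

```python
from pathlib import Path
import tempfile

root = Path(tempfile.mkdtemp())

# Encapsulation: a directory containing "regular" .py files.
pkg = root / "mytools"
pkg.mkdir()
(pkg / "__init__.py").write_text("from .stats import mean\n")
(pkg / "stats.py").write_text(
    "def mean(xs):\n"
    '    """Average of a list of numbers."""\n'
    "    return sum(xs) / len(xs)\n"
)

# Metadata: an additional file (pyproject.toml) declaring name,
# version, and -- crucially -- dependencies.
(root / "pyproject.toml").write_text(
    "[project]\n"
    'name = "mytools"\n'
    'version = "0.1.0"\n'
    'dependencies = ["pandas>=2.0"]\n'
)

print(sorted(p.name for p in root.rglob("*")))
```

The remaining two steps (a server such as PyPI, and pip as the install tool) operate on exactly this layout: a build backend turns the directory into an archive, the server hosts it, and pip reads the declared dependencies when installing.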
In R, encapsulation is also done through directories, and metadata through files. Package installation is built into the R base software. There are a few important central repositories, including CRAN (the Comprehensive R Archive Network) and Bioconductor, a separate package system focused on biology packages. (Given that the biology community developed a separate package manager that does pretty much the same thing, we might think of it as an example of convergent evolution.) CRAN has many different mirrors.
One way that code moves from GitHub to servers like PyPI is through GitHub Actions (which we already touched on when learning about CI). See details here: https://packaging.python.org/en/latest/guides/publishing-package-distribution-releases-using-github-actions-ci-cd-workflows/
25.3 Internal Package Management at Companies
Large organizations rarely rely only on the public package managers directly. One practice is to run a private package registry—essentially their own PyPI server— that sits between developers and the public internet. This serves several purposes: hosting proprietary internal packages that can’t be published publicly, caching public PyPI packages for faster installs and protection against outages, and controlling which package versions developers are allowed to use for compliance or security reasons.
The main options for hosting a private registry are:
- Self-hosted: tools like devpi (the standard open-source choice) or pypiserver run on your own infrastructure. devpi adds caching of PyPI packages, multiple indexes, replication, and user management; pypiserver is simpler, just serving packages from a directory.
- Cloud-managed: AWS CodeArtifact and GCP Artifact Registry are fully managed services that handle infrastructure, authentication, and PyPI proxying. CodeArtifact uses IAM for access control, which is convenient if a team is already invested in AWS.
- Enterprise products: JFrog Artifactory supports PyPI repositories including local, remote (proxy/cache), and virtual (aggregated) types, and is common at large firms that manage packages across many languages.
From a career standpoint, internal packaging practices are a useful window into how mature a company’s engineering culture is. Good questions to ask: Do you have a private package registry? How do teams publish and version internal libraries? How do you manage dependency conflicts across teams? Companies with strong practices here tend to have thought carefully about reproducibility and code reuse.
In the R world, Posit (the company that builds RStudio) now offers package management for both R and Python, both publicly and within companies: https://packagemanager.posit.co/client/#/. Similarly, Anaconda, a company based in Austin, offers package management tooling as well (and sometimes hosts meetups).
25.4 Exercises (to be started in class and completed for homework)
25.4.1 Exercise 1: From code to published package in Python
Work through the DataCamp course called Developing Python Packages.
The course shows how to take code and wrap it in a directory with special files, then how to publish it to PyPI (using a special repository just for testing).
25.4.2 Exercise 2: Trace a package you’ve used.
Submit a Markdown file with answers to these questions.
- Identify a package that you have used in the past.
- Where is it developed? Find the repository. How do you know this is the right one?
- Trace its path to your computer by identifying each of the four packaging elements:
- Encapsulation: What does the package directory structure look like in the repo?
- Metadata: Find the metadata file. What dependencies does it declare?
- Distribution server: Where is it published? Find its PyPI page.
- Install tool: How did you install it, and how did pip know where to get it from?
- How did the code get from the repository to PyPI? Can you find evidence of the release process (e.g., GitHub Actions, tagged releases)?
- (bonus) Can you find the package code on your own computer? Python packages know where they are installed. Can you find, write, or generate code that tells you the location of your package on your disk?