From a Single Repo, to Multi-Repos, to Monorepo, to Multi-Monorepo
Published: 17.8.2021
The post From a Single Repo, to Multi-Repos, to Monorepo, to Multi-Monorepo appeared first on CSS-Tricks. You can support CSS-Tricks by being an MVP Supporter.
I’ve been working on the same project for several years. Its initial version was a huge monolithic app containing thousands of files. It was poorly architected and non-reusable, but was hosted in a single repo making it easy to work with. Later, I “fixed” the mess in the project by splitting the codebase into autonomous packages, hosting each of them on its own repo, and managing them with Composer. The codebase became properly architected and reusable, but being split across multiple repos made it a lot more difficult to work with.
As the code was reformatted time and again, its hosting in the repo also had to adapt, going from the initial single repo, to multiple repos, to a monorepo, to what may be called a “multi-monorepo.”
Let me take you on the journey of how this took place, explaining why and when I felt I had to switch to a new approach. The journey consists of four stages (so far!) so let’s break it down like that.
Stage 1: Single repo
The project is leoloso/PoP and it’s been through several hosting schemes, following how its code was re-architected at different times.
It was born as a WordPress site, comprising a theme and several plugins. All of the code was hosted together in the same repo.
Some time later, I needed another site with similar features so I went the quick and easy way: I duplicated the theme and added its own custom plugins, all in the same repo. I got the new site running in no time.
I did the same for another site, and then another one, and another one. Eventually the repo was hosting some 10 sites, comprising thousands of files.
Issues with the single repo
While this setup made it easy to spin up new sites, it didn’t scale well at all. The big thing is that a single change involved searching for the same string across all 10 sites. That was completely unmanageable. Let’s just say that copy/paste/search/replace became a routine thing for me.
So it was time to start coding PHP the right way.
Stage 2: Multirepo
Fast forward a couple of years. I completely split the application into PHP packages, managed via Composer and dependency injection.
Composer uses Packagist as its main PHP package repository. In order to publish a package, Packagist requires a composer.json file placed at the root of the package’s repo. That means we are unable to have multiple PHP packages, each of them with its own composer.json, hosted on the same repo.
As a consequence, I had to switch from hosting all of the code in the single leoloso/PoP repo, to using multiple repos, with one repo per PHP package. To help manage them, I created the organization “PoP” in GitHub and hosted all repos there, including getpop/root, getpop/component-model, getpop/engine, and many others.
Issues with the multirepo
Handling a multirepo can be easy when you have a handful of PHP packages. But in my case, the codebase comprised over 200 PHP packages. Managing them was no fun.
The project was split into so many packages because I also decoupled the code from WordPress (so that the packages could also be used with other CMSs), and that requires every package to be very granular, dealing with a single goal.
Now, 200 packages is not ordinary. But even if a project comprises only 10 packages, it can be difficult to manage across 10 repositories. That’s because every package must be versioned, and every version of a package depends on some version of another package. When creating pull requests, we need to configure the composer.json file on every package to use the corresponding development branch of its dependencies. It’s cumbersome and bureaucratic.
In my case, I ended up not using feature branches at all, and simply pointed every package to the dev-master version of its dependencies (i.e. I was not versioning packages). I wouldn’t be surprised to learn that this is a common practice.
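To make the shortcut above concrete, here is a minimal sketch of what “pointing every package to dev-master” amounts to. It assumes that the project’s own packages share the getpop/ vendor prefix; the function name and the version constraints in the example are hypothetical, for illustration only.

```python
import json

# Assumption for this sketch: internal packages share this vendor prefix.
INTERNAL_VENDOR = "getpop/"


def point_to_dev_master(composer: dict) -> dict:
    """Return a copy of a composer.json document in which every
    internal dependency requires the dev-master version."""
    updated = dict(composer)
    if "require" in composer:
        require = dict(composer["require"])
        for package in require:
            if package.startswith(INTERNAL_VENDOR):
                require[package] = "dev-master"
        updated["require"] = require
    return updated


# Hypothetical composer.json content; the "^0.7" constraints are made up.
example = {
    "name": "getpop/engine",
    "require": {
        "php": "^8.0",
        "getpop/root": "^0.7",
        "getpop/component-model": "^0.7",
    },
}

print(json.dumps(point_to_dev_master(example), indent=4))
```

External requirements (such as php) are left alone. With 200 packages, a rewrite like this would have to be repeated and committed on every repo, which is the bureaucracy described above.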
There are tools to help manage multiple repos, like meta. It creates a project composed of multiple repos, and running git commit -m "some message" on the project executes a git commit -m "some message" command on every repo, allowing them to be in sync with each other.
However, meta will not help manage the versioning of each dependency in each package’s composer.json file. Even though it helps alleviate the pain, it is not a definitive solution.
So, it was time to bring all packages to the same repo.
Stage 3: Monorepo
The monorepo is a single repo that hosts the code for multiple projects. Since it hosts different packages together, we can version control them together too. This way, all packages can be published with the same version, and linked across dependencies. This makes pull requests very simple.
As I mentioned earlier, we are not able to publish PHP packages to Packagist if they are hosted on the same repo. But we can overcome this constraint by decoupling development and distribution of the code: we use the monorepo to host and edit the source code, and multiple repos (at one repo per package) to publish them to Packagist for distribution and consumption.
Switching to the Monorepo
Switching to the monorepo approach involved the following steps:
First, I created the folder structure in leoloso/PoP to host the multiple projects. I decided to use a two-level hierarchy, first under layers/ to indicate the broader project, and then under packages/, plugins/, clients/ and whatnot to indicate the category.
Then, I copied all source code from all repos (getpop/engine, getpop/component-model, etc.) to the corresponding location for that package in the monorepo (i.e. layers/Engine/packages/engine, layers/Engine/packages/component-model, etc).
I didn’t need to keep the Git history of the packages, so I just copied the files with Finder. Otherwise, we can use hraban/tomono or shopsys/monorepo-tools to port repos into the monorepo, while preserving their Git history and commit hashes.
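The two-level hierarchy described above can be sketched as a small path-building helper. The mapping of each repo to a layer and a category is an assumption for illustration (only the two resulting paths are taken from the text), and monorepo_path is a hypothetical name.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from a downstream repo to its (layer, category);
# in the real project this is decided per package.
PACKAGE_LOCATIONS = {
    "getpop/engine": ("Engine", "packages"),
    "getpop/component-model": ("Engine", "packages"),
}


def monorepo_path(repo: str) -> str:
    """Build the monorepo location: layers/<Layer>/<category>/<name>."""
    layer, category = PACKAGE_LOCATIONS[repo]
    name = repo.split("/", 1)[1]  # drop the "getpop/" organization part
    return str(PurePosixPath("layers") / layer / category / name)


for repo in PACKAGE_LOCATIONS:
    print(f"{repo} -> {monorepo_path(repo)}")
```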
Next, I updated the description of all downstream repos to start with [READ ONLY].
I executed this task in bulk via GitHub’s GraphQL API. I first obtained all of the descriptions from all of the repos, with this query:
{
  repositoryOwner(login: "getpop") {
    repositories(first: 100) {
      nodes {
        id
        name
        description
      }
    }
  }
}
…which returned a list like this:
{
  "data": {
    "repositoryOwner": {
      "repositories": {
        "nodes": [
          {
            "id": "MDEwOlJlcG9zaXRvcnkxODQ2OTYyODc=",
            "name": "hooks",
            "description": "Contracts to implement hooks (filters and actions) for PoP"
          },
          {
            "id": "MDEwOlJlcG9zaXRvcnkxODU1NTQ4MDE=",
            "name": "root",
            "description": "Declaration of dependencies shared by all PoP components"
          },
          {
            "id": "MDEwOlJlcG9zaXRvcnkxODYyMjczNTk=",
            "name": "engine",
            "description": "Engine for PoP"
          }
        ]
      }
    }
  }
}
From there, I copied all descriptions, added [READ ONLY] to them, and for every repo generated a new query executing the updateRepository GraphQL mutation:
mutation {
  updateRepository(
    input: {
      repositoryId: "MDEwOlJlcG9zaXRvcnkxODYyMjczNTk="
      description: "[READ ONLY] Engine for PoP"
    }
  ) {
    repository {
      description
    }
  }
}
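Generating one mutation per repo is easy to script. The following is a minimal sketch, not the script I actually used: it takes the nodes from a response shaped like the one above and builds the text of each updateRepository mutation. The function name build_mutation is hypothetical, and sending the mutations to the API is left out.

```python
import json


def build_mutation(repository_id: str, description: str) -> str:
    """Return an updateRepository mutation that prefixes the description."""
    new_description = f"[READ ONLY] {description or ''}".strip()
    # json.dumps yields a double-quoted, escaped string, which is also a
    # valid GraphQL string literal for plain text like these descriptions.
    return (
        "mutation {\n"
        "  updateRepository(\n"
        "    input: {\n"
        f"      repositoryId: {json.dumps(repository_id)}\n"
        f"      description: {json.dumps(new_description)}\n"
        "    }\n"
        "  ) {\n"
        "    repository {\n"
        "      description\n"
        "    }\n"
        "  }\n"
        "}"
    )


# One node copied from the response above.
nodes = [
    {
        "id": "MDEwOlJlcG9zaXRvcnkxODYyMjczNTk=",
        "name": "engine",
        "description": "Engine for PoP",
    },
]

for node in nodes:
    print(build_mutation(node["id"], node["description"]))
```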
Finally, I introduced tooling to help “split the monorepo.” Using a monorepo relies on synchronizing the code between the upstream monorepo and the downstream repos, triggered whenever a pull request is merged. This action is called “splitting the monorepo.” Splitting the monorepo can be achieved with a git subtree split command but, because I’m lazy, I’d rather use a tool.
I chose Monorepo builder, which is written in PHP. I like this tool because I can customize it with my own functionality. Other popular tools are the Git Subtree Splitter (written in Go) and Git Subsplit (bash script).
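For reference, the manual route with git subtree split can be sketched as follows. This only builds the commands for one package instead of running them, and it is not how the Monorepo builder works internally. The remote URL, the temporary branch naming and the master target branch are assumptions for illustration.

```python
import shlex
from typing import List


def split_commands(prefix: str, remote_url: str, branch: str = "master") -> List[List[str]]:
    """Return the git commands that publish one monorepo folder to its own repo."""
    # Hypothetical naming scheme for the temporary branch.
    split_branch = "split/" + prefix.strip("/").replace("/", "-")
    return [
        # Extract the history of the folder into a temporary branch.
        ["git", "subtree", "split", f"--prefix={prefix}", "-b", split_branch],
        # Push that branch to the downstream (read-only) repo.
        ["git", "push", remote_url, f"{split_branch}:refs/heads/{branch}"],
        # Delete the temporary branch afterwards.
        ["git", "branch", "-D", split_branch],
    ]


commands = split_commands(
    "layers/Engine/packages/engine",
    "git@github.com:getpop/engine.git",  # assumed remote URL
)
for command in commands:
    print(shlex.join(command))
```

Repeating this for every package on every merge is exactly the chore that a splitting tool automates.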
What I like about the Monorepo
I feel at home with the monorepo. The speed of development has improved because dealing with 200 packages feels pretty much like dealing with just one. The boost is most evident when refactoring the codebase, i.e. when executing updates across many packages.
The monorepo also allows me to release multiple WordPress plugins at once. All I need to do is provide a configuration to GitHub Actions via PHP code (when using the Monorepo builder) instead of hard-coding it in YAML.
To generate a WordPress plugin for distribution, I had created a generate_plugins.yml workflow that triggers when creating a release. With the monorepo, I have adapted it to generate not just one, but multiple plugins, configured via PHP through a custom command in plugin-config-entries-json, and invoked like this in GitHub Actions:
- id: output_data
  run: |
    echo "::set-output name=plugin_config_entries::$(vendor/bin/monorepo-builder plugin-config-entries-json)"
This way, I can generate my GraphQL API plugin and other plugins hosted in the monorepo all at once. The configuration defined via PHP is this one.
class PluginDataSource
{
    public function getPluginConfigEntries(): array
    {
        return [
            // GraphQL API for WordPress
            [
                'path' => 'layers/GraphQLAPIForWP/plugins/graphql-api-for-wp',
                'zip_file' => 'graphql-api.zip',
                'main_file' => 'graphql-api.php',
                'dist_repo_organization' => 'GraphQLAPI',
                'dist_repo_name' => 'graphql-api-for-wp-dist',
            ],
            // GraphQL API - Extension Demo
            [
                'path' => 'layers/GraphQLAPIForWP/plugins/extension-demo',
                'zip_file' => 'graphql-api-extension-demo.zip',
                'main_file' => 'graphql-api-extension-demo.php',
                'dist_repo_organization' => 'GraphQLAPI',
                'dist_repo_name' => 'extension-demo-dist',
            ],
        ];
    }
}
When creating a release, the plugins are generated via GitHub Actions.
If, in the future, I add the code for yet another plugin to the repo, it will also be generated without any trouble. Investing some time and energy producing this setup now will definitely save plenty of time and energy in the future.
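The idea behind the plugin-config-entries-json command is that the entries are printed as JSON, which the workflow can then read. Below is a rough Python equivalent of that idea, assuming the output has to fit on a single line to be passed through the workflow command shown earlier. It is a sketch with hypothetical names, not the actual PHP command.

```python
import json

# The keys every entry is expected to carry, taken from the PHP configuration.
REQUIRED_KEYS = {
    "path",
    "zip_file",
    "main_file",
    "dist_repo_organization",
    "dist_repo_name",
}


def plugin_config_entries_json(entries: list) -> str:
    """Validate the entries and serialize them as single-line JSON."""
    for entry in entries:
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(
                f"Entry {entry.get('path')!r} is missing keys: {sorted(missing)}"
            )
    # json.dumps without an indent produces a single line.
    return json.dumps(entries)


entries = [
    {
        "path": "layers/GraphQLAPIForWP/plugins/graphql-api-for-wp",
        "zip_file": "graphql-api.zip",
        "main_file": "graphql-api.php",
        "dist_repo_organization": "GraphQLAPI",
        "dist_repo_name": "graphql-api-for-wp-dist",
    },
]

print(plugin_config_entries_json(entries))
```

Adding another plugin then means adding another entry; the workflow itself stays untouched.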
Issues with the Monorepo
I believe the monorepo is particularly useful when all packages are coded in the same programming language, tightly coupled, and relying on the same tooling. If instead we have multiple projects based on different programming languages (such as JavaScript and PHP), composed of unrelated parts (such as the main website code and a subdomain that handles newsletter subscriptions), or relying on different tooling (such as PHPUnit and Jest), then I don’t believe the monorepo provides much of an advantage.
That said, there are downsides to the monorepo:
- We must use the same license for all of the code hosted in the monorepo; otherwise, we’re unable to add a LICENSE.md file at the root of the monorepo and have GitHub pick it up automatically. Indeed, leoloso/PoP initially provided several libraries using MIT and the plugin using GPLv2. So, I decided to simplify it using the lowest common denominator between them, which is GPLv2.
- There is a lot of code, a lot of documentation, and plenty of issues, all from different projects. As such, potential contributors that were attracted to a specific project can easily get confused.
- When tagging the code, all packages are versioned with that same tag, whether their particular code was updated or not. This is an issue with the Monorepo builder and not necessarily with the monorepo approach (Symfony has solved this problem for its monorepo).
- The issues board needs proper management. In particular, it requires labels to assign issues to the corresponding project, or risk it becoming chaotic.
None of these issues are roadblocks, though. I can cope with them. However, there is an issue that the monorepo cannot help me with: hosting both public and private code together.
I’m planning to create a “PRO” version of my plugin, which I plan to host in a private repo. However, the code in a repo is either public or private, so I’m unable to host my private code in the public leoloso/PoP repo. At the same time, I want to keep using my setup for the private repo too, particularly the generate_plugins.yml workflow (which already scopes the plugin and downgrades its code from PHP 8.0 to 7.1) and the ability to configure it via PHP. And I want to keep it DRY, avoiding copy/pastes.
It was time to switch to the multi-monorepo.
Stage 4: Multi-monorepo
The multi-monorepo approach consists of different monorepos sharing their files with each other, linked via Git submodules. At its most basic, a multi-monorepo comprises two monorepos: an autonomous upstream monorepo, and a downstream monorepo that embeds the upstream repo as a Git submodule that’s able to access its files.
This approach satisfies my requirements by:
- having the public repo leoloso/PoP be the upstream monorepo, and
- creating a private repo leoloso/GraphQLAPI-PRO that serves as the downstream monorepo.
leoloso/GraphQLAPI-PRO embeds leoloso/PoP under subfolder submodules/PoP (GitHub links to the specific commit of the embedded repo).
Now, leoloso/GraphQLAPI-PRO can access all the files from leoloso/PoP. For instance, script ci/downgrade/downgrade_code.sh from leoloso/PoP (which downgrades the code from PHP 8.0 to 7.1) can be accessed under submodules/PoP/ci/downgrade/downgrade_code.sh.
In addition, the downstream repo can load the PHP code from the upstream repo and even extend it. This way, the configuration to generate the public WordPress plugins can be overridden to produce the PRO plugin versions instead:
class PluginDataSource extends UpstreamPluginDataSource
{
    public function getPluginConfigEntries(): array
    {
        return [
            // GraphQL API PRO
            [
                'path' => 'layers/GraphQLAPIForWP/plugins/graphql-api-pro',
                'zip_file' => 'graphql-api-pro.zip',
                'main_file' => 'graphql-api-pro.php',
                'dist_repo_organization' => 'GraphQLAPI-PRO',
                'dist_repo_name' => 'graphql-api-pro-dist',
            ],
            // GraphQL API Extensions
            // Google Translate
            [
                'path' => 'layers/GraphQLAPIForWP/plugins/google-translate',
                'zip_file' => 'graphql-api-google-translate.zip',
                'main_file' => 'graphql-api-google-translate.php',
                'dist_repo_organization' => 'GraphQLAPI-PRO',
                'dist_repo_name' => 'graphql-api-google-translate-dist',
            ],
            // Events Manager
            [
                'path' => 'layers/GraphQLAPIForWP/plugins/events-manager',
                'zip_file' => 'graphql-api-events-manager.zip',
                'main_file' => 'graphql-api-events-manager.php',
                'dist_repo_organization' => 'GraphQLAPI-PRO',
                'dist_repo_name' => 'graphql-api-events-manager-dist',
            ],
        ];
    }
}
GitHub Actions will only load workflows from under .github/workflows, and the upstream workflows are under submodules/PoP/.github/workflows; hence we need to copy them. This is not ideal, though we can avoid editing the copied workflows and treat the upstream files as the single source of truth.
To copy the workflows over, a simple Composer script can do:
{
  "scripts": {
    "copy-workflows": [
      "php -r \"copy('submodules/PoP/.github/workflows/generate_plugins.yml', '.github/workflows/generate_plugins.yml');\"",
      "php -r \"copy('submodules/PoP/.github/workflows/split_monorepo.yaml', '.github/workflows/split_monorepo.yaml');\""
    ]
  }
}
Then, each time I edit the workflows in the upstream monorepo, I also copy them to the downstream monorepo by executing the following command:
composer copy-workflows
Once this setup is in place, the private repo generates its own plugins by reusing the workflow from the public repo.
I am extremely satisfied with this approach. I feel it has removed all of the burden from my shoulders concerning the way projects are managed. I read about a WordPress plugin author complaining that managing the releases of his 10+ plugins was taking a considerable amount of time. That doesn’t happen here—after I merge my pull request, both public and private plugins are generated automatically, like magic.
Issues with the multi-monorepo
First off, it leaks. Ideally, leoloso/PoP should be completely autonomous and unaware that it is used as an upstream monorepo in a grander scheme, but that’s not the case.
When doing git checkout, the downstream monorepo must pass the --recurse-submodules option so as to also check out the submodules. In the GitHub Actions workflows for the private repo, the checkout must be done like this:
- uses: actions/checkout@v2
  with:
    submodules: recursive
As a result, we have to pass submodules: recursive to the downstream workflow, but not to the upstream one, even though they both use the same source file.
To solve this while maintaining the public monorepo as the single source of truth, the value for submodules is injected into the workflows in leoloso/PoP via an environment variable CHECKOUT_SUBMODULES, like this:
env:
  CHECKOUT_SUBMODULES: ""

jobs:
  provide_data:
    steps:
      - uses: actions/checkout@v2
        with:
          submodules: ${{ env.CHECKOUT_SUBMODULES }}
The environment value is empty for the upstream monorepo, so doing submodules: "" works well. And then, when copying over the workflows from upstream to downstream, I replace the value of the environment variable with "recursive" so that it becomes:
env:
  CHECKOUT_SUBMODULES: "recursive"
(I have a PHP command to do the replacement, but we could also pipe sed in the copy-workflows Composer script.)
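The copy-and-replace step can be sketched in a few lines. This is neither my PHP command nor the sed variant, just a minimal Python illustration of the same idea; the function names are hypothetical, and it assumes the upstream workflow contains exactly one CHECKOUT_SUBMODULES: "" line.

```python
import re
from pathlib import Path

# Matches the upstream line, keeping its indentation in group 1.
_PATTERN = re.compile(r'^([ \t]*CHECKOUT_SUBMODULES:[ \t]*)""[ \t]*$', re.MULTILINE)


def to_downstream(workflow_text: str) -> str:
    """Set CHECKOUT_SUBMODULES to "recursive" in a workflow's text."""
    updated, count = _PATTERN.subn(r'\1"recursive"', workflow_text)
    if count != 1:
        # Fail loudly if the upstream line was removed or duplicated.
        raise ValueError(f'expected one CHECKOUT_SUBMODULES: "" line, found {count}')
    return updated


def copy_workflow(source: Path, target: Path) -> None:
    """Copy an upstream workflow file downstream, applying the replacement."""
    target.write_text(to_downstream(source.read_text(encoding="utf-8")), encoding="utf-8")


upstream = 'env:\n  CHECKOUT_SUBMODULES: ""\n\njobs:\n  provide_data:\n'
print(to_downstream(upstream))
```

Raising an error when the line is missing means a broken upstream workflow is detected at copy time rather than when the downstream checkout silently skips the submodules.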
This leakage reveals another issue with this setup: I must review all contributions to the public repo before they are merged, or they could break something downstream. The contributors would also be completely unaware of those leakages (and they couldn’t be blamed for it). This situation is specific to the public/private-monorepo setup, where I am the only person who is aware of the full setup. While I share access to the public repo, I am the only one accessing the private one.
As an example of how things could go wrong, a contributor to leoloso/PoP might remove CHECKOUT_SUBMODULES: "" since it is superfluous. What the contributor doesn’t know is that, while that line is not needed, removing it will break the private repo.
I guess I need to add a warning!
env:
  ### ☠️ Do not delete this line! Or bad things will happen! ☠️
  CHECKOUT_SUBMODULES: ""
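A comment relies on people reading it. As a complement, a small check could run in the public repo’s CI and fail when the line disappears. The following is only a sketch of that idea, with a hypothetical function name; it is not part of the current setup.

```python
def declares_checkout_submodules(workflow_text: str) -> bool:
    """Tell whether a workflow still declares the CHECKOUT_SUBMODULES variable."""
    for line in workflow_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # a commented-out declaration does not count
        if stripped.startswith("CHECKOUT_SUBMODULES:"):
            return True
    return False


good = 'env:\n  ### Do not delete this line!\n  CHECKOUT_SUBMODULES: ""\n'
bad = "env:\n  # CHECKOUT_SUBMODULES was removed as superfluous\n"

print(declares_checkout_submodules(good))
print(declares_checkout_submodules(bad))
```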
Wrapping up
My repo has gone through quite a journey, being adapted to the new requirements of my code and application at different stages:
- It started as a single repo, hosting a monolithic app.
- It became a multirepo when splitting the app into packages.
- It was switched to a monorepo to better manage all the packages.
- It was upgraded to a multi-monorepo to share files with a private monorepo.
Context means everything, so there is no “best” approach here—only solutions that are more or less suitable to different scenarios.
Has my repo reached the end of its journey? Who knows? The multi-monorepo satisfies my current requirements, but it hosts all private plugins together. If I ever need to grant contractors access to a specific private plugin, while preventing them from accessing other code, then the monorepo may no longer be the ideal solution for me, and I’ll need to iterate again.
I hope you have enjoyed the journey. And, if you have any ideas or examples from your own experiences, I’d love to hear about them in the comments.