robert lin

Scaling Sourcegraph’s managed multi-single-tenant product

2024-08-23T00:00:00+00:00

As the customer base for Sourcegraph’s “multi-single tenant” Sourcegraph Cloud offering grew, I had the opportunity to join the team to scale out the platform to support the hundreds of instances the company aimed to reach - which it does today!

Sourcegraph’s first stab at a managed offering of our traditionally self-hosted, on-premises code search product started way back during my internship at Sourcegraph. Dubbed “managed instances”, this was a “multi-single tenant” product where each “instance” was a normal Sourcegraph deployment operated on isolated infrastructure managed by the company. A rushed implementation was built to serve the very small number of customers that were initially interested in a managed Sourcegraph offering.

Managed Sourcegraph instances proved to be a good model for customers and Sourcegraph engineers alike: customers did not need to deal with the hassle of managing infrastructure and upgrades, and Sourcegraph engineers had direct access to diagnose problems and ensure a smooth user experience. The multi-single-tenant model ensured customer data remained securely isolated.

The decision was made to invest more in the “managed instances” platform with the goal of bringing “Sourcegraph Cloud” to general availability, and eventually make it the preferred option for all customers onboarding to Sourcegraph. A team of talented engineers took over to build what was internally referred to as “Cloud V2”.

I’m pretty proud of the work I ended up doing on this project, the “Cloud control plane”, and am very happy to see what the project has enabled since I left the Sourcegraph Cloud team in September 2023. So I thought it might be cool to write a little bit about what we did!

The prototype
Version 2
Taking things to the control plane
The future

The prototype

The first “managed instances” were managed by copy-pasting Terraform configuration and some basic VM setup scripts. I maintained and worked on this briefly before I rejoined Sourcegraph as a full-time engineer in the Dev Experience team. Operating these first “managed instances” was a very manual ordeal. At the scale of fewer than a dozen instances, a fleet-wide upgrade would take several days of painstakingly performing blue-green deploys for each by copy-pasting Terraform configurations and applying them directly, one instance at a time. The only automation to speak of was some gnarly Terraform-rewriting scripts that I built using TypeScript and Comby to make the task marginally less painful, and even this was prone to breaking on any unexpected formatting of the hand-written Terraform configurations.

The state of the first “managed instances” was a necessary first step to quickly serve the customers that first asked for the offering, and validate that customers were willing to allow a small company like Sourcegraph to hold the keys to their private code and secret sauce. As the customer base grew, however, upgrades were needed.

Version 2

By the time I joined the newly formed “Cloud team” that had inherited the first “managed instances” platform, the sweeping upgrades that comprised “Cloud V2” had been built, and the migration was already underway. These upgrades, largely driven by the talented and knowledgeable Michael Lin, were sorely needed: operating individual Sourcegraph instances with Kubernetes and Helm instead of docker-compose, and leveraging off-the-shelf solutions like GCP Cloud SQL and Terraform Cloud to operate the prerequisite infrastructure. CDKTF was also adopted so that Terraform manifests could be generated using a Go program, instead of being hand-written. Each instance got a YAML configuration file that was used to generate Terraform with CDKTF based on the desired attributes, which all got committed to a centralised configuration repository. These upgrades were the pieces needed to kickstart the company’s transition to bring the Cloud platform to general availability and encourage customers to consider “managed Sourcegraph” as the preferred option to self-hosting.

This infrastructure was managed by a CLI tool we called mi2, based on its predecessor mi, which stood for “managed instances”. The tool was generally run by a human operator to perform operations on the fleet of instances by manipulating its infrastructure-as-code components, such as the aforementioned CDKTF and Kubernetes manifests, based on each instance’s YAML specification. It was also used to configure “runtime” invariants such as application configuration, also based on each instance’s YAML specification.

“Cloud V2” wasn’t the end of the planned upgrades, however. Defining each instance as a YAML configuration was a hint at Michael’s grand vision for the “Cloud V2” platform: to treat instances as Kubernetes custom resources, and manage each instance with individual “instance specifications”, just like any other native Kubernetes resource. The design of the “Cloud V2” instance specifications also featured Kubernetes-like fields, such as spec and status, similar to native Kubernetes resources like Pods, for example:

In the Kubernetes API, Pods have both a specification and an actual status. The status for a Pod object consists of a set of Pod conditions.

In other words, each instance:

…was defined by its spec: the desired state and attributes. For example, the version of Sourcegraph, the domain the instance should be served on, or the number of replicas a particular service should have (for services that can’t scale automatically).
…reports its status: the actual deployed state, as well as details that are only known after deployment, such as randomly generated resource names or IP addresses. A difference between spec and status for attributes that are reflected in both would indicate that the configuration change has not been applied yet.
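To make the spec/status split concrete, here is a minimal Go sketch. The field names (Version, Domain, IPAddress) and the pendingChanges helper are assumptions made up for illustration, not the actual Cloud V2 schema or tooling. It only shows how a difference between spec and status signals a change that has not been applied yet.

```go
package main

import "fmt"

// InstanceSpec is the desired state. Field names are illustrative assumptions.
type InstanceSpec struct {
	Version string
	Domain  string
}

// InstanceStatus is the observed state, plus details that only exist after
// deployment (here, a hypothetical IP address).
type InstanceStatus struct {
	Version   string
	Domain    string
	IPAddress string
}

// Instance pairs the two, like a Kubernetes resource would.
type Instance struct {
	Name   string
	Spec   InstanceSpec
	Status InstanceStatus
}

// pendingChanges lists the attributes present in both spec and status whose
// values differ, meaning the change has not been applied yet.
func pendingChanges(i Instance) []string {
	var pending []string
	if i.Spec.Version != i.Status.Version {
		pending = append(pending, "version")
	}
	if i.Spec.Domain != i.Status.Domain {
		pending = append(pending, "domain")
	}
	return pending
}

func main() {
	inst := Instance{
		Name:   "src-1234",
		Spec:   InstanceSpec{Version: "5.1.0", Domain: "src-1234.example.test"},
		Status: InstanceStatus{Version: "5.0.6", Domain: "src-1234.example.test", IPAddress: "203.0.113.10"},
	}
	fmt.Println(pendingChanges(inst)) // prints: [version]
}
```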

When the team first launched “Cloud V2”, both spec and status were set in the configuration file, such that spec was generally handwritten, and status would be set and written to the file by the platform’s CLI, mi2. In addition, there were some generated Kustomize and Helm assets that also required a human to run some generation command with the mi2 CLI.

This meant that Git diffs representing a change usually had to be made after the change had already been applied to GCP and other infrastructure, so that the status of the instance could be correctly reflected in the repository. This approach was error-prone and constantly caused drift between the repository state (where configurations were committed) and the actual state of an instance in our infrastructure. Because changes to specification and status were closely intertwined, pull requests with updates usually required review, which prolonged the drift between actual status and recorded status while they were left un-merged.

To complicate matters further, there were various other “runtime configurations” that were applied by hand using the mi2 tool. These were needed in scenarios where we did not have an infrastructure-as-code offering off-the-shelf, so we built ad-hoc automation to make API calls against various dependencies to make required configuration changes. This included configuration changes for Sourcegraph itself, and external dependencies like email delivery vendors¹.

The key problems this situation posed were:

The possibility of accidents and conflicts was very real. The consequences of mistakes were also very real, as we were highly reliant on customer trust that the service they paid for would be secure and reliable.
The overhead required to operate the fleet, though much improved from the first “managed instances”, was still high: it was very unlikely the small team could handle a fleet size in the hundreds of instances with the tooling we had.
- To compound the problem further, instances had to be created and torn down on a frequent basis to enable customers to trial the product - this was partially automated but had to be manually triggered, and would frequently require intervention.
We started relying heavily on GitHub Actions for automation. This worked well for simple processes like “create a specification from a template and run the necessary commands to apply it”, but the number of workflows grew, and some of them got very complex. These were difficult to test and prone to typos and conflicts due to the way our “Git ops” setup worked.

To enable the Cloud V2 platform to scale out to more customers reliably, we had to take it further. Michael and I started discussing our next steps in earnest sometime in January 2023. Together, we circulated 2 RFCs within the team: RFC 775: GitOps in Sourcegraph Cloud with Cloud control plane by myself, and RFC 774: Task Framework for Sourcegraph Cloud Orchestration by Michael.²

These two RFCs formed the key building blocks of the “Cloud control plane” project.

Taking things to the control plane

In my RFC, I drew this diagram to try and illustrate the desired architecture:

There’s a lot to unpack here, but the overall gist of the plan was:

There would be no writing-to-the-repository by state changes. Operators (and operator-triggered automations) would commit changes to instance specifications (denoted by the blue box), and the required changes would somewhat opaquely happen in the “control plane”.
- This would significantly reduce conflicts we were seeing in our existing Cloud infrastructure-as-code repository, because changes would now only occur in one direction when a change is merged, without needing to write back what changed to the repository.
The platform would have a central “control plane”, denoted by the green boxes (“Cloud Manager” and “Tasks”).
- “Tasks” are an internal abstraction for serverless jobs using Cloud Run. They allow us to run arbitrary tasks that mirror the mi2 commands a human operator would run today.
- The “Cloud Manager” is the Kubernetes “controllers” service that would manage our Sourcegraph instances. We called it “manager” since that is the terminology used in kubebuilder - in the sense that a single manager service implements multiple “controllers”, and each controller owns the reconciliation of one Kubernetes custom resource type.
We would continue to rely on off-the-shelf components, like existing dependencies on Terraform Cloud and Kubernetes + Helm (illustrated by the brown boxes).

In the central “control plane”, each instance specification would be “applied” as a custom resource in Kubernetes. This is enabled by kubebuilder, which makes it easy to write custom resource definitions (CRDs) and “controllers” for managing each custom resource type.

By defining a custom resource definition, operators can interact with the instance specifications via the Kubernetes API just like any other Kubernetes resource, including using kubectl. For example:

kubectl apply -f environments/dev/deployments/src-1234/config.yaml
kubectl get instance.cloud.sourcegraph.com/src-1234 -o yaml    

Would dump the custom resource from Kubernetes:

apiVersion: cloud.sourcegraph.com/v1
kind: Instance
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {...}
  creationTimestamp: "2023-01-24T00:19:35Z"
  generation: 1
  labels:
    instance-type: trial
  name: src-1234
spec:
  # ...
status:
  # ...

I proposed a design that would build on Michael’s “Tasks” abstraction by representing each “Task” type (for example, “apply these changes to the cluster” or “update the node pool to use another machine type”) with a “subresource” in the control plane. Each subresource would be another custom resource we define, and each subresource type’s sole task would be to detect if changes to the resources it owns need to be reconciled, and execute the required “Task” to bring the relevant resources to the desired state.

graph TD
  Instance --> InstanceTerraform
  Instance --> InstanceKubernetes
  Instance --> InstanceInvariants
  Instance --> UpgradeInstanceTask
  subgraph Subresources
    InstanceTerraform --> t1[Tasks] --> tfc[(Terraform Cloud)]
    InstanceKubernetes --> t2[Tasks] --> gke[(GKE)]
    InstanceInvariants --> t3[Tasks] --> src[(Misc. APIs)]
    UpgradeInstanceTask --> t4[Tasks] --> etc[(...)]
  end

In the diagram above, InstanceTerraform is one of our “subresource” types. It manages changes to an instance’s underlying infrastructure. The example showcases an infrastructure change, for example:

Human operator updates the Instance spec to use a new machine type
Instance controller propagates the change to instance’s child InstanceTerraform spec
InstanceTerraform would detect that its current spec differs from the last known infrastructure state. It will then regenerate the updated Terraform manifests using CDKTF and apply them via Terraform Cloud using “Tasks”.
Once the “task” execution completes, InstanceTerraform will update its own status, which will be reflected by the Instance. This may cause cascading changes to “subsequent” subresources with dependencies on the modified subresource to apply.

Operators would rarely interact directly with these subresources - instead, they would only interact with the top-level Instance definition to request changes to the underlying infrastructure. Changes to the instance specification would automatically propagate to these subresources through the top-level Instance controller. Each subresource implemented an abstraction called “task driver” that generalised the ability for the top-level Instance controller to poll for completion or errors in a uniform manner.

Updated diagram adapted from my original RFC illustrating how the parent "Instance" controller creates child "subresources", which each own reconciling a specific component of an instance's state.
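As a rough sketch of what “polling in a uniform manner” could look like, the Go snippet below walks subresources in dependency order and stops at the first one that is not ready. The TaskDriver interface, the Poll method, and firstBlocked are simplified stand-ins invented for this example. They are not the real task driver API.

```go
package main

import (
	"errors"
	"fmt"
)

// errTaskFailed is a placeholder error for a failed task execution.
var errTaskFailed = errors.New("task failed")

// TaskDriver is a hypothetical, simplified view of what each subresource
// exposes so the parent controller can treat them all the same way.
type TaskDriver interface {
	Name() string
	// Poll reports whether the subresource is ready, or an error if its
	// latest task failed.
	Poll() (ready bool, err error)
}

// fakeSubresource is a stand-in implementation for the example.
type fakeSubresource struct {
	name  string
	ready bool
	err   error
}

func (f fakeSubresource) Name() string        { return f.name }
func (f fakeSubresource) Poll() (bool, error) { return f.ready, f.err }

// firstBlocked walks subresources in order and returns the name of the first
// one that is not ready (or has failed). It returns "" when all are ready.
func firstBlocked(drivers []TaskDriver) (string, error) {
	for _, d := range drivers {
		ready, err := d.Poll()
		if err != nil {
			return d.Name(), fmt.Errorf("subresource %s: %w", d.Name(), err)
		}
		if !ready {
			return d.Name(), nil
		}
	}
	return "", nil
}

func main() {
	drivers := []TaskDriver{
		fakeSubresource{name: "InstanceTerraform", ready: true},
		fakeSubresource{name: "InstanceKubernetes", ready: false},
		fakeSubresource{name: "InstanceInvariants", ready: false},
	}
	blocked, err := firstBlocked(drivers)
	fmt.Println(blocked, err) // prints: InstanceKubernetes <nil>
}
```

Because the parent only depends on the small interface, adding a new subresource type does not require changing the polling logic.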

Reconciliation

This was a pretty new concept for me, though Kubernetes experts out there will probably find this familiar. The idea is to achieve “eventual consistency” by repeatedly “reconciling” an object until the desired state (spec) and actual state (status) are aligned. I think the most relevant dictionary definition is:³

[…] make (one account) consistent with another, especially by allowing for transactions begun but not yet completed.

At reconciliation time, each reconcile should be idempotent - the cause of the reconciliation cannot be used to change its behaviour. The goal of Reconcile implementations should be to bring the actual state closer to the desired state. This means that you don’t need to do everything in a single reconcile: you can do one thing, and then requeue for an update - the next reconciliation should proceed to the next thing, and so on. There may be a difference between actual state and the desired state for some time, but the system will eventually shift to the correct configuration.

For example, consider reconciling object O, where O.x and O.y are not yet in the desired state.

Reconcile on object O. Fix O.x and requeue for another update immediately.
Reconcile on object O (again). O.x is now fixed, so fix O.y and requeue for another update immediately (again).
Reconcile on object O (again!). Everything is in the desired state! Do not requeue for update immediately, because all is now right in this (particular) world.

After the steps above, where O is reconciled several times, all attributes of O are now in the desired state. Nice!
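The walkthrough above can be sketched in plain Go without any Kubernetes dependencies. The Object and Result types here are simplified assumptions (Result only loosely mirrors the idea of a requeue flag). The point is that each Reconcile call looks only at the current state, fixes one thing, and asks to be called again.

```go
package main

import "fmt"

// Object mirrors the example: X and Y each have a desired and an actual value.
type Object struct {
	DesiredX, ActualX int
	DesiredY, ActualY int
}

// Result tells the caller whether to requeue immediately. It is a simplified,
// hypothetical stand-in for a controller's reconcile result.
type Result struct{ Requeue bool }

// Reconcile fixes at most one attribute per call. It never asks why it was
// called: it only compares actual state against desired state.
func Reconcile(o *Object) Result {
	if o.ActualX != o.DesiredX {
		o.ActualX = o.DesiredX
		return Result{Requeue: true}
	}
	if o.ActualY != o.DesiredY {
		o.ActualY = o.DesiredY
		return Result{Requeue: true}
	}
	// Everything is in the desired state, so there is nothing to requeue.
	return Result{}
}

func main() {
	o := &Object{DesiredX: 1, DesiredY: 2}
	passes := 0
	for {
		passes++
		if res := Reconcile(o); !res.Requeue {
			break
		}
	}
	fmt.Println("passes:", passes) // prints: passes: 3
}
```

Calling Reconcile again after the third pass changes nothing, which is the idempotency property described earlier.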

Writing the code

In Kubebuilder code terms (the SDK we use to build custom Kubernetes CRDs), reconciliations are effectively the Reconcile method of a controller implementation being called repeatedly on an object in the cluster. Reconcile implementations can get pretty long, however, even from examples I looked at from other projects. Using gocyclo to evaluate the “cyclomatic complexity” (a crude measure of “how many code paths are in this function”) of the top-level Instance controller today, we get a cyclomatic complexity score just over twice as high as the rule-of-thumb “good” score of 15:

$ gocyclo ./cmd/manager/controllers/instance_controller.go
31 controllers (*InstanceReconciler).Reconcile ./cmd/manager/controllers/instance_controller.go:107:1

Even with a cyclomatic complexity score of 31, this is already fairly abstracted, as a lot of the complicated reconciliation that needs to take place by executing and tracking “Tasks” is delegated to subresource controllers. The top-level Instance controller only handles interpreting what subresources need to be updated to bring the Cloud instance to the desired state.

To keep this complexity under control, I developed a pattern for making “sub-reconcilers”: exposed as package-level Ensure functions, these mini reconcilers would accept a variety of interfaces, with a touch of generics, that help us reuse similar behaviour over many subresources. The largest of these is taskdriver.Ensure, which encapsulates most of the logic required to dispatch task executions, track their progress, and collect their output.

$ gocyclo ./cmd/manager/controllers/taskdriver/taskdriver.go
57 taskdriver Ensure ./cmd/manager/controllers/taskdriver/taskdriver.go:123:1

With a cyclomatic complexity score of 57, this implementation spans around 550 lines, and is covered by nearly 1000 lines of tests providing 72% coverage on taskdriver.Ensure - not bad for a component dealing extensively with integrations.

This investment in a robust, re-usable component has paid dividends: the abstraction serves 5 “subresources” today, each handling a different aspect of Cloud instance management, and generalises the implementation of:

Diff detection: During reconciliation you cannot (by design) refer to a “previous version” of your resource. taskdriver.Ensure handles detecting if a task execution has already been dispatched, and whether a new one needs to be dispatched for the current inputs.
Tracking Task executions: taskdriver.Ensure handles creating Task executions, tracking their status, and collecting their outputs across many reconciles. Notable events are tracked in “conditions”, an ephemeral state field that records the last N interesting events to a subresource.

Sequence of TaskDriver events as viewed in ArgoCD, from creation, to checking for completion, to detected completion.

Concurrency control: Subresources often need global concurrency management (to throttle the rate at which we hit external resources like Terraform Cloud) as well as per-instance concurrency management (e.g. an upgrade can’t happen at the same time as a kubectl apply). taskdriver.Ensure consumes a configurable concurrency controller that can be tweaked based on the workload.
Teardown and orphaned resource management: On deletion of a subresource, taskdriver.Ensure can handle “finalisation” of task resources, deleting past executions in GCP Cloud Run. This is most useful for one-time-use subresources like instance upgrades - over time, we can delete our records of past upgrades for an instance. taskdriver.Ensure has also since been extended to handle picking up and clearing Task executions.
Uniform observability: Logs and metrics emitted by taskdriver.Ensure allow our various subresources to be monitored the same way for alerting and debugging.
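To illustrate the diff detection idea from the list above, here is a small Go sketch. It assumes inputs are fingerprinted from the resource generation plus a hash of its annotations, and that the newest recorded condition comes first. The inputKey and needsDispatch helpers, the Condition type, and the "machine-type" annotation are all hypothetical. The real taskdriver.Ensure logic is considerably more involved.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// inputKey fingerprints the inputs of a task: the resource generation plus a
// hash of its annotations.
func inputKey(generation int64, annotations map[string]string) string {
	keys := make([]string, 0, len(annotations))
	for k := range annotations {
		keys = append(keys, k)
	}
	sort.Strings(keys) // map iteration order is random, so sort for a stable hash

	h := sha256.New()
	for _, k := range keys {
		fmt.Fprintf(h, "%q=%q;", k, annotations[k])
	}
	return fmt.Sprintf("%d-%s", generation, hex.EncodeToString(h.Sum(nil))[:12])
}

// Condition is a hypothetical record of a dispatched task execution.
type Condition struct {
	InputKey string
}

// needsDispatch reports whether a new execution is required: either nothing
// was dispatched yet, or the latest execution was for different inputs.
// The newest condition is assumed to be first in the slice.
func needsDispatch(conditions []Condition, key string) bool {
	if len(conditions) == 0 {
		return true
	}
	return conditions[0].InputKey != key
}

func main() {
	key := inputKey(3, map[string]string{"machine-type": "n2-standard-8"})
	fmt.Println(needsDispatch(nil, key))                          // prints: true
	fmt.Println(needsDispatch([]Condition{{InputKey: key}}, key)) // prints: false
}
```

Since the fingerprint is stored alongside the recorded execution, a reconcile never needs a “previous version” of the resource to decide whether work is outstanding.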

To illustrate how this works in code, because I like interfaces, here’s an abbreviated version of what the abstraction looks like:

// Object is the interface a CRD must implement for managing tasks with Ensure.
//
// Generally, each CRD should only operate one Task type.
type Object[S any] interface {
	object.Kubernetes

	// object.Specified implements the ability to retrieve the driver resource's
	// specification, which should be exactly the Task's explicit inputs.
	object.Specified[S]

	// taskdrivertypes.TaskDriver implements the ability to read condition events for Tasks.
	taskdrivertypes.TaskDriver

	// AddTaskCondition should add cond as the first element in conditions -
	// cond will be the latest condition. This interface is unique to
	// taskdriver.Object, as this package is the only place we should be adding
	// conditions.
	AddTaskCondition(cond cloudv1.TaskCondition)
}

// EnsureOptions denotes parameters for taskdriver.Ensure. All fields are required.
type EnsureOptions[
	// S is the type of subresource spec
	S any,
	// TD is the type of subresource that embeds the spec
	TD Object[S],
] struct {
	// Task is the type of task runs to operate over.
	Task task.Task
	// OwnerName is used when acquiring locks, and should denote the name of the
	// owner of Resource.
	OwnerName string
	// Resource is the resource that drives task runs of this type, changes to
	// the generation (spec) of which should drive a re-run of this
	// reconciliation task.
	Resource TD
	// Events must be provided to record events on Resource.
	Events events.Recorder
  // ...
}

// Ensure creates a reconciliation task run if there isn't one known in
// conditions, or retrieves its status. Both return values may be nil if the
// task is in progress with no error and no result.
//
// The caller MUST call handle.Update on resource if *result.Combined is not nil.
// The caller MUST apply a Status().Update() on resource if a result is returned.
func Ensure[SpecT any, TD Object[SpecT]](
	ctx context.Context,
	logger log.Logger,
	runs task.RunProvider,
	limiter concurrency.Checker,
	opts EnsureOptions[SpecT, TD],
) (_ any, _ result.ObjectUpdate, res *result.Combined) {
  // ...
}

The big hodgepodge of interfaces allows us to do a few things:

Easy mocking in tests: Integration components can easily be provided as mock implementations for robust testing of every aspect of the taskdriver.Ensure lifecycle, which is pretty important given the complexity and business-critical nature of this one function. The taskdriver.Ensure test spans 20+ cases over 1000+ lines of assertions.
Composable interfaces: In other parts of the codebase, we will leverage small parts of a complex implementation to do other sorts of work. For example, taskdrivertypes.TaskDriver indicates that it exposes interfaces for reading a task driver’s conditions - this is a critical part of taskdriver.Ensure, but is also useful for summarization capabilities elsewhere.
Clearly express dependencies: It doesn’t matter too much what a “task run” really means in the context of taskdriver.Ensure, but it’s important to understand that the implementation needs to be able to dispatch runs and check on their status. For that we accept a task.RunProvider, and similarly, we accept a concurrency.Checker, and so on.
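As a sketch of why small interfaces help here, consider a stripped-down concurrency check. The Checker interface below is an assumption for illustration and is not the real concurrency.Checker. The limits type combines a global throttle with a one-task-per-instance rule, and fakeChecker shows how trivially the dependency can be mocked in tests.

```go
package main

import "fmt"

// Checker is a hypothetical, simplified concurrency check.
type Checker interface {
	// Allow reports whether a new task for the given instance may start now.
	Allow(instance string) bool
}

// limits enforces a global cap on in-flight executions and at most one
// in-flight execution per instance.
type limits struct {
	globalMax int
	running   map[string]int // in-flight executions per instance
}

func (l limits) Allow(instance string) bool {
	total := 0
	for _, n := range l.running {
		total += n
	}
	if total >= l.globalMax {
		return false // global throttle, e.g. to protect an external API
	}
	return l.running[instance] == 0 // one task per instance at a time
}

// fakeChecker is the kind of trivial mock a test can pass in.
type fakeChecker bool

func (f fakeChecker) Allow(string) bool { return bool(f) }

// dispatch only depends on the interface, not on how limits are enforced.
func dispatch(c Checker, instance string) string {
	if !c.Allow(instance) {
		return "requeue"
	}
	return "dispatched"
}

func main() {
	l := limits{globalMax: 2, running: map[string]int{"src-1": 1}}
	fmt.Println(dispatch(l, "src-2"))               // prints: dispatched
	fmt.Println(dispatch(l, "src-1"))               // prints: requeue
	fmt.Println(dispatch(fakeChecker(false), "x")) // prints: requeue
}
```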

An abbreviated version of the callsite, a particular subresource’s reconciler, would then look like:

// Reconcile is part of the main kubernetes reconciliation loop which aims to
// move the current state of an upgrade instance task closer to the desired state.
//
// For more details, check Reconcile and its Result here:
// - https://pkg.go.dev/sigs.k8s.io/controller-runtime@v0.14.1/pkg/reconcile
func (r *UpgradeInstanceTaskReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrlResult ctrl.Result, err error) {
	// Get the resource being reconciled
	var resource cloudv1.UpgradeInstanceTask
	if err := r.Get(ctx, req.NamespacedName, &resource); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Find the parent resource.
	instance, logger, err := taskdriver.MustResolveOwner(ctx, logger, r.Client, &resource)
	if err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Set up task execution. Upgrades are immutable task drivers, so we use
	// resource.GetName() for convenience, since our name is unique.
	runProvider, err := r.TaskRunProvider(ctx, logger, *instance, resource.GetName())
	if err != nil {
		return ctrl.Result{}, err
	}
	defer runProvider.Close()

	// Run the taskdriver loop
	_, update, resultErr := taskdriver.Ensure(ctx, logger, runProvider,
		concurrency.NewSubresourceChecker(logger, r.Client, instance.GetName(), &resource,
			// Low concurrency - we are heavily limited by TFC
			concurrency.WithGlobalTypeConcurrency(...)),
		UpgradeInstanceTaskEnsureTaskOptions{
			Task:      upgradeinstance.Task,
			OwnerName: instance.Name,
			Resource:  &resource,
			Events:    r.Events,
		})
	if resultErr != nil {
		update.Handle(ctx, logger, r.Client, &resource)
		return resultErr.Handle(logger, "EnsureTask")
	}

	return ctrl.Result{}, r.Status().Update(ctx, &resource)
}

This allows the system to be easily extended to accommodate more types of subresources to handle different tasks, allowing implementors to focus on the Task execution that gets the work done, before plugging it into the control plane with a fairly small integration surface.

Control plane lifecycle summary

Putting it all together, here’s a diagram I wrote up for some internal dev docs showing the lifecycle of a change a human operator might make to an instance:

sequenceDiagram
    participant Instance
    participant SubResource1
    participant SubResource2
    participant task.RunProvider
    Note right of Instance: Instance spec is updated
    loop InstanceController.Reconcile
      activate Instance
      Note right of Instance: Instance is continuously queued<br/>for reconciliation based on updates,<br/>requeues, or SubResource updates
      alt subresource.Ensure: SubResource1 needs update
        Note right of Instance: Updates are determined by generating<br/>desired SubResource spec and diffing<br/>against the current SubResource spec
        Instance->>SubResource1: Apply updated SubResource1 spec
        activate SubResource1
        loop SubResource1.Reconcile
          Note left of SubResource1: Spec update triggers<br/>SubResource1 reconciliation
          alt taskdriver.Ensure: Task input does not match spec
            Note right of SubResource1: Input diffs are identified<br/>by recording the subresource<br/>generation and a hash of<br/>the annotations provided
            SubResource1->>task.RunProvider: Create new Task execution
            activate task.RunProvider
            SubResource1->>SubResource1: Update status conditions
            Note right of SubResource1: Conditions record execution<br/>state and metadata
            SubResource1->>SubResource1: Requeue for reconcile
          else Task input matches spec
            SubResource1->>task.RunProvider: Is task still running?
            deactivate task.RunProvider
            task.RunProvider-->>SubResource1: Update Status with result
            SubResource1->>SubResource1: Update status conditions
            deactivate SubResource1
            Note left of SubResource1: Instance owns SubResource1,<br/>so a status update will queue<br/>an Instance reconciliation
          end
        end
      else SubResource1 up-to-date
        Instance->>SubResource1: Is SubResource1 ready?
        Note right of Instance: We determine readiness based on the<br/>subresource status and conditions
        SubResource1-->>Instance: Update Status with result
        Note right of Instance: We do not proceed to next SubResource<br/>unless the previous is ready
        alt subresource.Ensure: SubResource2 needs update
          Instance->>SubResource2: Apply updated SubResource2 spec
          Note right of Instance: Not every Instance spec change will<br/>cause every SubResource to update,<br/>because SubResources are subsets<br/>of Instance spec
          activate SubResource2
          loop SubResource2.Reconcile
            alt taskdriver.Ensure: Task input does not match spec
              SubResource2->>task.RunProvider: Create new Task execution
              activate task.RunProvider
              SubResource2->>SubResource2: Update status conditions
              SubResource2->>SubResource2: Requeue for reconcile
            else Task input matches spec
              SubResource2->>task.RunProvider: Is task still running?
              deactivate task.RunProvider
              task.RunProvider-->>SubResource2: Update Status with result
              SubResource2->>SubResource2: Update status conditions
              deactivate SubResource2
            end
          end
        else SubResource2 up-to-date
          Instance->>SubResource2: Is SubResource2 ready?
          SubResource2-->>Instance: Update Status with result
        end
      end
      Note right of Instance: Full reconcile complete!
      deactivate Instance
    end

I don’t know if that helps much, but I think it looks nice!

The future

Sadly, I no longer work on the Sourcegraph Cloud platform, but since its launch, this system has delivered on our goals: today, the Cloud control plane operates over 150 completely isolated single-tenant Sourcegraph instances with a core team of just 2 to 3 engineers, nearly double the size of the fleet when we started this project.

The Cloud control plane has also proven extensible: I’ve seen some pretty nifty extensions built since I departed the project, like an automatic disk resizer and “ephemeral instances”, which can be used internally to deploy a branch of a Sourcegraph codebase to a temporary Cloud instance with just a few commands. Various features have also been added to accommodate scaling needs and specific customer requirements.

The rollout of the Cloud control plane, and adoption of Cloud by customers, have battle-tested the platform, and a lot of work has been done to cover more edge cases and improve the resilience of the Cloud control plane. There are also DX improvements, such as robust support for our internal concepts in ArgoCD, allowing health and progress summaries to be surfaced in a friendly interface:

Note the parent resource ("instance") and the subresources it owns ("instanceinvariants", "instancekubernetes", and friends).

The design of the Cloud control plane has allowed all these additions to be built in a sustainable fashion for the small Cloud team that operates it. The core concepts we initially designed for the Cloud control plane have largely remained intact, which is a relief for sure. I’m very excited to see where else the team goes with the Sourcegraph Cloud offering, both internally and externally!

About Sourcegraph

Sourcegraph builds universal code search for every developer and company so they can innovate faster. We help developers and companies with billions of lines of code create the software you use every day. Learn more about Sourcegraph here.

Interested in joining? We’re hiring!

¹ I built and launched this (“managed SMTP”), which configured an external email vendor automatically so that Cloud instances could start sending emails “off the shelf”.
² RFCs at Sourcegraph used to be primarily published as public Google Documents. This has become a bit rarer over the years, but hopefully this link doesn’t stop working!
³ I just found this with a Google search - the provided definition should have a permalink here.

Investing in the development of the developer experience

2022-10-10T00:00:00+00:00

At Sourcegraph we have a developer tool called sg, which has become the way we ensure the development of tooling continues to scale at Sourcegraph. But why invest in ensuring contributions to your dev tooling scales?

Imagine you’re developing a sizable application spanning multiple services - say, a code search and code intelligence platform like Sourcegraph. You’ll want to be able to spin up everything to some degree locally to help you experiment.

So you pick up an off-the-shelf tool like goreman, a Procfile runner we used to use - but this could be any tool, really, like docker-compose or something else. A tool like this usually takes a bit of configuration, but it works well enough to start off!

goreman -f dev/Procfile

Inevitably you add a few layers of configuration specific to your project for your tool of choice:

export SRC_LOG_LEVEL=${SRC_LOG_LEVEL:-info}
export SRC_LOG_FORMAT=${SRC_LOG_FORMAT:-condensed}

goreman --set-ports=false --exit-on-error -f dev/Procfile

This ends up going in a script or Makefile, to encode this setup as the de-facto way of running things that you can share with your team.

Then you realise your tool doesn’t have hot-reloading, or some other feature, which you end up writing some automation for.

Your little start script ends up with several hundred lines of configuration options, which you can only find out by reading it, and alongside that you have dozens of scripts that do various dev-related tasks:

running adjacent services,
generating code,
or running linters,
or just parts of linters in particular ways in CI,
or combining scripts and configuring them in mysterious ways…

This eventually leads to a frustrating and brittle developer experience.

It’s nearly impossible to find out which development tasks I can run. It’s really hard to run them standalone without knowing about some global state they depend on. It’s really hard to extend these, because who knows which global state might influence them or depend on their global state.

— Thorsten Ball, RFC 348: Lack of conventions

It became hard to find out what tooling was available, how each script was configured, and how to extend them and add to them - hindering progress in our tooling.

That’s why we started sg, Sourcegraph’s developer tool, to become the centralised home for all development tasks.

sg started as a single command to run Sourcegraph locally in March 2021 - today it features over 60 commands covering all sorts of functionality and utilities that you might need throughout your development lifecycle:

dev environment setup
linters
RFC/ADR browser
migrations tooling
CI status checker, flakes investigation tooling, etc.
monitoring tooling
and more!

The tool is built in Go, and thus has the usual good Go stuff - it’s self-contained and portable, so self-updating is easy to build for it. Installation is a simple one-liner, making sg very easy to distribute to teammates:

curl --proto '=https' --tlsv1.2 -sSLf https://install.sg.dev | sh

Introducing Go also enables more powerful, type-safe programming on top of just running commands - programming that is trickier to do in Bash, where you need to account for a more limited syntax and variants of unix commands and so on.

Using a CLI library with commands to represent tasks effectively encodes the available scripts in a powerful structured format, making documentation and configuration options easier to configure and access:

dbCommand = &cli.Command{
    Name:  "db",
    Usage: "Interact with local Sourcegraph databases for development",
    UsageText: `
# Reset the Sourcegraph 'frontend' database
sg db reset-pg

# Reset the 'frontend' and 'codeintel' databases
sg db reset-pg -db=frontend,codeintel

# Reset all databases ('frontend', 'codeintel', 'codeinsights')
sg db reset-pg -db=all

# Reset the redis database
sg db reset-redis

# Create a site-admin user whose email and password are foo@sourcegraph.com and sourcegraph.
sg db add-user -name=foo
`,
    Category: CategoryDev,
    Subcommands: []*cli.Command{
        {
            Name:        "reset-pg",
            Usage:       "Drops, recreates and migrates the specified Sourcegraph database",
            Description: `If -db is not set, then the "frontend" database is used (what's set as PGDATABASE in env or the sg.config.yaml). If -db is set to "all" then all databases are reset and recreated.`,
            Flags: []cli.Flag{
                &cli.StringFlag{
                    Name:        "db",
                    Value:       db.DefaultDatabase.Name,
                    Usage:       "The target database instance.",
                    Destination: &dbDatabaseNameFlag,
                },
            },
            Action: dbResetPGExec,
        },
    },
}

But to make this kind of tool effective, you need more than just converting scripts into a Go program. In developing sg, I’ve noticed some patterns come up that I believe are crucial to its utility - tooling should:

be approachable
work with your tools
codify standards

Tooling should be approachable

Firstly, tooling should be approachable, easy to learn and find out about, and easy to discover. The goal is to abstract implementation details away behind a friendly, usable interface.

For example, with documentation, you might want to meet your users where they are, and provide options for learning - whether it be through complete single-page references in the browser, or directly in the command line.

A structured CLI makes all this easy to generate from a single source of truth so that your documentation is available everywhere and always up-to-date.
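As a rough sketch of what "generate from a single source of truth" could look like, here is a tiny reference generator. It assumes a simplified, hypothetical `Command` type rather than the real CLI library sg uses, and `renderReference` is a name I made up for illustration:

```go
package main

import (
	"fmt"
	"strings"
)

// Command is a hypothetical, simplified stand-in for a CLI library's command type.
type Command struct {
	Name        string
	Usage       string
	Subcommands []*Command
}

// renderReference walks the command tree and renders a plain-text reference,
// indenting each level of subcommands by two spaces.
func renderReference(prefix string, cmd *Command, depth int) string {
	var sb strings.Builder
	full := strings.TrimSpace(prefix + " " + cmd.Name)
	fmt.Fprintf(&sb, "%s%s - %s\n", strings.Repeat("  ", depth), full, cmd.Usage)
	for _, sub := range cmd.Subcommands {
		sb.WriteString(renderReference(full, sub, depth+1))
	}
	return sb.String()
}

func main() {
	db := &Command{
		Name:  "db",
		Usage: "Interact with local databases",
		Subcommands: []*Command{
			{Name: "reset-pg", Usage: "Drop, recreate and migrate a database"},
		},
	}
	// The same tree could just as well feed a web page or a --help screen.
	fmt.Print(renderReference("sg", db, 0))
}
```

Because the reference is derived from the same structure that defines the commands, it cannot drift out of date the way a hand-written page can.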

Using the tool should be intuitive - to help with this, you can provide usability features like autocompletions, which in sg are configured for you during installation. This makes it easy to figure out what you can do on the fly!

When developing new sg commands, adding custom completions is also easy for commands that have a fixed set of possible arguments:

	BashComplete: cliutil.CompleteOptions(func() (options []string) {
		config, _ := getConfig()
		if config == nil {
			return
		}
		for name := range config.Commands {
			options = append(options, name)
		}
		return
	}),

Tooling should work with your tools

Secondly, tooling should interop and work with your tools - one of sg’s goals is specifically to not become a build system or container orchestrator, but to provide a uniform and programmable layer on top of them that is specific to Sourcegraph’s needs.

Take sg start, the command that replaced the goreman setup we talked about earlier, for example. sg start just uses whatever tools each service normally uses to build, run, and update itself, and provides some additional features on top that are specific to how Sourcegraph works. A service configuration might look like:

  oss-frontend:
    cmd: .bin/oss-frontend
    install: |
      if [ -n "$DELVE" ]; then
        export GCFLAGS='all=-N -l'
      fi
      go build -gcflags="$GCFLAGS" -o .bin/oss-frontend github.com/sourcegraph/sourcegraph-public-snapshot/cmd/frontend
    checkBinary: .bin/oss-frontend
    env:
      CONFIGURATION_MODE: server
      USE_ENHANCED_LANGUAGE_DETECTION: false
      # frontend processes need this to be so that the paths to the assets are rendered correctly
      WEBPACK_DEV_SERVER: 1
    watch:
      - lib
      - internal
      - cmd/frontend

You’re not constrained to using sg start - you can run all these steps yourself still with tools of your choice, but sg start combines everything for you into tidied up output, complete with configuration, colours, hot-reloading, and everything you might need to start experimenting with your new features!

Tooling should codify standards

Lastly, tooling should codify standards. Automation and scripting encode best practices that, when shared, build on past learnings to provide a smooth experience for everyone.

Consider the typical process of setting up your development environment, we’ve all been there - a big page of things to install and set up in certain ways:

### Prerequisites

- Install `A`
- Configure the thing
- Install `B`
- Install `C` (but not that version!)

Instead, at Sourcegraph we have sg setup, which automatically figures out what’s missing on your machine…

…and sg will take the steps required to get you set up!

Programming these fixes enables us to standardise installations over time, automatically addressing issues teammates run into so that future teammates won’t have to.

For example, we can configure PATH for you, or make sure things are installed in the right place and configured in the appropriate manner - building on top of other tool managers like Homebrew and asdf to provide a smooth experience.
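To make the "check what’s missing, then fix it" idea a bit more concrete, here is a minimal sketch. It is not how sg setup is actually implemented: the `Dependency` type, `ensure` function, and `errNotInstalled` error are all hypothetical names for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

var errNotInstalled = errors.New("not installed")

// Dependency is a hypothetical description of one thing a dev environment needs.
type Dependency struct {
	Name  string
	Check func() error // returns nil if the dependency is satisfied
	Fix   func() error // attempts to satisfy the dependency
}

// ensure runs each check, applies the fix where a check fails, then re-checks.
// It returns the names of the dependencies that had to be fixed.
func ensure(deps []Dependency) ([]string, error) {
	var fixed []string
	for _, d := range deps {
		if d.Check() == nil {
			continue // already set up, nothing to do
		}
		if err := d.Fix(); err != nil {
			return fixed, fmt.Errorf("fixing %s: %w", d.Name, err)
		}
		if err := d.Check(); err != nil {
			return fixed, fmt.Errorf("%s still failing after fix: %w", d.Name, err)
		}
		fixed = append(fixed, d.Name)
	}
	return fixed, nil
}

func main() {
	// A fake machine state standing in for real checks like "is this binary on PATH?".
	installed := map[string]bool{"A": true}
	dep := func(name string) Dependency {
		return Dependency{
			Name: name,
			Check: func() error {
				if !installed[name] {
					return errNotInstalled
				}
				return nil
			},
			Fix: func() error { installed[name] = true; return nil },
		}
	}
	fixed, err := ensure([]Dependency{dep("A"), dep("B")})
	fmt.Println(fixed, err)
}
```

Re-running the check after each fix means a fix that silently does nothing gets caught immediately, rather than surfacing later as a confusing failure.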

Wrap-up

Enabling the development of good tooling, scripting, and automation makes a difference. There’s a lot that can be done to improve how tooling is developed and improved, like the ideas I’ve brought up in this post - we don’t have to settle for cryptic tooling everywhere!

If you’re interested in how all this is implemented, sg is open source - come check us out on GitHub!

Note - I had originally hoped to present this as a lightning talk at Gophercon Chicago 2022, but I was too late to queue up on the day of the presentations, so I figured might as well turn it into a post.

Anatomy of a logger

2022-05-21T00:00:00+00:00

Zap is a structured logging library from Uber that is built on top of a “reflection-free, zero-allocation JSON encoder” to achieve some very impressive performance compared to other popular logging libraries for Go. As part of developing integrations for it at Sourcegraph, I thought I’d take the time to look at what goes on under the hood.

Logging seems like a simple thing that should be tangential to your application’s concerns - how complicated could writing some output be? Why bother making logging faster at all? The first item in Zap’s FAQ provides a brief explanation:

Of course, most applications won’t notice the impact of a slow logger: they already take tens or hundreds of milliseconds for each operation, so an extra millisecond doesn’t matter.

On the other hand, why not make structured logging fast? […] Across a fleet of Go microservices, making each application even slightly more efficient adds up quickly.

In my personal experience, I’ve seen logging cause some very real issues - a debug statement I left in a Sourcegraph service once caused a customer instance to stall completely!

Metrics indicated jobs were timing out, and a look at the logs revealed thousands upon thousands of lines of random comma-delimited numbers. It seemed that printing all this junk was causing the service to stall, and sure enough setting the log driver to none to disable all output on the relevant service allowed the sync to proceed and continue. […] At scale these entries could contain many thousands of entries, causing the system to degrade. Be careful what you log!

At Sourcegraph we currently use the cheekily named log15 logging library. Of course, a faster logger likely would not have prevented the above scenario from occurring (though we are in the process of migrating to our new Zap-based logger), but here’s a set of (very unscientific) profiles that compare a somewhat “average” scenario of logging 3 fields with 3 fields of existing context in JSON format to demonstrate just how differently Zap and log15 handle rendering a log entry behind the scenes:

const iters = 100000

var (
	thing1 = &thing{Field: "field1", Date: time.Now()}
	thing2 = &thing{Field: "field2", Date: time.Now()}
)

func profileZap(f *os.File) {
	// Create JSON format l with fields, normalised against log15 features
	cfg := zap.NewProductionConfig()
	cfg.Sampling = nil
	cfg.DisableCaller = true
	cfg.DisableStacktrace = true
	l, _ := cfg.Build() // build from cfg so the settings above actually apply
	l = l.With(
		zap.String("1", "foobar"),
		zap.Int("2", 123),
		zap.Any("3", thing1),
	)

	// Start profile and log a lot
	pprof.StartCPUProfile(f)
	for i := 0; i < iters; i += 1 {
		l.Info("message",
			zap.String("4", "foobar"),
			zap.Int("5", 123),
			zap.Any("6", thing2),
		)
	}
	l.Sync()
	pprof.StopCPUProfile()
}

func profileLog15(f *os.File) {
	// Create JSON format l with fields
	l := log15.New(
		"1", "foobar",
		"2", 123,
		"3", thing1,
	)
	l.SetHandler(log15.StreamHandler(os.Stdout, log15.JsonFormat()))

	// Start profile and log a lot
	pprof.StartCPUProfile(f)
	for i := 0; i < iters; i += 1 {
		l.Info("message",
			"4", "foobar",
			"5", 123,
			"6", thing2,
		)
	}
	pprof.StopCPUProfile()
}

The resulting call graphs, generated using go tool pprof -prune_from=^os -png, with log15 on the left and Zap on the right:

Profiles showing CPU time spent throughout log calls, up until it reaches package os code where work begins to write data to disk - log15 is on the left, and zap is on the right. You might have to zoom in a bit.

Check out the pprof documentation for interpreting the callgraph to learn more.

It is not immediately evident how the Zap logger is supposed to be better than the log15 logger, since both finish running pretty quickly, have similar-looking call graphs, and ultimately have I/O as the major bottleneck (the big red os.(*File).write blocks). However, a closer look (like, really close - you gotta zoom all the way in!) reveals a key hint - both loggers spend enough time in JSON encoding stages for the profiler to pick up, but the details of their JSON encoding differ somewhat:

log15 quickly delegates what appears to be the entire log entry to json.Marshal, which accounts for ~6ms.
Zap delegates fields to several different handlers: we see an AddString and AddReflected, where only the latter ends up in the json library, where it only accounts for ~2ms. Presumably, it is handling certain fields differently than others, where in some cases it skips encoding with the json library entirely!

Zap’s documentation provides a brief explanation of why delegating to json is an issue:

For applications that log in the hot path, reflection-based serialisation and string formatting are prohibitively expensive — they’re CPU-intensive and make many small allocations. Put differently, using encoding/json and fmt.Fprintf to log tons of interface{}s makes your application slow.

As a more scientific approach to demonstrating the benefits of Zap’s implementation, here’s a snapshot of the advertised benchmarks against some other popular libraries (as of v1.21.0), emphasis mine:

Log a message and 10 fields:

Package	Time	Time % to zap	Objects Allocated
⚡ zap	2900 ns/op	+0%	5 allocs/op
⚡ zap (sugared)	3475 ns/op	+20%	10 allocs/op
zerolog	10639 ns/op	+267%	32 allocs/op
go-kit	14434 ns/op	+398%	59 allocs/op
logrus	17104 ns/op	+490%	81 allocs/op
apex/log	32424 ns/op	+1018%	66 allocs/op
log15	33579 ns/op	+1058%	76 allocs/op

Log a message with a logger that already has 10 fields of context:

Package	Time	Time % to zap	Objects Allocated
⚡ zap	373 ns/op	+0%	0 allocs/op
⚡ zap (sugared)	452 ns/op	+21%	1 allocs/op
zerolog	288 ns/op	-23%	0 allocs/op
go-kit	11785 ns/op	+3060%	58 allocs/op
logrus	19629 ns/op	+5162%	70 allocs/op
log15	21866 ns/op	+5762%	72 allocs/op
apex/log	30890 ns/op	+8182%	55 allocs/op

In these scenarios, log15 can be a whopping 10 to 50 times slower - very cool! Evidently Zap’s approach has impressive results, and we know roughly what it doesn’t do to achieve this performance - but how does it work in practice?

A writer for log entries

The README suggests the following as the preferred way to create and start using a Zap logger, which is very similar to what I did when I attempted to profile logging calls earlier:

logger, _ := zap.NewProduction()
defer logger.Sync()

Internally, this takes a default, high-level configuration and builds a logger from it using the following components:

a zapcore.Core, which is constructed from:
- a zapcore.Encoder
- a zapcore.WriteSyncer (also referred to as a “sink”)
a bunch of Options

For brevity, let’s forget about the Options for now and focus on the first component: zapcore.Core, which is described as the real logging interface beneath Zap, which exports the more traditional logging methods like .Info(), .Warn(), and so on - the equivalent of an io.Writer for structured logging instead of generic output.

zapcore.Core splits the logging of a message, such as .Info("message", fields...), into the following distinct steps:

Check: Check(Entry, *CheckedEntry) *CheckedEntry that determines if the message should be logged at all. This is where the traditional level filtering comes in (i.e. when you want to only log messages above a certain level, like discarding .Debug() messages), or discarding repeated messages through sampling.
1. In this interface, we get a read-only Entry and a mutable *CheckedEntry that a core registers itself onto if it decides the given Entry should be logged.
Write: Write(Entry, []Field) error, where the rendering of a log entry into the destination occurs.

In addition, we have distinct steps for:

Adding fields to the logger (as opposed to just a specific entry): With([]Field) Core - this allows Core implementations to render fields once and not repeat work for subsequent log entries. We’ll get to how this works later!
1. It’s not noted on the interface documentation, but because of the above, the fields provided to With() are not provided to Write().
Flushing output: Sync() error allows for buffering output and batching writes together, minimising instances of being bottlenecked by I/O, or allowing Core implementations to handle logs in an asynchronous manner.
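To get a feel for how these four steps fit together, here is a toy, standard-library-only sketch. `toyCore`, `Entry`, and `Field` are simplified stand-ins I made up, not Zap’s real types or signatures (for example, the real Check takes a *CheckedEntry, and Write and Sync return errors):

```go
package main

import (
	"fmt"
	"strings"
)

type Level int

const (
	Debug Level = iota
	Info
)

// Entry and Field are simplified stand-ins for zapcore.Entry and zapcore.Field.
type Entry struct {
	Level   Level
	Message string
}

type Field struct{ Key, Value string }

// toyCore mirrors the shape of the Check / Write / With / Sync steps.
type toyCore struct {
	min     Level
	context string           // fields rendered once by With
	out     *strings.Builder // buffered output, handed over by Sync
}

func newToyCore(min Level) *toyCore {
	return &toyCore{min: min, out: &strings.Builder{}}
}

func render(fields []Field) string {
	var sb strings.Builder
	for _, f := range fields {
		fmt.Fprintf(&sb, " %s=%s", f.Key, f.Value)
	}
	return sb.String()
}

// Check reports whether the entry should be logged at all.
func (c *toyCore) Check(e Entry) bool { return e.Level >= c.min }

// With renders the given fields once and returns a core that reuses the result.
func (c *toyCore) With(fields []Field) *toyCore {
	clone := *c
	clone.context = c.context + render(fields)
	return &clone
}

// Write renders the entry; the fields given to With are not passed in again.
func (c *toyCore) Write(e Entry, fields []Field) {
	c.out.WriteString(e.Message + c.context + render(fields) + "\n")
}

// Sync hands over everything buffered so far and clears the buffer.
func (c *toyCore) Sync() string {
	flushed := c.out.String()
	c.out.Reset()
	return flushed
}

func main() {
	core := newToyCore(Info).With([]Field{{"service", "frontend"}})
	for _, e := range []Entry{{Debug, "noisy"}, {Info, "started"}} {
		if core.Check(e) { // the Debug entry is dropped here
			core.Write(e, []Field{{"port", "3080"}})
		}
	}
	fmt.Print(core.Sync())
}
```

Splitting Check from Write means a filtered-out entry costs little more than a level comparison, and splitting With from Write means context fields are rendered once rather than on every log call.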

We can see this in action in the default *zap.Logger implementation. Let’s check out the seemingly innocuous .Info() function:

func (log *Logger) Info(msg string, fields ...Field) {
	if ce := log.check(InfoLevel, msg); ce != nil {
		ce.Write(fields...)
	}
}

Check

First up we have log.check, a whopping 102-line function that implements the check step of writing a log entry, which constructs a zapcore.Entry and calls the core.Check function:

func (log *Logger) check(lvl zapcore.Level, msg string) *zapcore.CheckedEntry {
	// ... omitted for brevity

	// Create basic checked entry thru the core; this will be non-nil if the
	// log message will actually be written somewhere.
	ent := zapcore.Entry{
		LoggerName: log.name,
		Time:       log.clock.Now(),
		Level:      lvl,
		Message:    msg,
	}
	ce := log.core.Check(ent, nil)

	// ...

	return ce
}

Note that log.core.Check(ent, nil) is pretty elaborate here - we noted previously that in this function, Core implementations should register themselves on the second argument CheckedEntry. How does that work if the CheckedEntry argument is a nil pointer? Taking a look at CheckedEntry.AddCore(), we can see the first hints of some pretty aggressive optimization:

// AddCore adds a Core that has agreed to log this CheckedEntry. It's intended to be
// used by Core.Check implementations, and is safe to call on nil CheckedEntry
// references.
func (ce *CheckedEntry) AddCore(ent Entry, core Core) *CheckedEntry {
	if ce == nil {
		ce = getCheckedEntry()
		ce.Entry = ent
	}
	ce.cores = append(ce.cores, core)
	return ce
}

var _cePool = sync.Pool{New: func() interface{} {
	// Pre-allocate some space for cores.
	return &CheckedEntry{
		cores: make([]Core, 4),
	}
}}

func getCheckedEntry() *CheckedEntry {
	ce := _cePool.Get().(*CheckedEntry)
	ce.reset()
	return ce
}

In short, CheckedEntry instances are created or reused on demand (this way, if no cores register themselves to write an Entry, no CheckedEntry is ever created) from a global sync.Pool:

A Pool is a set of temporary objects that may be individually saved and retrieved […] Pool’s purpose is to cache allocated but unused items for later reuse, relieving pressure on the garbage collector. […] Pool provides a way to amortise allocation overhead across many clients.

If many log entries are written in a short time, allocated memory can be recycled by Pool, which is faster than having the Go runtime always allocate new memory and garbage-collect unused CheckedEntry instances.
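If you haven’t used sync.Pool before, the get / reset / put cycle looks roughly like this in isolation. This is a minimal sketch with a hypothetical `entry` type, not Zap’s code:

```go
package main

import (
	"fmt"
	"sync"
)

// entry is a hypothetical reusable object, standing in for something like CheckedEntry.
type entry struct {
	fields []string
}

// reset clears the entry while keeping its already-allocated capacity.
func (e *entry) reset() { e.fields = e.fields[:0] }

var entryPool = sync.Pool{New: func() interface{} {
	// Only called when the pool has nothing to hand out.
	return &entry{fields: make([]string, 0, 4)}
}}

// getEntry fetches an entry from the pool and makes sure it starts out empty.
func getEntry() *entry {
	e := entryPool.Get().(*entry)
	e.reset()
	return e
}

// putEntry returns the entry to the pool so that a later getEntry can reuse it.
func putEntry(e *entry) { entryPool.Put(e) }

func main() {
	e := getEntry()
	e.fields = append(e.fields, "a", "b")
	fmt.Println(len(e.fields))
	putEntry(e)

	// This may or may not be the same object as before (the pool makes no
	// promises), but thanks to reset it always starts out empty.
	reused := getEntry()
	fmt.Println(len(reused.fields))
}
```

The important habit is resetting on the way out of the pool: a pooled object still carries whatever its previous user left in it.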

Write

Then we move on to the write step, done in ce.Write. This is the *zapcore.CheckedEntry we mentioned before performing a write on all registered cores:

func (ce *CheckedEntry) Write(fields ...Field) {
	if ce == nil {
		return
	}

	// ... omitted for brevity

	var err error
	for i := range ce.cores {
		err = multierr.Append(err, ce.cores[i].Write(ce.Entry, fields))
	}

	// ...

	putCheckedEntry(ce)

	// ...
}

func putCheckedEntry(ce *CheckedEntry) {
	if ce == nil {
		return
	}
	_cePool.Put(ce)
}

Note the call to putCheckedEntry - after the entry has been written, it is no longer needed, and this call places the entry back into the pool for reuse. Nifty!

Sent into Write is still an Entry and Fields, however - we’ve yet to see how our message ends up as text, which is where the performance gains are supposed to be.

Encoding and writing output

Looking back, we have two components that are used to create a Core earlier on: zapcore.Encoder and zapcore.WriteSyncer.

	log := New(
		zapcore.NewCore(enc, sink, cfg.Level),
		cfg.buildOptions(errSink)...,
	)

Encoder exports a function, EncodeEntry, that seems to mirror the signature of Core.Write, and also embeds the ObjectEncoder interface:

// Encoder is a format-agnostic interface for all log entry marshalers. Since
// log encoders don't need to support the same wide range of use cases as
// general-purpose marshalers, it's possible to make them faster and
// lower-allocation.
type Encoder interface {
	ObjectEncoder

	// EncodeEntry encodes an entry and fields, along with any accumulated
	// context, into a byte buffer and returns it. Any fields that are empty,
	// including fields on the `Entry` type, should be omitted.
	EncodeEntry(Entry, []Field) (*buffer.Buffer, error)

	// ...
}

In ObjectEncoder we see the promise of a “reflection-free, zero-allocation JSON encoder” in the form of a giant interface, shortened for brevity:

// ObjectEncoder is a strongly-typed, encoding-agnostic interface for adding a
// map- or struct-like object to the logging context. Like maps, ObjectEncoders
// aren't safe for concurrent use (though typical use shouldn't require locks).
type ObjectEncoder interface {
	// Logging-specific marshalers.
	AddObject(key string, marshaler ObjectMarshaler) error

	// Built-in types.
	AddBool(key string, value bool)
	AddDuration(key string, value time.Duration)
	AddInt(key string, value int)
	AddString(key, value string)
	AddTime(key string, value time.Time)

	// AddReflected uses reflection to serialise arbitrary objects, so it can be
	// slow and allocation-heavy.
	AddReflected(key string, value interface{}) error

	// ...
}

This seemingly crazy interface allows messages to be incrementally built in the desired format without ever hitting json.Marshal. For example, we can look at what the JSON encoder does to add a string field:

func (enc *jsonEncoder) AddString(key, val string) {
	enc.addKey(key)
	enc.AppendString(val)
}

We start with adding the key, then the value:

func (enc *jsonEncoder) addKey(key string) {
	enc.addElementSeparator()
	enc.buf.AppendByte('"')
	enc.safeAddString(key)
	enc.buf.AppendByte('"')
	enc.buf.AppendByte(':')
}

Reading this carefully, given a key you’ll end up with the following being added to enc.buf (a bytes buffer):

"key":
^ ^ ^^
| | ||
| | |└ AppendByte(':')
| | └ AppendByte('"')
| └ safeAddString(key)
└ AppendByte('"')

Presumably what comes next is a value, for example a string:

func (enc *jsonEncoder) AppendString(val string) {
	enc.addElementSeparator()
	enc.buf.AppendByte('"')
	enc.safeAddString(val)
	enc.buf.AppendByte('"')
}

"key":"val"
      ^ ^ ^
      | | |
      | | |
      | | └ AppendByte('"')
      | └ safeAddString(val)
      └ AppendByte('"')

Encoding the entire entry in EncodeEntry works similarly, with your typical JSON opening and closing braces being written around the rendered entry:

final.buf.AppendByte('{')

// ... render log entry

final.buf.AppendByte('}')
final.buf.AppendString(final.LineEnding)

{"key":"val"}\n
^           ^ ^
|           | └ AppendString(final.LineEnding)
|           └ AppendByte('}')
└ AppendByte('{')
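Putting the pieces together, here is a much-simplified sketch of the same incremental approach using only the standard library. `miniEncoder`, `appendJSONString`, and `encodeEntry` are hypothetical names, the escaping is far less thorough than Zap’s safeAddString (it assumes valid UTF-8 input), and fmt is used for rare control characters purely to keep the sketch short:

```go
package main

import (
	"fmt"
	"strconv"
)

// miniEncoder is a hypothetical, much-simplified incremental JSON encoder.
type miniEncoder struct {
	buf []byte
}

// appendJSONString appends s as a quoted JSON string, escaping quotes,
// backslashes, and control characters.
func appendJSONString(buf []byte, s string) []byte {
	buf = append(buf, '"')
	for i := 0; i < len(s); i++ {
		c := s[i]
		switch {
		case c == '"' || c == '\\':
			buf = append(buf, '\\', c)
		case c < 0x20:
			buf = append(buf, fmt.Sprintf("\\u%04x", c)...)
		default:
			buf = append(buf, c)
		}
	}
	return append(buf, '"')
}

// addKey writes a separating comma (unless we are right after the opening
// brace), followed by the quoted key and a colon.
func (e *miniEncoder) addKey(key string) {
	if n := len(e.buf); n > 0 && e.buf[n-1] != '{' {
		e.buf = append(e.buf, ',')
	}
	e.buf = appendJSONString(e.buf, key)
	e.buf = append(e.buf, ':')
}

func (e *miniEncoder) AddString(key, val string) {
	e.addKey(key)
	e.buf = appendJSONString(e.buf, val)
}

func (e *miniEncoder) AddInt(key string, val int) {
	e.addKey(key)
	e.buf = strconv.AppendInt(e.buf, int64(val), 10)
}

// encodeEntry writes the braces and line ending around the message and fields.
func encodeEntry(msg string, addFields func(*miniEncoder)) string {
	e := &miniEncoder{buf: []byte{'{'}}
	e.AddString("msg", msg)
	addFields(e)
	e.buf = append(e.buf, '}', '\n')
	return string(e.buf)
}

func main() {
	fmt.Print(encodeEntry("hello", func(e *miniEncoder) {
		e.AddString("key", "val")
		e.AddInt("n", 123)
	}))
}
```

Every typed Add method appends straight into one byte buffer, so there is no intermediate map or reflection over interface{} values.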

Stepping back up a bit, we can now better understand how zapcore.Field works, again condensed for brevity:

type Field struct {
	Key       string
	Type      FieldType
	Integer   int64
	String    string
	Interface interface{}
}

func (f Field) AddTo(enc ObjectEncoder) {
	var err error
	switch f.Type {
	case ObjectMarshalerType:
		err = enc.AddObject(f.Key, f.Interface.(ObjectMarshaler))
	case BoolType:
		enc.AddBool(f.Key, f.Integer == 1)
	case DurationType:
		enc.AddDuration(f.Key, time.Duration(f.Integer))
	case StringType:
		enc.AddString(f.Key, f.String)
	case ReflectType:
		err = enc.AddReflected(f.Key, f.Interface)

	// ...
	}

	// ...
}

Here we can see that for most cases, when one creates a strongly typed field with e.g. zap.String(key string, val string) Field, Zap can track the type information and pass the Field directly to the most appropriate function on the underlying encoder. Together with the fact that the entire log message is constructed incrementally, this means that it’s possible for most log messages to never encounter the need to reflect or use the json package to serialise the message. Nifty! This explains why we spend less time in json in the profile at the start of this post - most of the log message can be serialised directly, except for one field:

l.Info("message",
	zap.String("4", "foobar"),
	zap.Int("5", 123),
	zap.Any("6", thing2), // this goes to AddReflected, which uses JSON to marshal the field
)

To get around this, we could implement ObjectMarshaler, which we saw referenced by AddObject on the ObjectEncoder interface previously. If implemented, we can serialise our object directly in an efficient manner:

type thing struct {
	Field string
	Date  time.Time
}

func (t *thing) MarshalLogObject(enc zapcore.ObjectEncoder) error {
	enc.AddString("Field", t.Field)
	enc.AddTime("Date", t.Date)
	return nil
}

We can re-run the profiling script from the start of the post to see that there’s no more usage of json!

Going back a bit, we can see that this also simplifies the encoding of fields that are added to the logger itself in the Core.With we saw earlier by looking at the ioCore.With implementation, which immediately encodes the given fields:

func (c *ioCore) With(fields []Field) Core {
	clone := c.clone()
	for i := range fields {
		fields[i].AddTo(clone.enc)
	}
	return clone
}

EncodeEntry checks if there are fields already encoded, and adds the partial JSON into the message directly - no additional work needed.
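The effect of encoding context up front can be sketched like so. `ctxLogger` and `appendField` are hypothetical and only handle string fields, and I use strconv.AppendQuote for brevity, which produces Go-style quoting that matches JSON for plain printable text but not for every input:

```go
package main

import (
	"fmt"
	"strconv"
)

// ctxLogger is a hypothetical logger that keeps its context as already-encoded bytes.
type ctxLogger struct {
	context []byte // e.g. "service":"frontend", rendered once in With
}

// appendField appends "key":"val" to buf, comma-separated from earlier fields.
func appendField(buf []byte, key, val string) []byte {
	if len(buf) > 0 {
		buf = append(buf, ',')
	}
	buf = strconv.AppendQuote(buf, key)
	buf = append(buf, ':')
	return strconv.AppendQuote(buf, val)
}

// With encodes the fields immediately and returns a logger carrying the result.
// The receiver's context is copied, so the original logger is left untouched.
func (l *ctxLogger) With(fields ...[2]string) *ctxLogger {
	clone := &ctxLogger{context: append([]byte(nil), l.context...)}
	for _, f := range fields {
		clone.context = appendField(clone.context, f[0], f[1])
	}
	return clone
}

// Encode splices the pre-encoded context into the entry without re-encoding it.
func (l *ctxLogger) Encode(msg string, fields ...[2]string) string {
	body := appendField(nil, "msg", msg)
	if len(l.context) > 0 {
		body = append(body, ',')
		body = append(body, l.context...)
	}
	for _, f := range fields {
		body = appendField(body, f[0], f[1])
	}
	return "{" + string(body) + "}"
}

func main() {
	base := &ctxLogger{}
	l := base.With([2]string{"service", "frontend"})
	fmt.Println(l.Encode("started", [2]string{"port", "3080"}))
	fmt.Println(base.Encode("no context"))
}
```

This lines up with the benchmark scenario of a logger that already has fields of context: the per-entry work no longer grows with the amount of context, since adding it is a single byte copy.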

tl;dr

Turns out, seemingly simple things can be kind of complicated! However, in this case the result is a neat exhibit of a variety of optimization techniques and a logging implementation that can outpace other libraries by an order of magnitude.

Zap’s design also provides some interesting ways to hook into its behaviour - Zap itself offers some examples, such as zaptest, which creates a logger with a custom Writer that sends output to Go’s standard testing library.

At Sourcegraph, our new Zap-based logger offers utilities to hook into our configured logger using Zap’s WrapCore API to assert against log output (mostly for testing the log library itself), partly built on the existing zaptest utilities. We’re also working on custom Core implementations to automatically send logged errors to Sentry, and we wrap Field constructors to define custom behaviours (we disallow importing directly from Zap for this reason). Pretty nifty to still have such a high degree of customizability in an implementation so focused on optimizations!

Dynamic and stateless Kubernetes Jobs for stable CI

2022-04-18T00:00:00+00:00

Sourcegraph’s continuous integration infrastructure uses Buildkite, a platform for running pipelines on CI agents we operate. After using the default approach of scaling persistent agent deployments for a long time, we’ve recently switched over to completely stateless agents on dynamically dispatched Kubernetes Jobs to improve the stability of our CI pipelines.

In Buildkite, events (such as a push to a repository) trigger “builds” on a “pipeline” that consist of multiple “jobs”, each of which corresponds to a “pipeline step”. All of this is managed by the hosted Buildkite service, which then dispatches Buildkite jobs onto any Buildkite agents that are live on our infrastructure that meet each job’s “queue” requirements.

Previously, our Buildkite agent fleet was operated as a simple Kubernetes Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: buildkite-agent
  # ...
spec:
  replicas: 5
  # ...
  template:
    metadata:
      # ...
    spec:
      containers:
        - name: buildkite-agent
          # ...

A separate deployment, running a custom service called buildkite-autoscaler, would poll the Buildkite API for a list of running and scheduled jobs and scale the fleet accordingly by making a Kubernetes API call to update the spec.replicas value in the base Deployment:

sequenceDiagram
    participant ba as buildkite-autoscaler
    participant k8s as Kubernetes
    participant bk as Buildkite

    loop
        ba->>bk: list running, pending jobs
        activate bk
        bk-->>ba: job queue counts
        deactivate bk

        activate ba
        ba->>ba: determine desired agent count

        ba->>k8s: get Deployment 
        deactivate ba
        activate k8s
        k8s-->>ba: active Deployment
        ba->>k8s: list Deployment Pods
        k8s-->>ba: active Pods
        deactivate k8s

        ba->>k8s: set spec.replicas to desired
    end

As long as there are jobs in the Buildkite queue, deployed agent pods would remain online until the autoscaler deems it appropriate to scale down. As such, multiple jobs could be dispatched onto the same agent before the fleet gets scaled down.

While Buildkite has mechanisms for mitigating state issues across jobs, and most Sourcegraph pipelines have cleanup and best practices for mitigating them as well, we occasionally still run into “botched” agents. These are particularly prevalent in jobs where tools are installed globally, or Docker containers are started but not correctly cleaned up (for example, if directories are mounted), and so on. We’ve also had issues where certain pods encounter network issues, causing them to fail all the jobs they accept. We also have jobs that work “by accident”, especially in some of our more obscure repositories, where jobs rely on tools being installed by other jobs, and suddenly stop working if they land on a “fresh” agent, or those tools get upgraded unexpectedly.

All of these issues eventually led us to decide to build a stateless approach to running our Buildkite agents.

Preparing for the switch

The main Sourcegraph mono-repository, sourcegraph/sourcegraph, uses generated pipelines that create pipelines on the fly for Buildkite. Thanks to this, we could easily implement a flag within the generator to redirect builds to the new agents on a gradual basis.

var FeatureFlags = featureFlags{
	StatelessBuild: os.Getenv("CI_FEATURE_FLAG_STATELESS") == "true" ||
		// Roll out to 50% of builds
		rand.NewSource(time.Now().UnixNano()).Int63()%100 < 50,
}

This feature flag could be used to apply queue configuration and environment variables on builds, allowing us to easily test out larger loads on the new agents and roll back changes with ease.

Static Kubernetes Jobs

The initial approach undertaken by the team used a single persistent Kubernetes Job. Agents would start up with --disconnect-after-job, indicating that they should consume a single job from the queue and immediately disconnect.

A new autoscaler service, job-autoscaler, was set up that pretty much did the exact same thing as the old buildkite-autoscaler, but instead of adjusting spec.replicas, it updated spec.parallelism instead, setting spec.completions and spec.backoffLimit to arbitrarily large values to prevent the Job from ever completing and shutting down.

This initial approach was used to iterate on some refinements to our pipelines to accommodate stateless agents (namely improved caching of resources). Upon rolling this out on a larger scale, however, we immediately ran into issues resulting in major CI outages, after which I outlined my thoughts in sourcegraph#32843 dev/ci: stateless autoscaler: investigate revamped approach with dynamic jobs. It turns out, we probably should not be applying a stateful management approach (scaling a single Job entity up and down) to what should probably be a stateless queue processing mechanism. I decided to take point on re-implementing our approach.

Dynamic Kubernetes Jobs

In sourcegraph#32843 I proposed an approach where we dispatch agents by creating new Kubernetes Jobs with spec.parallelism and spec.completions set to roughly the number of agents needed to process all the jobs within the Buildkite jobs queue. This would mean that as soon as all the agents within a dispatched Job are “consumed” (have processed a Buildkite job and exited), Kubernetes can clean up the Job and related resources, and that would be that. If more agents are needed, we simply keep dispatching more Jobs. This is done by a new service called buildkite-job-dispatcher.

Luckily, all the setup has been done for stateless agents with the existing Buildkite Job, so the way the dispatcher works is by fetching the deployed Job, resetting a variety of fields used internally by Kubernetes:

in metadata: UID, resource version, and labels
in the Job spec: selector and template.metadata.labels

Making a few changes:

setting parallelism = completions = number of jobs in queue + buffer
- this means that we are dispatching agents to consume the queue, and exit when done
setting activeDeadlineSeconds, ttlSecondsAfterFinished to reasonable values
- activeDeadlineSeconds prevents stale agents from sitting around for too long in case, for example, a build gets cancelled
- ttlSecondsAfterFinished ensures resources are freed after use
adjusting the BUILDKITE_AGENT_TAGS environment variable on the Buildkite agent container

And deploying the adjusted spec as a new Job!
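The sizing step in the list above can be sketched roughly as follows. This is a minimal illustration, not the dispatcher's actual code: `dispatchPlan`, `planDispatch`, and the choice to subtract already-running agents and cap at a maximum are all assumptions on my part.

```go
package main

// dispatchPlan holds the values that would be written into the new Job spec.
// Hypothetical type; the real dispatcher's types are not shown in this post.
type dispatchPlan struct {
	Parallelism int32
	Completions int32
}

// planDispatch sizes a Job so that parallelism == completions. The count is
// queued jobs plus a buffer, minus agents that already exist, and never
// pushes the fleet beyond maxAgents. It returns false when no Job is needed.
func planDispatch(queuedJobs, buffer, runningAgents, maxAgents int) (dispatchPlan, bool) {
	needed := queuedJobs + buffer - runningAgents
	if room := maxAgents - runningAgents; needed > room {
		needed = room
	}
	if needed <= 0 {
		return dispatchPlan{}, false
	}
	return dispatchPlan{
		Parallelism: int32(needed),
		Completions: int32(needed),
	}, true
}
```

Because parallelism and completions are equal, every Pod in the Job runs at most one Buildkite job, and the Job finishes on its own once all of its agents have exited.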

sequenceDiagram
    participant ba as buildkite-job-dispatcher
    participant k8s as Kubernetes
    participant bk as Buildkite
    participant gh as GitHub

    loop
      gh->>bk: enqueue jobs
      activate bk

      ba->>bk: list queued jobs and total agents
      bk-->>ba: queued jobs, total agents

      activate ba
      ba->>ba: determine required agents 
      alt queue needs agents
        ba->>k8s: get template Job
        activate k8s
        k8s-->>ba: template Job
        deactivate k8s

        ba->>ba: modify Job template

        ba->>k8s: dispatch new Job
        activate k8s
        k8s->>bk: register agents
        bk-->>k8s: assign jobs to agents

        loop while % of Pods not online or completed
          par deployed agents process jobs
            k8s-->>bk: report completed jobs
            bk-->>gh: report pipeline status
            deactivate bk
          and check previous dispatch
            ba->>k8s: list Pods from dispatched Job
            k8s-->>ba: Pods states
          end
        end
      end
      deactivate ba

      k8s->>k8s: Clean up completed Jobs

      deactivate k8s
    end

As noted in the diagram above, there’s also a “cooldown” mechanism where the dispatcher waits for the previous dispatch to roll out at least partially before dispatching a new Job to account for delays in our infrastructure. Without it, the dispatcher could continuously create new agents as the visible agent count appears low, leading to overprovisioning. We do this by simply listing the Pods associated with the most recently dispatched Job, which is easy enough to track within the dispatcher.
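A minimal sketch of that cooldown check, assuming the dispatcher has already listed the Pod phases of the previous Job. The `rolloutReady` name, the ratio parameter, and the decision to count failed Pods as "completed" are assumptions for illustration; the phase strings are the standard Kubernetes Pod phases.

```go
package main

// podPhase mirrors the Kubernetes Pod phase strings.
type podPhase string

const (
	phasePending   podPhase = "Pending"
	phaseRunning   podPhase = "Running"
	phaseSucceeded podPhase = "Succeeded"
	phaseFailed    podPhase = "Failed"
)

// rolloutReady reports whether enough Pods from the previous dispatch are
// online or finished for a new dispatch to go ahead. expected is the
// parallelism of the previous Job and minRatio is a value in [0, 1].
func rolloutReady(phases []podPhase, expected int, minRatio float64) bool {
	if expected <= 0 {
		return true // nothing was dispatched, so there is nothing to wait for
	}
	ready := 0
	for _, p := range phases {
		// Assumption: a failed Pod is treated as completed rather than pending.
		if p == phaseRunning || p == phaseSucceeded || p == phaseFailed {
			ready++
		}
	}
	return float64(ready)/float64(expected) >= minRatio
}
```

Dividing by the expected count rather than the number of listed Pods matters here: Pods that Kubernetes has not created yet should count against the rollout, not be ignored.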

Observability

buildkite-job-dispatcher runs on a loop, with each run associated with a dispatchID, a simplified UUID with all special characters removed. Everything that happens within a dispatch iteration is associated with this ID, starting with log entries, built on go.uber.org/zap:

import "go.uber.org/zap"

func (d *Dispatcher) run(ctx context.Context, k8sClient *k8s.Client, dispatchID string) error {
	// Allows us to key in on a specific dispatch run when looking at logs
	runLog := d.log.With(zap.String("dispatchID", dispatchID))
	runLog.Debug("start run", zap.Any("config", config))
	// {"msg":"start run","dispatchID":"...","config":{...}}
}
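For reference, an ID of this shape (a UUID with the hyphens stripped) can be produced with only the standard library. This is a sketch under the assumption that a random version 4 UUID is used; `newDispatchID` is a hypothetical name and the real dispatcher may well use a UUID library instead.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
)

// newDispatchID returns 32 lowercase hex characters: the bytes of a random
// (version 4) UUID rendered without the usual hyphens.
func newDispatchID() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40 // set the version nibble to 4
	b[8] = (b[8] & 0x3f) | 0x80 // set the RFC 4122 variant bits
	return hex.EncodeToString(b[:]), nil
}
```

Dropping the hyphens keeps the value usable anywhere the ID ends up, such as resource names and label values, without further escaping.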

Dispatched agents have the dispatch ID attached to their name and labels as well:

apiVersion: batch/v1
kind: Job
metadata:
  annotations:
    description: Stateless Buildkite agents for running CI builds.
    kubectl.kubernetes.io/last-applied-configuration: # ...
  creationTimestamp: "2022-04-18T00:04:34Z"
  labels:
    app: buildkite-agent-stateless
    dispatch.id: 3506b2adb17945d7b690bd5f9e6a6fb0
    dispatch.queues: stateless_standard_default_job

This means that when something unexpected happens (for example, when agents are underprovisioned or overprovisioned), we can easily look at the Jobs dispatched and link back to the log entries associated with their creation.

The dispatcher’s structured logs also allow us to leverage Google Cloud’s log-based metrics by generating metrics from numeric fields within log entries. These metrics form the basis for our at-a-glance overview dashboard of the state of our Buildkite agent fleet and how the dispatcher is responding to demand, as well as alerting for potential issues (for example, if Jobs are taking too long to roll out).

Based on these metrics, we can make adjustments to the numerous knobs available for fine-tuning the behaviour of the dispatcher: target minimum and maximum agents, the frequency of polling, the ratio of agents to require to come online before starting a new dispatch, agent TTLs, and more.

Git mirror caches

During the initial stateless agent implementation, my teammates @jhchabran and @davejrt developed some nifty mechanisms for caching asdf (a tool version manager) and Yarn dependencies. These use a Buildkite plugin for caching under the hood, and expose a simple API for use with Sourcegraph’s generated pipelines:

func withYarnCache() buildkite.StepOpt {
	return buildkite.Cache(&buildkite.CacheOptions{
		ID:          "node_modules",
		Key:         "cache-node_modules-{{ checksum 'yarn.lock' }}",
		RestoreKeys: []string{"cache-node_modules-{{ checksum 'yarn.lock' }}"},
		Paths:       []string{"node_modules", /* ... */},
		Compress:    false,
	})
}

func addPrettier(pipeline *bk.Pipeline) {
	pipeline.AddStep(":lipstick: Prettier",
		withYarnCache(),
		bk.Cmd("dev/ci/yarn-run.sh format:check"))
}

A lingering problem continued to be the initial clone step, however, especially in the main sourcegraph/sourcegraph monorepo, which can take upwards of 30 seconds to perform a shallow clone. We can’t entirely depend on shallow clones either, since our pipeline generator depends on performing diffs against our main branch to determine how to construct a pipeline. This is especially painful for short steps, where the time to run a linter check might be around the same amount of time it takes to perform a clone.

Buildkite supports a feature that allows all jobs on a single host to share a single git clone, using git clone --mirror. Subsequent clones after the initial clone can leverage the mirror repository with git clone --reference:

If the reference repository is on the local machine, […] obtain objects from the reference repository. Using an already existing repository as an alternate will require fewer objects to be copied from the repository being cloned, reducing network and local storage costs.

On our old stateful agents, this means that while some jobs can take the same 30 seconds to clone the repository, most jobs that land on “warm” agents will have a much faster clone time - roughly 5 seconds.

To recreate this feature on our stateless agents, I created a daily cron job that:

Creates a disk in Google Cloud, with gcloud compute disks create buildkite-git-references-"$BUILDKITE_BUILD_NUMBER"
Deploys a Kubernetes PersistentVolume and PersistentVolumeClaim corresponding to the new disk
Deploys a Kubernetes Job that mounts the generated PersistentVolumeClaim and creates a clone mirror
Updates the PersistentVolumeClaim to be labelled state: ready

We generate resources to deploy using envsubst <$TEMPLATE >$GENERATED on a template spec. For example, the PersistentVolume template spec looks like:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: buildkite-git-references-$BUILDKITE_BUILD_NUMBER
  namespace: buildkite
  labels:
    deploy: buildkite
    for: buildkite-git-references
    state: $PV_STATE
    id: '$BUILDKITE_BUILD_NUMBER'
spec:
  accessModes:
    - ReadWriteOnce
    - ReadOnlyMany
  claimRef:
    name: buildkite-git-references-$BUILDKITE_BUILD_NUMBER
    namespace: buildkite
  gcePersistentDisk:
    fsType: ext4
    # the disk we created with 'gcloud compute disks create'
    pdName: buildkite-git-references-$BUILDKITE_BUILD_NUMBER
  capacity:
    storage: 16G
  persistentVolumeReclaimPolicy: Delete
  storageClassName: buildkite-git-references

PersistentVolumes are created with accessModes: [ReadWriteOnce, ReadOnlyMany] - the idea is that we will mount it as ReadWriteOnce to populate the disk with a mirror repository, before allowing all our agents to mount the disk as ReadOnlyMany:

apiVersion: batch/v1
kind: Job
metadata:
  name: buildkite-git-references-populate
  namespace: buildkite
  annotations:
    description: Populates the latest buildkite-git-references disk with data.
spec:
  parallelism: 1
  completions: 1
  ttlSecondsAfterFinished: 240 # allow us to fetch logs
  template:
    metadata:
      labels:
        app: buildkite-git-references-populate
    spec:
      containers:
        - name: populate-references
          image: alpine/git:v2.32.0
          imagePullPolicy: IfNotPresent
          command: ['/bin/sh']
          args:
            - '-c'
            # Format:
            # git clone git@github.com:sourcegraph/$REPO /buildkite-git-references/$REPO.reference;
            - |
              mkdir /root/.ssh; cp /buildkite/.ssh/* /root/.ssh/;
              git clone git@github.com:sourcegraph/sourcegraph.git \
                /buildkite-git-references/sourcegraph.reference;
              echo 'Done';
          volumeMounts:
            - mountPath: /buildkite-git-references
              name: buildkite-git-references
      restartPolicy: OnFailure
      volumes:
        - name: buildkite-git-references
          persistentVolumeClaim:
            claimName: buildkite-git-references-$BUILDKITE_BUILD_NUMBER

The buildkite-job-dispatcher can now simply list all the available PersistentVolumeClaims that are ready:

var gitReferencesPVC *corev1.PersistentVolumeClaim
var listGitReferencesPVCs corev1.PersistentVolumeClaimList
if err := k8sClient.List(ctx, config.TemplateJobNamespace, &listGitReferencesPVCs,
  k8s.QueryParam("labelSelector", "state=ready,for=buildkite-git-references"),
); err != nil {
  runLog.Error("failed to fetch buildkite-git-references PVCs", zap.Error(err))
} else {
  gitReferencesPVCs := PersistentVolumeClaims(listGitReferencesPVCs.GetItems())
  pvcCount := zapMetric("pvcs", len(gitReferencesPVCs))
  if len(gitReferencesPVCs) > 0 {
    sort.Sort(gitReferencesPVCs)
    gitReferencesPVC = gitReferencesPVCs[0]
  } else {
    runLog.Warn("no buildkite-git-references PVCs found", pvcCount)
  }
}

And apply it to the agent Jobs we dispatch:

if gitReferencesPVC != nil {
  job.Spec.Template.GetSpec().Volumes = append(job.Spec.Template.GetSpec().GetVolumes(),
    &corev1.Volume{
      Name: stringPtr("buildkite-git-references"),
      VolumeSource: &corev1.VolumeSource{
        PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{
          ClaimName: gitReferencesPVC.GetMetadata().Name,
          ReadOnly:  boolPtr(true),
        },
      },
    })
  agentContainer.VolumeMounts = append(agentContainer.GetVolumeMounts(),
    &corev1.VolumeMount{
      Name:      stringPtr("buildkite-git-references"),
      ReadOnly:  boolPtr(true),
      MountPath: stringPtr("/buildkite-git-references"),
    })
}

And that’s it! We now have repository clone times that are consistently within the 3-7 seconds range, depending on how much your branch has diverged from main. As new disks become available, newly dispatched agents will automatically leverage more up-to-date mirror repositories.
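The `sort.Sort` call in the listing snippet earlier relies on a `PersistentVolumeClaims` type that implements `sort.Interface`, which is not shown. A minimal sketch of what such a type could look like, assuming claims are ordered newest first by creation time (the `claim` struct and its fields are stand-ins, not the Kubernetes client types):

```go
package main

import "sort"

// claim is a stand-in for the fields of a PersistentVolumeClaim used here.
type claim struct {
	Name        string
	CreatedUnix int64 // creation timestamp in seconds since the epoch
}

// claimsByNewest implements sort.Interface, ordering newest claims first.
type claimsByNewest []claim

func (c claimsByNewest) Len() int           { return len(c) }
func (c claimsByNewest) Swap(i, j int)      { c[i], c[j] = c[j], c[i] }
func (c claimsByNewest) Less(i, j int) bool { return c[i].CreatedUnix > c[j].CreatedUnix }

// newestClaim returns the most recently created claim, if any, without
// modifying the slice passed in.
func newestClaim(claims []claim) (claim, bool) {
	if len(claims) == 0 {
		return claim{}, false
	}
	sorted := append(claimsByNewest(nil), claims...)
	sort.Sort(sorted)
	return sorted[0], true
}
```

With this ordering, taking the first element after sorting always picks the freshest mirror, which is what lets newly dispatched agents move to a new disk as soon as it is labelled ready.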

Within the same daily cron job that deploys these disks, we can also prune disks that are no longer used by any agents:

kubectl describe pvc -l for=buildkite-git-references,id!="$BUILDKITE_BUILD_NUMBER" |
  grep -E "^Name:.*$|^Used By:.*$" | grep -B 2 "" | grep -E "^Name:.*$" |
  awk '$2 {print$2}' |
  while read -r vol; do kubectl delete pvc/"${vol}" --wait=false; done

Interestingly enough, there is no easy way to detect whether a PersistentVolumeClaim is completely unused. We can detect unbound disks easily, but that doesn’t mean the same thing - in this setup PersistentVolumes are always bound, whether or not the corresponding PersistentVolumeClaim is in use. kubectl describe has this information though¹, which is what the above script (based on this StackOverflow answer) uses.
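Conceptually, the "Used By" check boils down to scanning Pods for volumes that reference a claim. A rough sketch of that idea on plain structs, assuming the Pods and their claim names have already been listed (`pod` and `unusedClaims` are hypothetical names, not Kubernetes client types):

```go
package main

// pod is a stand-in listing the PersistentVolumeClaim names a Pod mounts.
type pod struct {
	Name   string
	Claims []string
}

// unusedClaims returns the claims that no Pod references, preserving the
// order of the input.
func unusedClaims(claims []string, pods []pod) []string {
	inUse := make(map[string]bool)
	for _, p := range pods {
		for _, c := range p.Claims {
			inUse[c] = true
		}
	}
	var unused []string
	for _, c := range claims {
		if !inUse[c] {
			unused = append(unused, c)
		}
	}
	return unused
}
```

A check like this could eventually live in the dispatcher itself, but the shell pipeline was enough for a daily cron job.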

Stateless agents

So far, we have already seen a drastic reduction in tool-related flakes in CI, and the switch to stateless agents has helped us maintain confidence that remaining issues are not related to botched state and poor isolation. There are probably other mechanisms for maintaining isolation between builds, but for our case this seemed to have the easiest migration path.

¹ A quick Sourcegraph search for "Used By" reveals this line as the source of the output. A custom getPodsForPVC is the source of the pods listed here, and looking for references reveals that no kubectl command exposes this functionality except kubectl describe, so lengthy script it is!

Extending Sourcegraph search

2022-04-10T00:00:00+00:00

Sourcegraph recently held a brief internal hackathon where we got to work on a variety of ideas related to our freshly minted “Sourcegraph use cases”. One idea that was raised was extending Sourcegraph’s core code search functionality to allow queries over search notebooks, a new product that enables live and persistent documentation based on code search, to aid in content discovery for onboarding.

The minimum viable product of this project was to implement the ability to do the following search within the Sourcegraph search language:

type:notebook my notebook query select:notebook.block.md
_____________ _________________ ________________________
       |                |                 └ render Markdown sections of the notebook match
       |                └ query string
       └ type filter

And render search notebooks (and/or selected “blocks”, or sections) within search results! For some context, this is what Sourcegraph’s code search results usually look like:

And this is what search notebooks look like, with each section being a separate notebook block:

In this post, I’ll walk through a brief overview of what I learned about how Sourcegraph search works and what we did to implement an additional search and search result type!

Introducing a search job
Sending results over the wire
Querying the database for real results
Implementing notebook blocks results
Rendering search notebook results

A sneak peek of the end result:

End-to-end notebook block search!

Note that all the code internals mentioned in this post may change - you can view the Sourcegraph repository at 73a484e for an accurate picture of what the codebase looked like at the time! I’d also like to thank @tsenart who both proposed the original idea and worked with me through several brainstorming sessions to discuss the implementation.

Additionally, I am basically a complete outsider when it comes to our search internals, and the search code I interact with in this post was built by Sourcegraph’s fantastic search teams, so kudos¹ to the teams for making this hack possible in the first place!

Introducing a search job

The Sourcegraph docs page Life of a search query briefly goes over what happens when, for example, you enter a query into sourcegraph.com/search:

A client makes a request to (typically) the /.api/stream endpoint - see how it is done in the raycast-sourcegraph extension for a simplified example.
The query makes its way to sourcegraph-frontend, which converts the query text into a search plan composed of search jobs to execute against various backends (such as Zoekt).
Jobs get executed and the results get streamed back over the wire to the client.

For example, a typical query foobar will evaluate to a plan of jobs like the following, calling out to a variety of search backends (ZoektGlobalSearch, RepoSearch, ComputeExcludedRepos) within certain limits², imposed by jobs for enforcing those limits on child jobs.

flowchart TB
0([TIMEOUT])
  0---1
  1[20s]
  0---2
  2([LIMIT])
    2---3
    3[500]
    2---4
    4([PARALLEL])
      4---5
      5([ZoektGlobalSearch])
      4---6
      6([RepoSearch])
      4---7
      7([ComputeExcludedRepos])

The typical example here is a search job that reaches out to our Zoekt backends. A Job could also combine multiple search jobs, such as to run a set of jobs in parallel or to prioritise results from certain jobs before others.

The evaluated search job varies based on your search query - an exhaustive commit search (foo type:commit count:all) will create the following job instead, with a longer timeout and higher limit:

flowchart TB
0([TIMEOUT])
  0---1
  1[1m0s]
  0---2
  2([LIMIT])
    2---3
    3[99999999]
    2---4
    4([PARALLEL])
      4---5
      5([Commit])
      4---6
      6([ComputeExcludedRepos])

Each search job within these plans is implemented behind the Job interface:

// Job is an interface shared by all individual search operations in the
// backend (e.g., text vs commit vs symbol search are represented as different
// jobs) as well as combinations over those searches (run a set in parallel,
// timeout). Calling Run on a job object runs a search.
type Job interface {
  Run(context.Context, database.DB, streaming.Sender) (*search.Alert, error)
  Name() string
}
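To make the "combinations over those searches" part concrete, here is a pared-down sketch of how timeout and parallel jobs can wrap child jobs. It deliberately uses a simplified `job` interface of my own (string results sent through a callback, no database or alerts), so none of these types are the real Sourcegraph ones.

```go
package main

import (
	"context"
	"sync"
	"time"
)

// job is a simplified stand-in for the Job interface above.
type job interface {
	Run(ctx context.Context, send func(string)) error
	Name() string
}

// staticJob is a leaf job that emits fixed results.
type staticJob struct {
	name    string
	results []string
}

func (s staticJob) Name() string { return s.name }
func (s staticJob) Run(ctx context.Context, send func(string)) error {
	for _, r := range s.results {
		if err := ctx.Err(); err != nil {
			return err // stop early once the context is cancelled or timed out
		}
		send(r)
	}
	return nil
}

// parallelJob runs all children concurrently and reports the first error.
type parallelJob struct{ children []job }

func (p parallelJob) Name() string { return "PARALLEL" }
func (p parallelJob) Run(ctx context.Context, send func(string)) error {
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		firstErr error
	)
	for _, child := range p.children {
		wg.Add(1)
		go func(j job) {
			defer wg.Done()
			err := j.Run(ctx, func(r string) {
				mu.Lock() // serialise sends so the parent sender need not be thread-safe
				defer mu.Unlock()
				send(r)
			})
			if err != nil {
				mu.Lock()
				if firstErr == nil {
					firstErr = err
				}
				mu.Unlock()
			}
		}(child)
	}
	wg.Wait()
	return firstErr
}

// timeoutJob bounds how long its child may run.
type timeoutJob struct {
	timeout time.Duration
	child   job
}

func (t timeoutJob) Name() string { return "TIMEOUT" }
func (t timeoutJob) Run(ctx context.Context, send func(string)) error {
	ctx, cancel := context.WithTimeout(ctx, t.timeout)
	defer cancel()
	return t.child.Run(ctx, send)
}

// collect runs a job and gathers everything it sends.
func collect(j job) ([]string, error) {
	var out []string
	err := j.Run(context.Background(), func(r string) { out = append(out, r) })
	return out, err
}
```

Because every combinator is itself a job, plans like the TIMEOUT → LIMIT → PARALLEL trees shown earlier are just nested values of the same interface.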

So how do these jobs in the query plan get created? Poking around for constructors of the Job interface reveals (I think) the following flow for Job creation after a query.Plan is created (primarily with query.Pipeline, which handles query parsing, validation, transformation, and so on):

graph TD
  FromExpandedPlan --> ToEvaluateJob

  ToEvaluateJob --> ToSearchJob
  ToEvaluateJob -- "has pattern (AND or OR)" --> toPatternExpressionJob

  toPatternExpressionJob --> ToSearchJob
  toPatternExpressionJob --> toOrJob
  toPatternExpressionJob --> toAndJob

  toOrJob --> toPatternExpressionJob
  toAndJob --> toPatternExpressionJob

  ToSearchJob --> Job
  ToSearchJob -- has pattern --> optimizeJobs
  optimizeJobs --> Job

The ToSearchJob function appears to handle the bulk of search job creation, with the additional layers applying a variety of processing.

// ToSearchJob converts a query parse tree to the _internal_ representation
// needed to run a search routine. To understand why this conversion matters, think
// about the fact that the query parse tree doesn't know anything about our
// backends or architecture. It doesn't decide certain defaults, like whether we
// should return multiple result types (pattern matches content, or a file name,
// or a repo name). If we want to optimise a Sourcegraph query parse tree for a
// particular backend (e.g., skip repository resolution and just run a Zoekt
// query on all indexed repositories) then we need to convert our tree to
// Zoekt's internal inputs and representation. These concerns are all handled by
// toSearchJob.
func ToSearchJob(jargs *Args, q query.Q, db database.DB) (Job, error) {
  b, err := query.ToBasicQuery(q)
  if err != nil {
    return nil, err
  }
  types, _ := q.StringValues(query.FieldType)
  resultTypes := search.ComputeResultTypes(types, b.PatternString(), jargs.SearchInputs.PatternType)

  // ...

  var requiredJobs, optionalJobs []Job
  addJob := func(required bool, job Job) {
    if required {
      requiredJobs = append(requiredJobs, job)
    } else {
      optionalJobs = append(optionalJobs, job)
    }
  }

  // ... various conditional calls to addJob
}

So to start off, we add a new result type result.TypeNotebook = "notebook", and attach a new Job when a query includes type:notebook:

if resultTypes.Has(result.TypeNotebook) {
  notebookSearchJob := &notebook.SearchJob{
    PatternString: b.PatternString(),
  }
  addJob(true, notebookSearchJob)
}

For now, we want to create a stub implementation that sends a few hard-coded notebooks as results to the streaming.Sender provided in the (Job).Run interface. This requires implementing the result.Match interface:

type Match interface {
  ResultCount() int

  // Limit truncates the match such that, after limiting,
  // `Match.ResultCount() == limit`. It should never be called with
  // `limit <= 0`, since a single match cannot be truncated to zero results.
  Limit(int) int

  Select(filter.SelectPath) Match
  RepoName() types.MinimalRepo

  // Key returns a key which uniquely identifies this match.
  Key() Key
}

Right off the bat, it becomes clear that Sourcegraph’s search internals are heavily geared towards repository-oriented results, with the top-level RepoName being part of the Match interface. Repository matches, file content results, symbols, commits, diffs, and so on all return results that are part of a repository. Notebooks, on the other hand, are an entirely separate entity within the Sourcegraph application, and notebooks that are tracked in the database (it is also possible to create notebooks with .snb.md files within repositories, but we ignore that case for now) are not strictly associated with any repository.

This is even more evident within the Key type, which requires a unique combination of Repo, Rev, Path, AuthorDate, Commit, and TypeRank - none of which are fields that we can use to uniquely identify a search notebook. We could use Path as the notebook name, but that’s not strictly unique either.

To work around these issues for now, we just return a zero-value RepoName and add a new field ID to the Key type:

type Key struct {
  // ...

  // ID is an arbitrary identifier that can be used to distinguish this result,
  // e.g. if the result type is not associated with a repository.
  ID string

  // ...
}

type NotebookMatch struct {
  ID int64

  Title     string
  Namespace string
  Private   bool
  Stars     int
}

func (n NotebookMatch) RepoName() types.MinimalRepo {
  // This result type is not associated with any repository.
  return types.MinimalRepo{}
}

func (n NotebookMatch) Limit(limit int) int {
  // Always represents one result and limit > 0 so we just return limit - 1.
  return limit - 1
}

func (n *NotebookMatch) URL() *url.URL {
  return &url.URL{Path: "/notebooks/" + n.marshalNotebookID()}
}

func (n *NotebookMatch) Key() Key {
  return Key{
    ID:       n.marshalNotebookID(),
    TypeRank: rankRepoMatch,
  }
}

// other interface functions no-op for now
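The reason `Limit` returns `limit - 1` becomes clearer when looking at how a caller might apply a limit across a list of matches: each match consumes part of the budget and hands back what remains. The following is a simplified sketch with stand-in types of my own (`match`, `single`, `multi`, `limitMatches`), not the actual Sourcegraph implementation.

```go
package main

// match is a pared-down stand-in for result.Match.
type match interface {
	ResultCount() int
	Limit(limit int) int // truncates to at most limit results, returns what is left
}

// single always represents exactly one result, like NotebookMatch.
type single struct{}

func (single) ResultCount() int    { return 1 }
func (single) Limit(limit int) int { return limit - 1 }

// multi holds several results, similar to multiple lines in a file match.
type multi struct{ lines []string }

func (m *multi) ResultCount() int { return len(m.lines) }
func (m *multi) Limit(limit int) int {
	if len(m.lines) > limit {
		m.lines = m.lines[:limit]
		return 0
	}
	return limit - len(m.lines)
}

// limitMatches truncates matches so the total result count is at most limit.
func limitMatches(matches []match, limit int) []match {
	for i, m := range matches {
		if limit <= 0 {
			return matches[:i] // budget exhausted, drop the rest
		}
		limit = m.Limit(limit)
	}
	return matches
}
```

The loop never calls `Limit` with a value of zero or less, which matches the contract documented on the interface.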

With our new types, we can create a stub job for searching search notebooks:

type SearchJob struct {}

func (s *SearchJob) Run(ctx context.Context, db database.DB, stream streaming.Sender) (*search.Alert, error) {
  stream.Send(streaming.SearchEvent{
    Results: result.Matches{
      &result.NotebookMatch{
        Title:     "FOOBAR",
        Namespace: "sourcegraph",
        ID:        1,
        Stars:     64,
        Private:   false,
      },
      &result.NotebookMatch{
        Title:     "BAZ",
        Namespace: "robert",
        ID:        2,
        Stars:     0,
        Private:   true,
      },
    },
  })
  return nil, nil
}

func (*SearchJob) Name() string { return "NotebookSearch" }

The workarounds above caused some funky behaviour, such as repository permissions post-processing rejecting notebook results as not being associated with a repository the current actor (user) has access to, so I just hacked in a condition to ignore zero-value RepoNames in those checks to avoid dropping our notebook results.

We can test the evaluation of the query type:notebook select:notebook.block.md foobar to see our new search job type being registered (after implementing the appropriate printers):

flowchart TB
0([TIMEOUT])
  0---1
  1[20s]
  0---2
  2([LIMIT])
    2---3
    3[500]
    2---4
    4([SELECT])
      4---5
      5[notebook.block.md]
      4---6
      6([PARALLEL])
        6---7
        7([NotebookSearch])
        6---8
        8([ComputeExcludedRepos])

In this case, the select: term is just thrown in to demonstrate that it’s a job that occurs on top of a child job, which contains the NotebookSearch job we created. This will be important later!

Sending results over the wire

That’s not the end of it! Distinct from plans, jobs, and matches, we also have event types, which are the types that get transmitted over the wire to search clients.

For the most part, this is a very thin layer that just simplifies the internal match types for consumption, and hydrates events with repository metadata from a cache (such as how many stars the associated repository has, and when the repository was last updated) or decorations. For our new notebook results, we don’t really need to support any of that yet - we can simply map results more or less directly to a new event type.

func fromNotebook(notebook *result.NotebookMatch) *streamhttp.EventNotebookMatch {
  return &streamhttp.EventNotebookMatch{
    Type:      streamhttp.NotebookMatchType,
    ID:        notebook.Key().ID,
    Title:     notebook.Title,
    Namespace: notebook.Namespace,
    URL:       notebook.URL().String(),
    Stars:     notebook.Stars,
    Private:   notebook.Private,
  }
}

At this point, we basically have everything we need to see our results in the API results! We can confirm by spinning up Sourcegraph locally with sg start, executing a search, and inspecting the response of the network request to /.api/stream within a browser for our placeholder notebook results:

Look closely at the 'matches' entry for our hard-coded notebooks!

Querying the database for real results

Notebooks live in the Sourcegraph database, so to replace our stub results we can make a query to look for notebooks that returns relevant matches based on the provided query string.

SELECT
  notebooks.id,
  notebooks.title,
  NOT public as private, -- invert for consistency with other match types

  -- apply post-processing after query to merge namespace_user and  namespace_org into a
  -- single 'Namespace' field (only one can be set at a time)
  users.username as namespace_user,
  orgs.name as namespace_org,

  (
    SELECT COUNT(*)
    FROM notebook_stars
    WHERE notebook_id = notebooks.id
  ) as stars
FROM
  notebooks
  LEFT JOIN users on users.id = notebooks.namespace_user_id
  LEFT JOIN orgs on orgs.id = notebooks.namespace_org_id
WHERE
  (%s) -- permission conditions
  AND (%s) -- query conditions
ORDER BY
  stars DESC
LIMIT
  25
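The post-processing mentioned in the SQL comment, merging the two nullable namespace columns into one field, could look something like this. It is a minimal sketch: `namespaceOf` is a hypothetical helper, and it assumes the columns were scanned into `*string` values so that NULL arrives as nil.

```go
package main

// namespaceOf merges the namespace_user and namespace_org columns into a
// single namespace. Only one of the two is expected to be set at a time;
// if both are nil, the result is empty.
func namespaceOf(user, org *string) string {
	if user != nil {
		return *user
	}
	if org != nil {
		return *org
	}
	return ""
}
```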

To generate query conditions, we use the notebook.SearchJob evaluated in ToSearchJob as the sole parameter. The idea is to extend SearchJob to contain all the parameters that can be used to adjust the generated query (such as pattern types, e.g. regexp, or additional fields, such as inclusion and exclusion of notebooks with notebook: and -notebook:, and so on). For now, we generate simple queries solely based on the PatternString parameter:

func makeQueryConds(job *SearchJob) *sqlf.Query {
  conds := []*sqlf.Query{}

  // Allow querying against the 'full title'
  const concatTitleQuery = "CONCAT(users.username, orgs.name, notebooks.title)"
  if job.PatternString != "" {
    titleQuery := "%(" + job.PatternString + ")%"
    conds = append(conds, sqlf.Sprintf("%s ILIKE %s",
      concatTitleQuery, titleQuery))
  }

  if len(job.PatternString) > 0 {
    // Query against notebook contents, embedded as a tsvector field.
    conds = append(conds, sqlf.Sprintf("notebooks.blocks_tsvector @@ to_tsquery('english', %s)",
      toPostgresTextSearchQuery(job.PatternString)))
  }

  if len(conds) == 0 {
    // If no conditions are present, append a catch-all condition to avoid a SQL syntax error
    conds = append(conds, sqlf.Sprintf("1 = 1"))
  }

  return sqlf.Join(conds, "\n OR")
}

The CONCAT means that we cannot use indexes to hasten the query, but this is a hackathon so oh well. I decided to keep it in because a query for $namespace $topic felt like a very natural query to want to make, and I wanted the demo to support that.
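The toPostgresTextSearchQuery helper used above is not shown here. As a guess at what such a helper might do, the sketch below splits the pattern into words, drops anything that could be read as tsquery syntax, and requires all words to match. The name `toTextSearchQuery` and the AND semantics are assumptions, not the real implementation.

```go
package main

import (
	"strings"
	"unicode"
)

// toTextSearchQuery converts free-form input into a string suitable for
// to_tsquery by keeping only runs of letters and digits and joining them
// with the AND operator. Hypothetical stand-in for toPostgresTextSearchQuery.
func toTextSearchQuery(pattern string) string {
	words := strings.FieldsFunc(pattern, func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
	return strings.Join(words, " & ")
}
```

Stripping punctuation matters because to_tsquery treats characters such as `&`, `|`, `!`, `:` and parentheses as operators, so passing raw user input through can produce syntax errors.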

After writing a bit more boilerplate to execute the database query and scan the resulting rows, we can update our search job to return real results instead:

func (s *SearchJob) Run(ctx context.Context, db database.DB, stream streaming.Sender) (*search.Alert, error) {
  store := Search(db)
  notebooks, err := store.SearchNotebooks(ctx, s)
  if err != nil {
    return nil, errors.Wrap(err, "NotebookSearch")
  }
  matches := make([]result.Match, len(notebooks))
  for i, n := range notebooks {
    matches[i] = n
  }
  stream.Send(streaming.SearchEvent{
    Results: matches,
  })
  return nil, nil
}

We can test this out by creating a few notebooks in our local Sourcegraph instance and inspecting the network requests in-browser again to see real notebooks being returned!

Implementing notebook blocks results

Seeing the notebook titles that match your query is great and all, but to demonstrate the potential of this capability we wanted to make sure users can also see notebook content results - in other words, the matching notebook blocks - for their query.

For now, we decided to implement this such that notebook blocks only get returned with the select:notebook.block parameter. The Sourcegraph query language already features selections like select:repo or select:commit.diff.added, so this approach felt like it fitted in with how other search types are implemented.

Selections are part of the Match interface we previously implemented, and they work via selectJob, which wraps the streaming.Sender with another streaming.Sender that calls Select on each result it receives before passing it to the underlying stream.
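The wrapping described here can be sketched in a few lines. This uses simplified stand-ins of my own for result.Match and streaming.Sender (`match`, `sender`, `senderFunc`, and a toy `note` type), so it illustrates the shape of the mechanism rather than the real selectJob.

```go
package main

// match and sender are pared-down stand-ins for result.Match and
// streaming.Sender.
type match interface {
	Select(path []string) match
}

type sender interface {
	Send(matches []match)
}

// senderFunc adapts a plain function to the sender interface.
type senderFunc func([]match)

func (f senderFunc) Send(m []match) { f(m) }

// withSelect wraps parent so that every match is passed through Select
// first; matches for which Select returns nil are dropped.
func withSelect(path []string, parent sender) sender {
	return senderFunc(func(matches []match) {
		selected := make([]match, 0, len(matches))
		for _, m := range matches {
			if s := m.Select(path); s != nil {
				selected = append(selected, s)
			}
		}
		parent.Send(selected)
	})
}

// note is a toy match: it supports 'notebook' and, when it has blocks,
// 'notebook.block'.
type note struct {
	title  string
	blocks []string
}

func (n note) Select(path []string) match {
	if len(path) == 0 || path[0] != "notebook" {
		return nil
	}
	if len(path) >= 2 && path[1] == "block" && len(n.blocks) == 0 {
		return nil
	}
	return n
}
```

The child job never needs to know a selection is in play: it sends whole matches, and the wrapper decides what actually reaches the client.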

This means that all we have to do is also query for blocks within our notebooks database query, and only expose the blocks within the Select implementation. To start off, we extend our NotebookMatch with a Blocks field, and implement Select such that we generate a new NotebookBlocksMatch type:

type NotebookMatch struct {
  // ... as before

  Blocks NotebookBlocks `json:"-"`
}

// ... as before

func (n *NotebookMatch) Select(path filter.SelectPath) Match {
  // Only support 'select:notebook.*' on this result type
  if path.Root() != filter.Notebook {
    return nil
  }

  switch len(path) {
  case 1:
    return n // This is just 'select:notebook', so return self

  case 2, 3: // Support 'select:notebook.block' and 'select:notebook.block.*'
    if path[1] == "block" {
      if len(n.Blocks) == 0 {
        return nil // No results!
      }

      return (&NotebookBlocksMatch{
        Notebook: *n,
        Blocks:   n.Blocks,
      }).Select(path) // Allow blocks to continue selecting for 'select:notebook.block.*'
    }
  }

  return nil
}

To support select:notebook.block.$TYPE, where $TYPE is a block type (such as Markdown, query, symbol, and so on), the NotebookBlocksMatch type must also implement Select to only provide blocks of the requested type:

func (n *NotebookBlocksMatch) Select(path filter.SelectPath) Match {
  // Only support 'select:notebook.*' on this result type
  if path.Root() != filter.Notebook {
    return nil
  }

  switch len(path) {
  case 2:
    if path[1] == "block" {
      return n // This is just 'select:notebook.block', so return self
    }

  case 3:
    // Filter by the requested block type, which is the third path parameter. For example,
    // 'select:notebook.block.md' will filter for blocks of type 'md'.
    blockType := path[2]
    var blocks NotebookBlocks
    for _, b := range n.Blocks {
      if b["type"] == blockType {
        blocks = append(blocks, b)
      }
    }
    if len(blocks) == 0 {
      return nil // No results!
    }
    return &NotebookBlocksMatch{
      Notebook: n.Notebook,
      Blocks:   blocks,
    }
  }

  return nil
}

And as before, we need to implement an event type EventNotebookBlockMatch and the relevant adapters as well.

func fromNotebookBlocks(blocks *result.NotebookBlocksMatch) *streamhttp.EventNotebookBlockMatch {
  return &streamhttp.EventNotebookBlockMatch{
    Type:     streamhttp.NotebookBlockMatchType,
    Notebook: *fromNotebook(&blocks.Notebook),
    Blocks:   blocks.Blocks,
  }
}

For the database layer, we now need to add blocks to our result type. Blocks are currently stored as a JSON blob within the notebooks.blocks column, so adding that to our SELECT and including it in the result scan is fairly straightforward.

However, this does mean that we can’t only select relevant blocks within the database query. A better long-term solution to this is likely to split notebooks.blocks out into a separate table and join it at query time, but that’s a lot of work for a hackathon so I decided to go for a cheap hack: post-filtering! This isn’t too bad for now because the notebooks.blocks_tsvector @@ to_tsquery in our query conditions means that the returned notebooks are likely to have a matching block, but it definitely isn’t very pretty.

Even worse, blocks of various types have varying shapes (i.e. there’s no single block.text field we can filter on), and I didn’t want to special-case each block type for now. A closer look at notebooks.blocks_tsvector reveals it is backed by a magic Postgres feature that indexes all fields of type string within the notebooks.blocks JSON:

ALTER TABLE
  notebooks
ADD
  COLUMN
IF NOT EXISTS
  blocks_tsvector TSVECTOR
GENERATED ALWAYS AS
  (jsonb_to_tsvector('english', blocks, '["string"]')) STORED;

It is a neat implementation that does not require any knowledge of blocks fields, but sadly there does not seem to be an equivalent function built with Go for us to post-filter with. So I just marshal each block as JSON and do a regexp search over the whole thing:

func (s *notebooksSearchStore) SearchNotebooks(ctx context.Context, job *SearchJob) ([]*result.NotebookMatch, error) {
  // ... query for notebooks

  // do our post-filtering
  if len(job.PatternString) > 0 {
    searchRe, err := regexp.Compile("(?i).*(" + job.PatternString + ").*")
    if err != nil {
      return nil, err
    }
    for _, n := range notebooks {
      var matchBlocks result.NotebookBlocks
      // filter notebook blocks
      for _, block := range n.Blocks {
        b, err := json.Marshal(block)
        if err != nil {
          continue
        }
        // regexp match over the marshalled block
        if searchRe.Match(b) {
          matchBlocks = append(matchBlocks, block)
        }
      }
      n.Blocks = matchBlocks
    }
  }

  return notebooks, nil
}

Hey, it’s a hackathon!
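One caveat with the snippet above: the pattern is concatenated into the regular expression unescaped, so a pattern such as foo( would fail to compile. If the pattern is meant to be matched literally (an assumption, since the search pattern could also be intended as a regexp), a possible hardening is to escape it with regexp.QuoteMeta first. A minimal sketch, where Block and filterBlocks are hypothetical stand-ins rather than the actual Sourcegraph types:

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
)

// Block is a hypothetical stand-in for a notebook block: loosely typed JSON.
type Block map[string]any

// filterBlocks keeps the blocks whose JSON encoding contains pattern,
// ignoring case. The pattern is escaped, so it is matched literally.
func filterBlocks(blocks []Block, pattern string) ([]Block, error) {
	re, err := regexp.Compile("(?i)" + regexp.QuoteMeta(pattern))
	if err != nil {
		return nil, err
	}
	var matched []Block
	for _, b := range blocks {
		raw, err := json.Marshal(b)
		if err != nil {
			continue // skip blocks that cannot be encoded
		}
		if re.Match(raw) {
			matched = append(matched, b)
		}
	}
	return matched, nil
}

func main() {
	blocks := []Block{
		{"type": "md", "markdownInput": map[string]any{"text": "Call foo( to start"}},
		{"type": "query", "queryInput": map[string]any{"text": "repo:sourcegraph bar"}},
	}
	matched, err := filterBlocks(blocks, "FOO(")
	if err != nil {
		panic(err)
	}
	fmt.Println(len(matched), matched[0]["type"])
}
```

Like the original, this matches against the whole marshalled block, so field names such as type can also produce matches.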

Similarly to before, we can verify this works end-to-end by running a type:notebook select:notebook.block query and inspecting the response:

Rendering search notebook results

Rendering results in the network tab is great and all, but we want to demo something pretty as well! We start off by adding types in the web app that correspond to our new event types:

export type SearchType = /* ... */ | 'notebook' | null
export type SearchMatch = /* ... */ | NotebookMatch | NotebookBlocksMatch

export interface NotebookMatch {
    type: 'notebook'
    id: string
    title: string
    namespace: string
    url: string
    stars?: number
    private: boolean
}

export interface NotebookBlocksMatch {
    type: 'notebook.block'
    notebook: NotebookMatch
    // TODO lots of variants of these types, leave as any for now and massage the data
    // as needed
    blocks: any[]
}

To extend type: completions in the search bar, we update FILTERS:

export const FILTERS: Record<NegatableFilter, NegatableFilterDefinition> &
    Record<Exclude<FilterType, NegatableFilter>, BaseFilterDefinition> = {
    /* ... */
    [FilterType.type]: {
        description: 'Limit results to the specified type.',
        discreteValues: () => [/* ... */, 'notebook'].map(value => ({ label: value })),
    },
    /* ... */
}

And similarly for select: completions, we update SELECTORS:

export const SELECTORS: Access[] = [
  /* ... */
  {
    name: 'notebook',
    fields: [
      {
        name: 'block',
        fields: [{ name: 'md' }, { name: 'query' }, { name: 'file' }, { name: 'symbol' }],
      },
    ],
  },
]

Suggestions!

And now things get a bit hacky. For plain notebook results, we can leverage the same components used for repository matches with reasonable results by extending the StreamingSearchResultsList component:

export const StreamingSearchResultsList: React.FunctionComponent<StreamingSearchResultsListProps> = ({
    /* ... */
}) => {
    /* ... */

    const renderResult = useCallback(
        (result: SearchMatch, index: number): JSX.Element => {
            switch (result.type) {
                /* ... */
                case 'notebook':
                    return (
                        <SearchResult
                            icon={NotebookIcon}
                            result={result}
                            repoName={`${result.namespace} / ${result.title}`}
                            platformContext={platformContext}
                            onSelect={() => logSearchResultClicked(index, 'notebook')}
                        />
                    )
            }
        }
    )

    return (/* ... */)
}

For notebook blocks, things started to get really hacky. I had originally expected to just render the parameters encoded in the block (for example, the query in a query block). However, @tsenart pointed out that maybe we could render the blocks exactly as they are rendered within a notebook. I thought this would be brilliant! Surely it would be as easy as simply importing the correct component and providing it with the blocks in a block match - how messy could this be?

Well, using NotebookComponent ended up looking like this:

  case 'notebook.block':
      return (
          <ResultContainer
              icon={NotebookIcon}
              title={
                  <Link to={result.notebook.url}>
                      {result.notebook.namespace} / {result.notebook.title}
                  </Link>
              }
              collapsible={false}
              defaultExpanded={true}
              resultType={result.type}
              onResultClicked={noop}
              expandedChildren={
                  <div className={styles.notebookBlockResult}>
                      <NotebookComponent
                          key={`${result.notebook.id}-blocks`}
                          isEmbedded={true}
                          noRunButton={true}
                          // TODO HACK: DB, component, and GraphQL block types
                          // don't align so we need to massage it into a type
                          // this component finds acceptable
                          blocks={result.blocks.map(b => {
                              if (b.queryInput) {
                                  return { ...b, input: { query: b.queryInput.text } }
                              }
                              return {
                                  ...b,
                                  input:
                                      b.markdownInput || b.fileInput || b.symbolInput || b.computeInput,
                              }
                          })}
                          authenticatedUser={null}
                          globbing={false}
                          isReadOnly={true}
                          extensionsController={extensionsController}
                          hoverifier={hoverifier}
                          platformContext={platformContext}
                          exportedFileName={result.notebook.title}
                          onSerializeBlocks={noop}
                          onCopyNotebook={() => NEVER}
                          streamSearch={() => NEVER} // TODO make this jump to new search page instead
                          isLightTheme={isLightTheme}
                          telemetryService={telemetryService}
                          fetchHighlightedFileLineRanges={fetchHighlightedFileLineRanges}
                          searchContextsEnabled={searchContextsEnabled}
                          settingsCascade={settingsCascade}
                          isSourcegraphDotCom={isSourcegraphDotCom}
                          showSearchContext={showSearchContext}
                      />
                  </div>
              }
          />
      )

Gnarly, eh? All these fields required me to do all sorts of things to StreamingSearchResultsListProps to get the props needed. Full disclaimer: I am far from a professional when it comes to web apps and React, so I’m sure there’s a better way to do this than prop drilling, but oh well. The NotebookComponent also doesn’t feel like it was meant for this kind of import and use, given notebooks is a pretty new product and the whole philosophy of iterate fast and polish later and all.

That said, once the compiler stopped complaining the results were great - everything kind of just worked, and looked pretty good after some CSS adjustments! Even running query blocks worked nicely.

Of course, this raises the question - what if you make a notebook search, within a search notebook? Well, that works too!

Search-notebooks-ception?

You can also check out a brief final demo I made of the state of the project at the end of the hackathon for how this all ties together:

You can also check out the (messy) (and incomplete) code here: sourcegraph#33316

Wrap-up

Thanks for reading! I hope this was an interesting glimpse at how search works at Sourcegraph. I’m not sure if this will ever make it into the product, but regardless, this was a really fun foray into a part of the codebase I’ve only interacted with at a surface level through my Sourcegraph for Raycast extension project. Learning about the abstractions used to power code search (and more!) was fascinating, and a nice change of pace from my usual work!

About Sourcegraph

Interested in joining? We’re hiring!

So somewhat embarrassingly, on one of my iterations of this project I complained a bit about the tedium of the many layers in the search backend, at which point I was educated by Comby (structural search) creator @rvantonder on how cleaning up the search internals is an ongoing effort and has improved significantly over the past year. One of my biggest takeaways from this project is that search is a very complex system, and that building a suitable abstraction for the myriad types of search that Sourcegraph already features is a monumental undertaking! ↩
By default, Sourcegraph search is limited to optimise for fast results. The extensiveness of a search is configurable through the count: and timeout: parameters, as well as a special count:all mode, as described in our documentation: Exhaustive search. ↩

Self-documenting and self-updating tooling

2022-02-20T00:00:00+00:00

In a rapidly moving organization, documentation drift is inevitable as the underlying tools undergo changes to suit changing needs, especially for internal tools where leaning on tribal knowledge can often be more efficient in the short term. As each component grows in complexity, however, this introduces debt that makes for a confusing onboarding process and a poor developer experience, and makes building integrations more difficult.

One approach for keeping documentation debt at bay is to choose tools that come with automated writing of documentation built-in. You can design your code in such a way that code documentation generators can also double as user guides (which I explored with my rewrite of the UBC Launch Pad website’s generated configuration documentation), or specifications that can generate both code and documentation (which I tried with Inertia’s API reference). Some libraries, like Cobra, a Go library for building CLIs, can also generate reference documentation for commands (such as Inertia’s CLI reference). This allows you to meet your users where they are - for example, the less technically oriented can check out a website while the more hands-on users can find what they need within the code or in the command line - while maintaining a single source of truth that keeps everything up to date.

Of course, in addition to generated documentation you do still need to write documentation to tie the pieces together - for example, the UBC Launch Pad website still had a brief intro guide and we did put together a usage guide for Inertia, but generated documentation helps you ensure the nitty gritty stays up to date, and focus on high-level guidance in your handcrafted writing.

At Sourcegraph, I’ve been exploring avenues for taking this even further. Once you move away from off-the-shelf generators and invest in leveraging your code to generate exactly what you need, you can build a pretty neat ecosystem of not just documentation generators, but also interesting integrations and tooling that is always up to date by design. In this article, I’ll talk about some of the things we’ve built with this approach in mind: Sourcegraph’s observability ecosystem and continuous integration pipelines.

Observability ecosystem

The Sourcegraph product has shipped with Prometheus metrics and Grafana dashboards for quite a while, used both by Sourcegraph for Sourcegraph Cloud and by self-hosted customers to operate Sourcegraph instances. These have been created from our own Go-based specification since before I started working here. The spec would look something like this (truncated for brevity):

func GitServer() *Container {
    return &Container{
        Name:        "gitserver",
        Title:       "Git Server",
        Description: "Stores, manages, and operates Git repositories.",
        Groups: []Group{{
            Title: "General",
            Rows: []Row{{
                // Each dashboard panel and alert is associated with an "observable"
                Observable{
                    Name:        "disk_space_remaining",
                    Description: "disk space remaining by instance",
                    Query:       `(src_gitserver_disk_space_available / src_gitserver_disk_space_total)*100`,
                    // Configure Prometheus alerts
                    Warning: Alert{LessOrEqual: 25},
                    // Configure Grafana panel
                    PanelOptions: PanelOptions().LegendFormat("{{instance}}").Unit(Percentage),
                    // Some options, like this one, make changes to both how the panel
                    // is rendered as well as when the alert fires
                    DataMayNotExist: true,
                    // Configure documentation about possible solutions if the alert fires
                    PossibleSolutions: `
                        - **Provision more disk space:** Sourcegraph will begin deleting...
                    `,
                },
            }},
        }},
    }
}

Explore what our monitoring generator looked like in Sourcegraph 3.17 (circa mid-2020)

From here, a program will import the definitions and generate the appropriate Prometheus recording rules, Grafana dashboard specs, and a simple customer-facing “alert solutions” page. Any changes that engineers made to their monitoring definitions using the specification would automatically update everything that needed to be updated, no additional work needed.
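As a rough illustration of the idea (not the actual generator), the sketch below derives two outputs from one definition: a Prometheus-style rule and a documentation entry. The Observable fields, Rule type, rule naming scheme, and helper names here are all hypothetical and pared down, and a plain threshold expression stands in for whatever rules the real generator emits.

```go
package main

import "fmt"

// Observable is a hypothetical, simplified version of the spec type above.
type Observable struct {
	Name              string
	Description       string
	Query             string
	WarnAtOrBelow     float64
	PossibleSolutions string
}

// Rule holds the fields of a Prometheus-style rule derived from an observable.
type Rule struct {
	Alert  string
	Expr   string
	Labels map[string]string
}

// warningRule derives a warning-level rule from the observable's query and threshold.
func warningRule(service string, o Observable) Rule {
	return Rule{
		Alert:  fmt.Sprintf("warning_%s_%s", service, o.Name),
		Expr:   fmt.Sprintf("(%s) <= %v", o.Query, o.WarnAtOrBelow),
		Labels: map[string]string{"level": "warning", "service_name": service},
	}
}

// docsEntry renders the matching section of a Markdown "alert solutions" page.
func docsEntry(service string, o Observable) string {
	return fmt.Sprintf("## %s: %s\n\n%s\n\nPossible solutions:\n\n%s\n",
		service, o.Name, o.Description, o.PossibleSolutions)
}

func main() {
	o := Observable{
		Name:              "disk_space_remaining",
		Description:       "disk space remaining by instance",
		Query:             "(src_gitserver_disk_space_available / src_gitserver_disk_space_total)*100",
		WarnAtOrBelow:     25,
		PossibleSolutions: "- Provision more disk space",
	}
	rule := warningRule("gitserver", o)
	fmt.Printf("alert: %s\nexpr: %s\n\n", rule.Alert, rule.Expr)
	fmt.Print(docsEntry("gitserver", o))
}
```

Because both outputs are computed from the same Observable value, editing the definition updates the rule and the documentation together.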

For example, the Grafana dashboard spec generation automatically calculates appropriate widths and heights for each panel you add so that they are evenly distributed, adds lines that indicate Prometheus alert thresholds, ensures a uniform look and feel, and more.
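The panel sizing can be sketched as simple arithmetic over Grafana's 24-column dashboard grid. This is an assumed, simplified layout rule rather than the generator's actual logic, and layoutRow, GridPos, and the panel height used are hypothetical:

```go
package main

import "fmt"

// gridWidth is the number of columns in Grafana's dashboard grid.
const gridWidth = 24

// GridPos mirrors the gridPos object found in Grafana's dashboard JSON.
type GridPos struct{ X, Y, W, H int }

// layoutRow spreads n panels evenly across one row that starts at grid row y.
func layoutRow(n, y, height int) []GridPos {
	if n <= 0 {
		return nil
	}
	width := gridWidth / n
	positions := make([]GridPos, 0, n)
	for i := 0; i < n; i++ {
		positions = append(positions, GridPos{X: i * width, Y: y, W: width, H: height})
	}
	return positions
}

func main() {
	// Three panels in the first row, each 5 grid units tall.
	for _, pos := range layoutRow(3, 0, 5) {
		fmt.Printf("%+v\n", pos)
	}
}
```

With a rule like this, adding a fourth observable to a row narrows every panel in that row without anyone editing dashboard JSON by hand.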

I loved this idea, so I ran with it and worked on a series of changes that expanded the capabilities of this system significantly. Today, our monitoring specification powers:

Multiple reference pages: a revamped alerts reference and a page that focuses on background information about each dashboard panel, both of which customers and engineers at Sourcegraph can reference. These now also include information about which teams own which dashboards and alerts, to help customer support better triage support requests, and about how to easily silence alerts through our new integration with Alertmanager.

Grafana dashboards that now automatically include links to the generated documentation, annotation layers for generated alerts, improved alert overview graphs, and more.

Version and alert annotations in Sourcegraph's generated dashboards. Dashboards like these are automatically provided by defining observables using our monitoring specification, alongside everything else mentioned previously.

Prometheus integration that now generates more granular alert rules that include additional metadata such as the ID of the associated generated dashboard panel, the team that owns the alert, and more.
An entirely new Alertmanager integration (related blog post) that allows you to easily configure alert notifications via the Sourcegraph application, which automatically sets up the appropriate routes and configures messages to include relevant information for triaging alerts: a helpful summary, links to documentation, and links to the relevant dashboard panel in the time window of the alert. This leverages the aforementioned generated Prometheus metrics!

Automatically configured alert notification messages feature a helpful summary and links to diagnose the issue further for a variety of supported notification services, such as Slack and OpsGenie.

The API has changed as well to improve its flexibility and enable many of the features listed above. Nowadays, a monitoring specification might look like this (also truncated for brevity):

// Definitions are separated from the API so everything is imported from 'monitoring' now,
// which allows for a more tightly controlled API.
func GitServer() *monitoring.Container {
    return &monitoring.Container{
        Name:        "gitserver",
        Title:       "Git Server",
        Description: "Stores, manages, and operates Git repositories.",
        // Easily create template variables without diving into the underlying JSON spec
        Variables: []monitoring.ContainerVariable{{
            Label:        "Shard",
            Name:         "shard",
            OptionsQuery: "label_values(src_gitserver_exec_running, instance)",
            Multi:        true,
        }},
        Groups: []monitoring.Group{{
            Title: "General",
            Rows: []monitoring.Row{{
                {
                    Name:        "disk_space_remaining",
                    Description: "disk space remaining by instance",
                    Query:       `(src_gitserver_disk_space_available / src_gitserver_disk_space_total)*100`,
                    // Alerting API expanded with additional options to leverage more
                    // Prometheus features
                    Warning: monitoring.Alert().LessOrEqual(25).For(time.Minute),
                    Panel: monitoring.Panel().LegendFormat("{{instance}}").
                        Unit(monitoring.Percentage).
                        // Functional configuration API that allows you to provide a
                        // callback to configure the underlying Grafana panel further, or
                        // use one of the shared options to share common options
                        With(monitoring.PanelOptions.LegendOnRight()),
                    // Owners can now be defined on observables, which allows support
                    // to help triage customer queries and is used internally to route
                    // pager alerts
                    Owner: monitoring.ObservableOwnerCoreApplication,
                    // Documentation fields are still around, but an 'Interpretation' can
                    // now also be provided for more obscure background on observables,
                    // especially if they aren't tied to an alert
                    PossibleSolutions: `
                        - **Provision more disk space:** Sourcegraph will begin deleting...
                    `,
                },
            }},
        }},
    }
}

Explore what our monitoring generator looks like today!

Since the specification is built on a typed language, the API itself is self-documenting in that authors of monitoring definitions can easily access what options are available and what each does through generated API docs or code intelligence available in Sourcegraph or in your IDE, making it very easy to pick up and work with.

Example Sourcegraph API docs of the monitoring API, though similar docs can also be generated by other language-specific tools.

We also now have a tool, sg, that enables us to spin up just the monitoring stack, complete with hot-reloading of Grafana dashboards and Prometheus configuration, with a single command: sg start monitoring. You can even easily test your dashboards against production metrics! This is all enabled by having a single tool and set of specifications as the source of truth for all our monitoring integrations.

This all comes together to form a cohesive monitoring development and usage ecosystem that is tightly integrated, encodes best practices, is self-documenting (both in the content it generates as well as the APIs available), and is easy to extend.

Learn more about our observability ecosystem in our developer documentation, and check out the monitoring generator source code here.

Continuous integration pipelines

At Sourcegraph, our core continuous integration pipelines are - you guessed it - generated! Our pipeline generator program analyses a build’s variables (changes, branch names, commit messages, environment variables, and more) in order to create a pipeline to run on our Buildkite agent fleet.

Typically, Buildkite pipelines are specified similarly to GitHub Actions workflows - by committing a YAML file to your repository that build agents pick up and run. This YAML file will specify what commands should get run over your codebase, and will usually support some simple conditions.

These conditions are not very ergonomic to specify, however, and will often be limited in functionality - so instead, we generate the entire pipeline on the fly:

steps:
  - group: "Pipeline setup"
    steps:
      - label: ':hammer_and_wrench: :pipeline: Generate pipeline'
        # Prioritise generating pipelines so that jobs can get generated and queued up as soon
        # as possible, so as to better assess pipeline load e.g. to scale the Buildkite fleet.
        priority: 10
        command: |
          echo "--- generate pipeline"
          go run ./enterprise/dev/ci/gen-pipeline.go | tee generated-pipeline.yml
          echo "--- upload pipeline"
          buildkite-agent pipeline upload generated-pipeline.yml
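To make the shape of such a generator concrete, here is a minimal sketch of a program that decides on steps in code and prints a pipeline to standard output, ready to be piped to buildkite-agent pipeline upload. It emits JSON (which Buildkite accepts alongside YAML) to stay within the standard library. The step labels, commands, and rules are invented for illustration and are not what gen-pipeline.go does.

```go
package main

import (
	"encoding/json"
	"os"
	"strings"
)

// Step and Pipeline cover only the fields this sketch needs.
type Step struct {
	Label   string `json:"label"`
	Command string `json:"command"`
}

type Pipeline struct {
	Steps []Step `json:"steps"`
}

// generate decides which steps to include. The rules here are made up.
func generate(branch string, changedFiles []string) Pipeline {
	p := Pipeline{Steps: []Step{}}
	for _, f := range changedFiles {
		if strings.HasSuffix(f, ".go") {
			p.Steps = append(p.Steps, Step{Label: ":go: Test", Command: "go test ./..."})
			break
		}
	}
	if branch == "main" {
		p.Steps = append(p.Steps, Step{Label: ":docker: Publish", Command: "./publish.sh"})
	}
	return p
}

func main() {
	// BUILDKITE_BRANCH is set by Buildkite; changed files are passed as arguments here.
	p := generate(os.Getenv("BUILDKITE_BRANCH"), os.Args[1:])
	enc := json.NewEncoder(os.Stdout)
	enc.SetIndent("", "  ")
	if err := enc.Encode(p); err != nil {
		panic(err)
	}
}
```

The conditions are ordinary Go code, so they can be unit tested and refactored like any other program, which is much harder to do with conditions embedded in a static YAML file.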

The pipeline generator has also been around at Sourcegraph since long before I joined, but I’ve since done some significant refactors to it, including refactoring some of its core functionality - what we call “run types” and “diff types”, which are used to determine the appropriate pipeline to generate for any given build. This allows us to do a ton of cool things.

First, some background on the technical details. A run type is specified as follows:

// RunTypeMatcher defines the requirements for any given build to be considered a build of
// this RunType.
type RunTypeMatcher struct {
    // Branch loosely matches branches that begin with this value, unless a different type
    // of match is indicated (e.g. BranchExact, BranchRegexp)
    Branch       string
    BranchExact  bool
    BranchRegexp bool
    // BranchArgumentRequired indicates the path segment following the branch prefix match is
    // expected to be an argument (does not work in conjunction with BranchExact)
    BranchArgumentRequired bool

    // TagPrefix matches tags that begin with this value.
    TagPrefix string

    // EnvIncludes validates if these key-value pairs are configured in environment.
    EnvIncludes map[string]string
}
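How a matcher like this might be evaluated can be sketched as follows. The struct is copied from above, but the Matches method is hypothetical: it is one interpretation of the field comments, and it assumes every configured field must match.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

type RunTypeMatcher struct {
	Branch                 string
	BranchExact            bool
	BranchRegexp           bool
	BranchArgumentRequired bool
	TagPrefix              string
	EnvIncludes            map[string]string
}

// Matches reports whether a build satisfies every requirement set on the matcher.
func (m RunTypeMatcher) Matches(branch, tag string, env map[string]string) bool {
	if m.Branch != "" {
		switch {
		case m.BranchExact:
			if branch != m.Branch {
				return false
			}
		case m.BranchRegexp:
			ok, err := regexp.MatchString(m.Branch, branch)
			if err != nil || !ok {
				return false
			}
		default:
			if !strings.HasPrefix(branch, m.Branch) {
				return false
			}
			// The remainder of the branch name is treated as the argument.
			if m.BranchArgumentRequired && strings.TrimPrefix(branch, m.Branch) == "" {
				return false
			}
		}
	}
	if m.TagPrefix != "" && !strings.HasPrefix(tag, m.TagPrefix) {
		return false
	}
	for key, want := range m.EnvIncludes {
		if env[key] != want {
			return false
		}
	}
	return true
}

func main() {
	dryRun := RunTypeMatcher{Branch: "main-dry-run/"}
	fmt.Println(dryRun.Matches("main-dry-run/my-change", "", nil))
	fmt.Println(dryRun.Matches("my-feature", "", nil))
}
```

Note that in this sketch a matcher with no fields set matches every build, so a generator would need to check the more specific matchers first.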

When matched, a RunType = iota is associated with the build, which can then be leveraged to determine what kinds of steps to include. For example:

Pull requests run a bare-bones pipeline generated from what has changed in your pull requests (read on to learn more) - this enables us to keep feedback loops short on pull requests.
Tagged release builds run our full suite of tests, and publish finalised images to our public Docker registries.
The main branch runs our full suite of tests, and publishes preview versions of our images to internal Docker registries. It also generates notifications that alert build authors if their builds have failed on main.
Similarly, a “main dry run” run type is available by pushing to a branch prefixed with main-dry-run/ - this runs almost everything that gets run on main. Useful for double-checking your changes will pass when merged.
Scheduled builds are run with specific environment variables for browser extension releases and release branch health checks.

A search notebook walkthrough of how run types are used!

A “diff type” is generated by a diff detector that can work similarly to GitHub Actions’ on.paths, but also enables a lot more flexibility. For example, we detect basic “Go” diffs like so:

if strings.HasSuffix(p, ".go") || p == "go.sum" || p == "go.mod" {
    diff |= Go
}

However, engineers can also define database migrations that might not change Go code - in these situations, we still want to run Go tests, and we also want to run migration tests. We can centralise this detection like this:

if strings.HasPrefix(p, "migrations/") {
    diff |= (DatabaseSchema | Go)
}

Our Diff = 1 << iota type is constructed by bit-shifting an iota type, so we can easily check for what diffs have been detected with diff&target != 0, which is done by a helper function, (*DiffType).Has.

A search notebook walkthrough of how diff types are used!
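Putting the pieces above together, a self-contained sketch of the bit-flag approach might look like the following. The Go and DatabaseSchema flags and the path rules come from the snippets above, while the Docs flag, the parseDiff helper, and the value receiver on Has are assumptions added for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// Diff is a set of flags, one bit per kind of change.
type Diff uint32

const (
	Go Diff = 1 << iota
	DatabaseSchema
	Docs // hypothetical extra flag
)

// Has reports whether any of the target flags are set in d.
func (d Diff) Has(target Diff) bool {
	return d&target != 0
}

// parseDiff folds a list of changed paths into a single Diff value.
func parseDiff(paths []string) Diff {
	var diff Diff
	for _, p := range paths {
		if strings.HasSuffix(p, ".go") || p == "go.sum" || p == "go.mod" {
			diff |= Go
		}
		if strings.HasPrefix(p, "migrations/") {
			diff |= DatabaseSchema | Go
		}
		if strings.HasSuffix(p, ".md") {
			diff |= Docs
		}
	}
	return diff
}

func main() {
	diff := parseDiff([]string{"migrations/frontend/123_up.sql"})
	fmt.Println(diff.Has(Go), diff.Has(DatabaseSchema), diff.Has(Docs))
}
```

Since each flag occupies its own bit, a single path can contribute several diff types at once, which is what lets a migration-only change still trigger Go tests.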

The programmatic generation approach allows for some complex step generation that would be very tedious to manage by hand. Take this example:

if diff.Has(changed.DatabaseSchema) {
    ops.Merge(operations.NewNamedSet("DB backcompat tests",
        addGoTestsBackcompat(opts.MinimumUpgradeableVersion)))
}

In this scenario, a group of checks (operations.NewNamedSet) is created to check that migrations being introduced are backwards-compatible. To make this check, we provide it MinimumUpgradeableVersion - a variable that is updated automatically by the Sourcegraph release tool to indicate what version of Sourcegraph all changes should be compatible with. The tests being added look like this:

func addGoTestsBackcompat(minimumUpgradeableVersion string) func(pipeline *bk.Pipeline) {
    return func(pipeline *bk.Pipeline) {
        buildGoTests(func(description, testSuffix string) {
            pipeline.AddStep(
                fmt.Sprintf(":go::postgres: Backcompat test (%s)", description),
                bk.Env("MINIMUM_UPGRADEABLE_VERSION", minimumUpgradeableVersion),
                bk.Cmd("./dev/ci/go-backcompat/test.sh "+testSuffix),
            )
        })
    }
}

buildGoTests is a helper that generates a set of commands to be run against each of the Sourcegraph repository’s Go packages. It is configured to split out more complex packages into separate jobs so that they can be run in parallel across multiple agents. Right now, the generated commands for addGoTestsBackcompat look like this:

 • DB backcompat tests
      • :go::postgres: Backcompat test (all)
      • :go::postgres: Backcompat test (enterprise/internal/codeintel/stores/dbstore)
      • :go::postgres: Backcompat test (enterprise/internal/codeintel/stores/lsifstore)
      • :go::postgres: Backcompat test (enterprise/internal/insights)
      • :go::postgres: Backcompat test (internal/database)
      • :go::postgres: Backcompat test (internal/repos)
      • :go::postgres: Backcompat test (enterprise/internal/batches)
      • :go::postgres: Backcompat test (cmd/frontend)
      • :go::postgres: Backcompat test (enterprise/internal/database)
      • :go::postgres: Backcompat test (enterprise/cmd/frontend/internal/batches/resolvers)

With just the pretty minimal configuration above, each step is generated with a lot of baked-in configuration, much of which is generated automatically for every build step we have:

  - agents:
      queue: standard
    command:
    - ./tr ./dev/ci/go-backcompat/test.sh only github.com/sourcegraph/sourcegraph-public-snapshot/internal/database
    env:
      MINIMUM_UPGRADEABLE_VERSION: 3.36.0
    key: gopostgresBackcompattestinternaldatabase
    label: ':go::postgres: Backcompat test (internal/database)'
    timeout_in_minutes: "60"

In this snippet, we have:

A default queue to run the job on - this can be feature-flagged to run against experimental agents.
The shared MINIMUM_UPGRADEABLE_VERSION variable that gets used for other steps as well, such as upgrade tests.
A generated key, useful for identifying steps and creating step dependencies.
Commands prefixed with ./tr: this script creates and uploads traces for our builds!
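The generated key in the snippet above is consistent with simply stripping everything except letters and digits from the step label. The sketch below reproduces that key under this assumption. stepKey is a hypothetical helper, and the real generator may derive keys differently.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// stepKey derives an identifier from a step label by keeping only letters and digits.
func stepKey(label string) string {
	var b strings.Builder
	for _, r := range label {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	fmt.Println(stepKey(":go::postgres: Backcompat test (internal/database)"))
	// prints "gopostgresBackcompattestinternaldatabase"
}
```

Deriving keys from labels means step authors never have to invent identifiers, though two labels that differ only in punctuation would collide under this rule.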

Build traces help visualise and track the performance of various pipeline steps. Uploaded traces are automatically linked from builds via Buildkite annotations for easy reference, and can also be queried directly in Honeycomb.

Features like the build step traces were implemented without having to make sweeping changes to pipeline configuration, thanks to the generated approach - we just had to adjust the generator to inject the appropriate scripting, and now it just works across all commands in the pipeline.

Additional functions are also available that tweak how a step is created. For example, with bk.AnnotatedCmd one can indicate that a step will generate annotations by writing to ./annotations - a wrapper script is configured to make sure these annotations get picked up and uploaded via Buildkite’s API:

// AnnotatedCmd runs the given command, picks up files left in the `./annotations`
// directory, and appends them to a shared annotation for this job. For example, to
// generate an annotation file on error:
//
//	if [ $EXIT_CODE -ne 0 ]; then
//		echo -e "$OUT" >./annotations/shfmt
//	fi
//
// Annotations can be formatted based on file extensions, for example:
//
//  - './annotations/Job log.md' will have its contents appended as markdown
//  - './annotations/shfmt' will have its contents formatted as terminal output on append
//
// Please be considerate about generating annotations, since they can cause a lot of
// visual clutter in the Buildkite UI. When creating annotations:
//
//  - keep them concise and short, to minimize the space they take up
//  - ensure they are actionable: an annotation should enable you, the CI user, to know
//    where to go and what to do next.
func AnnotatedCmd(command string, opts AnnotatedCmdOpts) StepOpt {
    var annotateOpts string
    // ... set up options
    // './an' is a script that runs the given command and uploads the exported annotations
    // with the given annotation options before exiting.
    annotatedCmd := fmt.Sprintf("./an %q %q %q",
        tracedCmd(command), fmt.Sprintf("%v", opts.IncludeNames), strings.TrimSpace(annotateOpts))
    return RawCmd(annotatedCmd)
}

The author of a pipeline step can then easily opt in to having their annotations uploaded by changing bk.Cmd(...) to bk.AnnotatedCmd(...). This allows all steps to easily create annotations by simply writing content to a file, and get them uploaded, formatted, and grouped nicely without having to learn the specifics of the Buildkite annotations API:

Annotations can help guide engineers on how to fix build issues.

The usage of iota types for both RunType and DiffType enables us to iterate over available types for some useful features. For example, turning a DiffType into a string gives a useful summary of what is included in the diff:

var allDiffs []string
ForEachDiffType(func(checkDiff Diff) {
    diffName := checkDiff.String()
    if diffName != "" && d.Has(checkDiff) {
        allDiffs = append(allDiffs, diffName)
    }
})
return strings.Join(allDiffs, ", ")
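
To make the iteration trick concrete, here is a minimal, self-contained sketch of an iota-based bitmask type that supports it. The categories (`Go`, `Client`, `Docs`) and the `end` sentinel are assumptions for illustration, not the real list of diff types in the pipeline generator:

```go
package main

import (
	"fmt"
	"strings"
)

// Diff is a bitmask of change categories (illustrative categories only).
type Diff uint32

const (
	Go Diff = 1 << iota
	Client
	Docs
	// end is a sentinel marking the end of the declared diff types.
	end
)

// Has reports whether d includes the target diff type.
func (d Diff) Has(target Diff) bool { return d&target != 0 }

// ForEachDiffType calls cb once for every declared diff type.
func ForEachDiffType(cb func(Diff)) {
	for d := Go; d < end; d <<= 1 {
		cb(d)
	}
}

// String names a single diff type, or summarises a combination of them.
func (d Diff) String() string {
	switch d {
	case Go:
		return "Go"
	case Client:
		return "Client"
	case Docs:
		return "Docs"
	}
	var all []string
	ForEachDiffType(func(check Diff) {
		if d.Has(check) {
			all = append(all, check.String())
		}
	})
	return strings.Join(all, ", ")
}

func main() {
	fmt.Println(Go | Docs)
}
```

Because every type is a power of two below the sentinel, adding a new category to the `const` block is enough for it to show up in every place that iterates.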

We can take that a bit further to iterate over all our run types and diff types in order to generate a reference page of what each pipeline does - since this page gets committed, it is also a good way to visualise changes to generated pipelines caused by code changes as well!

// Generate each diff type for pull requests
changed.ForEachDiffType(func(diff changed.Diff) {
    pipeline, err := ci.GeneratePipeline(ci.Config{
        RunType: runtype.PullRequest,
        Diff:    diff,
    })
    if err != nil {
        log.Fatalf("Generating pipeline for diff type %q: %s", diff, err)
    }
    fmt.Fprintf(w, "\n- Pipeline for `%s` changes:\n", diff)
    for _, raw := range pipeline.Steps {
        printStepSummary(w, "  ", raw)
    }
})

// For the other run types, we can also generate detailed information about what
// conditions trigger each run type!
for rt := runtype.PullRequest + 1; rt < runtype.None; rt += 1 {
    m := rt.Matcher()
    if m.Branch != "" {
        matchName := fmt.Sprintf("`%s`", m.Branch)
        if m.BranchRegexp {
            matchName += " (regexp match)"
        } else if m.BranchExact {
            matchName += " (exact match)"
        }
        conditions = append(conditions, fmt.Sprintf("branches matching %s", matchName))
        if m.BranchArgumentRequired {
            conditions = append(conditions, "requires a branch argument in the second branch path segment")
        }
    }
    if m.TagPrefix != "" {
        conditions = append(conditions, fmt.Sprintf("tags starting with `%s`", m.TagPrefix))
    }
    // etc.
}

A web version of this reference page is also published to the pipeline types reference. You can also check out the docs generation code directly!

Taking this even further, with run type requirements available we can also integrate run types into other tooling - for example, our developer tool sg can help you create builds of various run types from a command like sg ci build docker-images-patch to build a Docker image for a specific service:

// Detect what run-type someone might be trying to build
rt := runtype.Compute("", fmt.Sprintf("%s/%s", args[0], branch), nil)
// From the detected matcher, we can see if an argument is required and request it
m := rt.Matcher()
if m.BranchArgumentRequired {
    var branchArg string
    if len(args) >= 2 {
        branchArg = args[1]
    } else {
        branchArg, err = open.Prompt("Enter your argument input:")
        if err != nil {
            return err
        }
    }
    branch = fmt.Sprintf("%s/%s", branchArg, branch)
}
// Push to the branch required to trigger a build
branch = fmt.Sprintf("%s%s", rt.Matcher().Branch, branch)
gitArgs := []string{"push", "origin", fmt.Sprintf("%s:refs/heads/%s", commit, branch)}
if *ciBuildForcePushFlag {
    gitArgs = append(gitArgs, "--force")
}
run.GitCmd(gitArgs...)
// Query Buildkite API to get the created build
// ...

Using a similar iteration over the available run types we can also provide tooltips that automatically list out all the supported run types that can be created this way:

Check out the sg ci build source code directly, or the discussion behind the inception of this feature.

So now we have generated pipelines, documentation about them, the capability to extend pipeline specifications with additional features like tracing, and tooling that is integrated and automatically kept in sync with pipeline specifications - all derived from a single source of truth!

Learn more about our continuous integration ecosystem in our developer documentation, and check out the pipeline generator source code here.

Wrap-up

The generator approach has helped us build a low-maintenance and reliable ecosystem around parts of our infrastructure. Tailor-making such an ecosystem is a non-trivial investment at first, but as an organization grows and business needs become more specific, the investment pays off by making systems easy to learn, use, extend, integrate, validate, and more.

Also, it’s a lot of fun!

About Sourcegraph

Interested in joining? We’re hiring!

Mirroring GitHub permissions at scale

2021-10-08T00:00:00+00:00

As a tool for searching over all your code, accurately mirroring repository permissions defined in the relevant code hosts is a core part of Sourcegraph’s functionality. Typically, the only way to do this is through the APIs of code hosts, though rate limits can mean it can take several weeks to work through a large number of users and repositories.

This article goes over some of the work I did on improving GitHub permissions mirroring at Sourcegraph, with the help of several co-workers - primarily Joe Chen (who wrote most of Sourcegraph’s original permissions mirroring code and helped me get up to speed - and is also the author of some big open-source projects like gogs/gogs and go-ini/ini) and Ben Gordon (who helped a ton on the customer-facing side of things).

GitHub rate limits

The GitHub API has a base rate limit of 5000 requests an hour. Let’s look at what it takes to provide access lists for a user: with page size limits of 100 items per page, iterating over all users can take up to the following number of requests, all of which should ideally fall under the rate limit constraints:

\[\dfrac{\text{users} \times \text{repositories}}{100} < 5000\]

This means that we will need $\text{users} \times \text{repositories}$ to be greater than 500000 to hit rate limiting.

To come up with a hopefully representative example for this post, I found a random article that claims some companies are hiring upwards of 3000 to 5000 developers, so let’s consider a case of 4000 developers and 5000 repositories (Microsoft has about 4.5k public repos alone, not including anything private or hosted in different organizations), and we get the following time to sync:

\[\left(\dfrac{\text{4000} \times \text{5000}}{100} \times 2 \right) / 5000 = 80 \text{ hours}\]

Three days is okay, but definitely encroaching into the territory of “cannot be done in a weekend”. In practice, implementation details mean that realistically we will consume far more requests than this, since we currently perform several types of sync¹, so the process will likely take longer than 80 hours.
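
The estimate above is easy to play with for other fleet sizes. The following is just the formula from this section written as a small Go function; the parameter names are my own, and the `syncTypes` factor of 2 corresponds to the multiplier used in the equation:

```go
package main

import "fmt"

// hoursToSync estimates how long a full permissions sync takes: one request
// per page of results, multiplied by the number of sync types, divided by the
// hourly rate limit.
func hoursToSync(users, repos, pageSize, syncTypes, requestsPerHour int) float64 {
	requests := users * repos / pageSize * syncTypes
	return float64(requests) / float64(requestsPerHour)
}

func main() {
	// 4000 developers, 5000 repositories, 100 items per page, 2 sync types,
	// 5000 requests per hour.
	fmt.Printf("%.0f hours\n", hoursToSync(4000, 5000, 100, 2, 5000))
}
```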

The time to sync increases dramatically for even larger numbers of users and repositories - such as one customer that was projected to take upwards of an entire month to perform a full sync. Imagine paying thousands of dollars for a software product, only to have it unusable for the first month! Excessive rate limiting also means that permissions are far more likely to go stale, and can cause issues with other parts of Sourcegraph that also leverage GitHub APIs. The issue became a blocker for this particular customer, so we had to devise a solution to this issue.

Sourcegraph and repository authorization

I got my first hands-on experience with Sourcegraph’s authorization providers when expanding p4 protect support for the Perforce integration.

In a nutshell, Sourcegraph internally defines an interface authorization providers can implement to provide access lists for users (user-centric permissions) and repositories (repo-centric permissions) - authz.Provider - to populate a single source-of-truth table for permissions. This happens continuously and passively in the background. The populated table is then queried by various code paths that use the data to decide what content can and cannot be shown to a user.

Sourcegraph's repository permissions sync state indicator shows when the last sync occurred. Site administrators can also trigger a manual sync.

⚠️ Update: Since the writing of this post, I’ve contributed an improved and more in-depth description of how permissions sync works in Sourcegraph, if you are interested in a better overview: Repository permissions - Background permissions syncing.

For something like Perforce, user-centric sync is as simple as building a list of patterns from the Perforce protections table that work with PostgreSQL’s SIMILAR TO operator, like so:

// For the following p4 protect:
//    open user alice * //Sourcegraph/Engineering/.../Frontend/...
//    open user alice * //Sourcegraph/.../Handbook/...
// FetchUserPerms would return:
repos := []extsvc.RepoID{
    "//Sourcegraph/Engineering/%/Frontend/%",
    "//Sourcegraph/%/Handbook/%",
}

Repo-centric sync is left unimplemented in this case.
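
As a simplified sketch of the conversion shown above, the Perforce `...` wildcard can be mapped onto the `%` wildcard that `SIMILAR TO` understands. The function name is hypothetical, and it deliberately ignores everything else a real protections table can contain (other wildcards, exclusions, and escaping of characters such as `_` that `SIMILAR TO` also treats specially):

```go
package main

import (
	"fmt"
	"strings"
)

// toSimilarToPattern is a hypothetical, simplified conversion that only
// handles the Perforce "..." wildcard, mapping it to "%".
func toSimilarToPattern(depotPath string) string {
	return strings.ReplaceAll(depotPath, "...", "%")
}

func main() {
	for _, p := range []string{
		"//Sourcegraph/Engineering/.../Frontend/...",
		"//Sourcegraph/.../Handbook/...",
	} {
		fmt.Println(toSimilarToPattern(p))
	}
}
```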

For GitHub, we query for all private repositories a user can explicitly access via their OAuth token, and return a list in a similar manner:

hasNextPage := true
for page := 1; hasNextPage; page++ {
	var err error
	var repos []*github.Repository
	repos, hasNextPage, _, err = client.ListAffiliatedRepositories(ctx, github.VisibilityPrivate, page, affiliations...)
	if err != nil {
		return perms, errors.Wrap(err, "list repos for user")
	}
	for _, r := range repos {
		addRepoToUserPerms(extsvc.RepoID(r.ID))
	}
}

Note that for public repositories, Sourcegraph simply doesn’t enforce permissions, so authorization only needs to care about explicit permissions.

The above is where we bump into GitHub’s rate limits easily - in an organization with 5000 repositories, that’s up to 50 API requests for each and every user to page through all their repositories. The GitHub authorization implementation also does the same thing for repo-centric permissions by listing all users with access to each repository.

Introducing a cache

Caches don’t solve all problems, but in this case there was an opportunity to save significant amounts of work through caching. GitHub repository permissions at companies are typically distributed through teams and organizations - membership in either would grant you access to relevant repositories, and teams are strict subsets of organizations. There are still instances of direct permissions - where a user is explicitly added to a repository - but it is unlikely to find a case of repositories with thousands of users added explicitly.

This means that in the vast majority of cases, when querying for user Foo’s repositories, we are actually asking what teams and organizations Foo is in. At a high level, we could do the following instead:

1. Get Foo’s direct repository affiliations
2. Get the organizations Foo is in
   1. Get the teams Foo is in within each organization
3. For each organization and team:
   1. If the organization allows read permissions on all repositories, or Foo is an organization administrator, get all organization repositories from cache as part of Foo’s access list
   2. Get all team repositories from cache as part of Foo’s access list

Cache misses would prompt a new query to GitHub to mirror access lists for specific teams and organizations. In the best-case scenario, where all users are part of large teams and organizations and there are very few instances of being directly granted access to a repository, cache hits should be very frequent and greatly reduce the amount of work required. Going back to the earlier example of 4000 developers and 5000 repositories, we get a best case performance of:

\[\dfrac{(\text{teams} + \text{organizations}) \times \text{5000}}{100} = (\text{teams} + \text{organizations}) \times 50\]

Even if we had 100 teams and organizations, this would fit within the hourly rate limit - a huge improvement from the previously projected 80 hours. Even in the worst case, this would only be marginally less efficient than the existing implementation.
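
The cache-backed lookup described in this section can be sketched roughly as follows. This is an in-memory illustration under my own assumptions: a plain map stands in for the real cache, `fetch` stands in for the GitHub API calls, and all type and function names are hypothetical rather than taken from the actual implementation:

```go
package main

import "fmt"

// groupCache maps "org" or "org/team" keys to the repositories each group
// grants access to. A plain map stands in for the real cache here.
type groupCache struct {
	repos   map[string][]string
	fetch   func(group string) []string // simulated code host lookup
	fetches int                         // number of cache misses so far
}

// get returns the cached repositories for a group, fetching on a cache miss.
func (c *groupCache) get(group string) []string {
	if repos, ok := c.repos[group]; ok {
		return repos
	}
	c.fetches++
	repos := c.fetch(group)
	c.repos[group] = repos
	return repos
}

// userRepos combines direct affiliations with the cached repositories of every
// organization and team the user belongs to, skipping duplicates.
func userRepos(c *groupCache, direct, groups []string) []string {
	seen := map[string]bool{}
	var out []string
	add := func(repos []string) {
		for _, r := range repos {
			if !seen[r] {
				seen[r] = true
				out = append(out, r)
			}
		}
	}
	add(direct)
	for _, g := range groups {
		add(c.get(g))
	}
	return out
}

func main() {
	upstream := map[string][]string{
		"org":      {"repo-foo"},
		"org/team": {"repo-foo", "repo-bar"},
	}
	c := &groupCache{
		repos: map[string][]string{},
		fetch: func(group string) []string { return upstream[group] },
	}
	fmt.Println(userRepos(c, []string{"repo-direct"}, []string{"org", "org/team"}))
	fmt.Println(userRepos(c, nil, []string{"org/team"}))
	fmt.Println("cache misses:", c.fetches)
}
```

The second user in `main` is served entirely from cache, which is the effect that makes the number of requests scale with teams and organizations instead of users.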

To mitigate outdated caches, a flag was added to the provider interface to allow partial cache invalidation along the path of a sync (important because you don’t want every single team and organization queued for a sync all at once), and it was tied into the various ways of triggering a sync (notably webhook receivers and the API).

The approach was promising, and a feature-flagged² user-centric sync backed by a Redis cache was implemented in sourcegraph#23978 authz/github: user-centric perms sync from team/org perms caches.

Two-way sync

As mentioned earlier, Sourcegraph’s authorization providers provide two-way sync: user-centric and repo-centric. To make the cache-backed sync complete, equivalent functionality had to be implemented for repo-centric sync.

Because GitHub organizations are conveniently supersets of teams (unlike some code hosts), user-centric cache was implemented with either organization or organization/team as keys and a big list of repositories as its value:

org/team: {
    repos: [repo-foo, repo-bar]
}

To make this cache work both ways, I simply added users to the cache values, and implemented a similar approach to finding a repository’s relevant organizations and teams. In this case, a relevant organization would be one that has default-read access (otherwise members of an organization do not necessarily have access to said repository).

This makes for somewhat large cache values, but also makes it easy to perform partial cache updates. For example, if user user-foo is created and added to org/team, the user can be added to the cache for org/team during user-centric sync, and subsequent syncs of repo-foo and repo-bar will include the new user without having to perform a full sync, and vice versa.

org/team: {
    repos: [repo-foo, repo-bar]
    users: [user-bar, user-foo]
}

On paper, the performance improvements gained here are similar to the ones when implementing caching for user-centric sync, except scaling off the number of users in teams and organizations instead of repositories.
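
A minimal sketch of the partial update described above, assuming an in-memory map in place of the real cache. The `cachedGroup` type and `repoUsers` function are hypothetical names for illustration:

```go
package main

import (
	"fmt"
	"sort"
)

// cachedGroup is a hypothetical cache value holding both sides of the mapping.
type cachedGroup struct {
	Repos []string
	Users []string
}

// repoUsers returns the users that can access repo through any cached group.
func repoUsers(cache map[string]*cachedGroup, repo string) []string {
	seen := map[string]bool{}
	for _, g := range cache {
		for _, r := range g.Repos {
			if r == repo {
				for _, u := range g.Users {
					seen[u] = true
				}
			}
		}
	}
	users := make([]string, 0, len(seen))
	for u := range seen {
		users = append(users, u)
	}
	sort.Strings(users) // map iteration order is random, so sort for stable output
	return users
}

func main() {
	cache := map[string]*cachedGroup{
		"org/team": {Repos: []string{"repo-foo", "repo-bar"}, Users: []string{"user-bar"}},
	}
	// Partial update during a user-centric sync of the newly added user-foo.
	g := cache["org/team"]
	g.Users = append(g.Users, "user-foo")
	// A later repo-centric sync of repo-foo picks up the new user from cache.
	fmt.Println(repoUsers(cache, "repo-foo"))
}
```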

This was implemented in sourcegraph#24328 authz/github: repo-centric perms sync from team/org perms caches.

Scaling in practice

Throughout the implementation of the cache-backed GitHub permissions mirroring, a large number of unit tests were included, as well as a few integration tests, that tested the behaviour of various combinations of cache hits and misses.

To write integration tests, we use “golden testing”, where we record network interactions to a file (called “VCRs”). Tests then use the recorded network interactions instead of reaching out to external services by default, unless explicitly asked to update the recordings. Interestingly, despite the significant improvements of this approach for larger numbers of users and repositories, this also made clear just how inefficient the cache-based approach is for smaller instances:

with caching disabled, the integration test recorded just 2 network requests for repo-centric sync.
with caching enabled, the integration test recorded a whopping 22 network requests for repo-centric sync with the exact same number of repositories and users

This is why we continue to leave the cache-backed sync as an opt-in behaviour.

However, despite reasonably robust testing of the behaviour of the code, we had no way to easily perform an end-to-end test of this at the scale of thousands of repositories and users with the appropriate teams and organizations. In hindsight, I could have invested some effort into generating VCRs to emulate such an environment and test against it, but with the agreement of the customer requesting this the decision was made to ship the changes and ask them to try it out.

Debug logging

All was well at first in the trial run - the backlog of repositories queued for an initial permissions sync was very rapidly being worked through, with a projected 3-day time to full sync - a huge improvement from the previously projected 30 days. However, with just a few thousand repositories left to process, the sync stalled.

Metrics indicated jobs were timing out, and a look at the logs revealed thousands upon thousands of lines of random comma-delimited numbers. It seemed that printing all this junk was causing the service to stall, and sure enough, setting the log driver to none to disable all output on the relevant service allowed the sync to proceed.

Where did the log come from? I left a stray log.Printf("%+v\n", group) in there when I was debugging cache entries. At scale, each of these cache entries could contain many thousands of items, causing the system to degrade. Be careful what you log!
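
One way to keep this kind of debug output bounded is to log the size of an entry rather than its contents. A tiny sketch, with a hypothetical `summarize` helper that is not part of the actual fix:

```go
package main

import "fmt"

// summarize describes a cache entry by its size rather than its contents, so
// the log line stays short no matter how large the entry grows.
func summarize(key string, repos, users []string) string {
	return fmt.Sprintf("cached group %q: %d repos, %d users", key, len(repos), len(users))
}

func main() {
	fmt.Println(summarize("org/team", []string{"repo-foo", "repo-bar"}, []string{"user-foo"}))
}
```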

Postgres parameter limits

A service we call repo-updater has an internal service called PermsSyncer that manages a queue of jobs to request updated access lists using these authorization providers for users and repositories based on a variety of heuristics such as permissions age, as well as on events like webhooks and repository visits (diagram). Access lists returned by authorization providers are upserted into a single repo_permissions table that is the source of truth for all repositories a Sourcegraph user can access, and vice versa.

Entries can also be upserted into a table called repo_pending_permissions, which is home to permissions that do not have a Sourcegraph user attached yet. When a user logs in via a code host’s OAuth mechanism to Sourcegraph, the user’s Sourcegraph identity is attached to the user’s identity on that code host (this allows a Sourcegraph user to be associated with multiple code hosts), and relevant entries in repo_pending_permissions are “granted” to the user.

This means that once the massive number of repositories in the trial run was fully mirrored from GitHub, a user attempting to log in could have a huge set of pending permissions granted to it all at once. Of course, this broke with a fun-looking error:

execute upsert repo permissions batch query: extended protocol limited to 65535 parameters

I was able to reproduce this in an integration test of the relevant query by generating a set of 17000 entries:

{
	name:     postgresParameterLimitTest,
	updates: func() []*authz.UserPermissions {
		user := &authz.UserPermissions{
			UserID: 1,
			Perm:   authz.Read,
			IDs:    toBitmap(),
		}
		for i := 1; i <= 17000; i += 1 {
			user.IDs.Add(uint32(i))
		}
		return []*authz.UserPermissions{user}
	}(),
	expectUserPerms: func() map[int32][]uint32 {
		repos := make([]uint32, 17000)
		for i := 1; i <= 17000; i += 1 {
			repos[i-1] = uint32(i)
		}
		return map[int32][]uint32{1: repos}
	}(),
	expectRepoPerms: func() map[int32][]uint32 {
		repos := make(map[int32][]uint32, 17000)
		for i := 1; i <= 17000; i += 1 {
			repos[int32(i)] = []uint32{1}
		}
		return repos
	}(),
},

This would break because we were performing an insert of 4 values per row, and at 17000 rows we reach 68000 parameters bound to a query. Postgres uses Int16 codes to denote bind variables, which means a maximum of $2^{16} - 1 =$ 65535 parameters (hence the seemingly magic number indicated in the error).

INSERT INTO repo_permissions 
	(repo_id, permission, user_ids_ints, updated_at)
VALUES
	%s
ON CONFLICT ON CONSTRAINT
	/* ... */

Funnily enough, you can get around this by providing columns as arrays. In this case, if you can provide each of the 4 columns here as an array, that would only count for 4 parameters, allowing this insert to scale indefinitely!

Sadly, one of the columns here is of type INT[]. When I attempted to perform an UNNEST on an INT[][], it completely unwrapped the array instead of just unwrapping it by a single dimension like one might expect:

SELECT * FROM unnest(ARRAY['hello','world']::TEXT[], ARRAY[[1,2],[3,4]]::INT[][])

Frustratingly returns:

unnest	unnest
hello	1
world	2
	3
	4

When the desired result was just a one-dimensional unwrapping:

unnest	unnest
hello	[1, 2]
world	[3, 4]

I briefly toyed with the idea of hacking around this by combining the array type as a single string and splitting it on the fly:

SELECT
 a,
 string_to_array(b,',')::INT[]
FROM
 unnest(ARRAY['hello','world']::TEXT[], ARRAY['1,2,3','4,5,6']::TEXT[]) AS t(a, b)

An EXPLAIN ANALYZE on the 5000-row sample query that didn’t hit the parameter limit, however, indicated that the performance of this was about 5x worse than before (with a cost of 337.51, compared to the previous cost of 62.50). It was also a bit of a dirty hack anyway, so I ended up resorting to simply paging the insert instead to avoid hitting the parameter limit. This was implemented in sourcegraph#24852 database: page upsertRepoPermissionsBatchQuery.
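
The paging approach boils down to splitting rows so that no single statement binds more than the parameter limit. The following is a sketch of that arithmetic only, not the code from the linked PR; `pageRows` is a hypothetical helper and assumes `paramsPerRow` is between 1 and the limit:

```go
package main

import "fmt"

// maxParams is the bind parameter limit reported in the Postgres error above.
const maxParams = 65535

// pageRows splits n rows into [start, end) ranges so that each page binds at
// most maxParams parameters, given paramsPerRow parameters per row.
func pageRows(n, paramsPerRow int) [][2]int {
	rowsPerPage := maxParams / paramsPerRow
	var pages [][2]int
	for start := 0; start < n; start += rowsPerPage {
		end := start + rowsPerPage
		if end > n {
			end = n
		}
		pages = append(pages, [2]int{start, end})
	}
	return pages
}

func main() {
	// The failing case from the test above: 17000 rows with 4 values each.
	for _, p := range pageRows(17000, 4) {
		fmt.Printf("rows %d-%d: %d parameters\n", p[0], p[1], (p[1]-p[0])*4)
	}
}
```

With 4 parameters per row, each page holds at most 16383 rows, so the 17000-row case that previously failed becomes two statements.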

However, it seemed that this was not the only instance of us exceeding the parameter limits. Another query was running into a similar issue on a different customer instance. This time, there were no array types in the values being inserted, so I was able to try out the insert-as-arrays workaround:

INSERT INTO user_pending_permissions 
  (service_type, service_id, bind_id, permission, object_type, updated_at) 
- VALUES
-   %s
+ (
+   SELECT %s::TEXT, %s::TEXT, UNNEST(%s::TEXT[]), %s::TEXT, %s::TEXT, %s::TIMESTAMPTZ 
+ )
ON CONFLICT ON CONSTRAINT
  /* ... */

This implementation of the query was slower for smaller cases, but for larger datasets was either on par or faster than the original query:

Case	Accounts	Cost	Clock	Comparison
Before	100	`0.00..1.75`	287.071 ms
After	100	`0.02..1.51`	430.941 ms	~50% slower
Before	5000	`0.00..87.50`	7199.440 ms
After	5000	`0.02..75.02`	7218.860 ms	~same
Before	10000	`0.00..175.00`	16858.613 ms
After	10000	`0.02..150.01`	14566.492 ms	~13% faster
Before	15000	fail	fail
After	15000	`0.02..225.01`	22938.112 ms	success

I originally had the function decide which query to use based on the size of the insert, but during code review it was recommended that we just stick to one implementation for simplicity, since permissions mirroring happens asynchronously and is not particularly latency-sensitive.

This was implemented in sourcegraph#24972 database: provide upsertUserPendingPermissionsBatchQuery insert values as array.

Results

After working through the issues mentioned in this article as well as a variety of other minor fixes, the customer was finally able to run a full permissions mirror to completion with everything working as expected. The final result was roughly 2.5 days to full sync, a more than 10x improvement to the previously projected 30 days. The improved performance unblocked the customer in question on this front and will hopefully open the door for Sourcegraph to function fully in even larger environments in the future!

¹ See Two-way sync.
² Well, admittedly, it was only feature-flagged to off by default in a follow-up PR when I realised this required additional authentication scopes we do not request by default against the GitHub API (in order to query organizations and teams).

June 2021 updates for bobheadxi.dev

2021-06-20T00:00:00+00:00

With dark mode on every website nowadays, my website seems to have fallen a bit behind the times. I decided it was about time to give my website a bit of a facelift - and over-hype it with a blog post!

This round of improvements didn’t strictly happen this month, but a lot of it was spurred on by my recent reading of the iA Design Blog. I think their website is absolutely gorgeous, and it made the lacklustre look of bobheadxi.dev all the more apparent.

For the unfamiliar, my site started off over 2 years ago with the indigo Jekyll theme. I have since made quite a number of changes to it, mostly in random spurts of effort, and started writing about these periods of changes last year.

I quite like how things turned out for this set of changes - hope you do as well!

Updated typography

A big part of bobheadxi.dev is my blog posts, even though I’m unsure how many people read them (Google Analytics indicates a lot of traffic, particularly on my really old Object Casting in Javascript post). Anyway, I’ve always been rather dissatisfied with the reading experience on my site, but could never quite put my finger on what exactly was wrong with it.

All I knew was that I didn’t like the previous font - Helvetica Neue - but until I started using iA Writer recently, I didn’t have much of an inkling of what font I would like.

iA Writer uses these gorgeous fonts - aptly named Mono, Duo, and Quattro - that I think look so nice when typing and reading. They have a neat blog post introducing these fonts, and while I’m not really sure what this stuff means, I decided to make the switch.

This site now uses Quattro as its serif font, and Mono as its monospaced font. I think the results are quite nice.

Outdented heading anchors

While editing in iA Writer, headings get nicely outdented ‘#’s like so:

When I started thinking about it, I’m pretty sure this is a very common style in many websites already. Either way, I quite like how it looks, so I tried to replicate it on my site. I currently generate somewhat similar-looking (but not outdented) anchor links using allejo/jekyll-anchor-headings, which allows a little bit of customization - I can give the anchor link elements a class, for example, and style it through that.

 class="post-content">
    {% include anchor_headings.html html=content anchorBody='#' anchorClass='heading-anchor' beforeHeading=true %}

Turns out the outdenting can be achieved using the handy translateX transformation, and a bit of @media helps me scale this effect for smaller screens (where outdenting could position the anchors very close to the edge of your screen).

h1, h2, h3, h4
	// ... some CSS
	> .heading-anchor
		position: absolute
		transform: translateX(-2rem)
		@media #{$tablet}, #{$mobile}
			position: inherit
			transform: none

Sadly, I wasn’t able to figure out a trivial way to have the number of ‘#’s correspond to the depth of the heading, but I figured this was close enough, and it definitely improves the look of headings (in my opinion).

Bold introductions

Some books and blogs get big first letters for the first paragraph of a chapter or article. The effect looks nice on books, but I was never really sold on its usage in blog posts - though the look of an emphasised introduction is certainly striking. As I browsed through iA Design Blog, I noticed that their first paragraphs were big, and it made each essay feel much more compelling.

However, as I went about considering different options for making my intros real big as well, I realised a lot of my introductory paragraphs were complete garbage. While sometimes that was the intent - leading with a tangent before diving into the article’s main topic - they definitely did not age well.

So perhaps a fortunate side effect is that this prompted me to go back through my posts and make the bare minimum effort to make them a bit more interesting. At least I look like I know what I’m talking about now!

Exciting listings

I just learned about Jekyll’s post.excerpt feature that gives you the first paragraph of a blog post. Again inspired by the iA Design Blog, which uses excerpts instead of custom descriptions to great effect, I decided to use them here as well.

I think this gives a far better preview into the content of each post, and kind of makes them look more important. Thankfully my updating of each post’s first paragraphs to accommodate bigger introductions meant that the excerpts are at least somewhat meaningful.

I also made minor improvements such as adding an on-hover effect to the clickable tags, which previously had no indication they were clickable.

The big picture

I like to include all sorts of media in my blog posts - images, code snippets, diagrams, quotes, and more. Unfortunately, I also like somewhat narrow widths for my content, which makes for a poor viewing experience for various forms of media.

On articles in the Sourcegraph Blog (and I recall that you can do this on Medium as well), I noticed that images were “blown up” - wider than the content - and I thought the effect looked quite nice, giving an expansive canvas for media to be enjoyed while still maintaining a nice reading experience for all the other stuff.

To do this myself, I turned images I wanted to be blown up into their own elements and gave them expanded widths, along with captions. This also served nicely to standardise the raw HTML I’d been previously using to give images captions.

Big!!!!

Code blocks ran into similar problems, where snippets I didn’t carefully adjust to adhere to an 80-character line limit would have to be scrolled to be viewed, even on very wide screens. So I made them massive.

I’ve also always liked the big quotes used in magazine and newspaper sites to give quotes an even more authoritative and dramatic feel - so quotes joined the big club.

Mermaid diagrams and some other things I might have forgotten also got this treatment. Hopefully these changes make the reading experience more exciting!

Dark mode

And last but not least, the star of today’s show… dark mode! Because no site is complete without one.

The site now switches to dark mode if you have dark mode enabled on your device!

Luckily for me, the theme my site was based on made decent use of SASS variables for colours (though the naming of the colours left quite a bit to be desired, as you’ll see in a moment).

I found to my dismay that because these variables are compiled away at build time, they cannot be used to respond to prefers-color-scheme: dark, which seems to be the standard way to detect what theme you should show to the user.

Instead, I found some blog posts talking about CSS variables, which turn out to be the only way to have properly variable variables in stylesheets. To be honest this is the first time I’ve had to do something like this myself, and this was news to me!

My implementation ended up pretty straightforward, using attribute selectors and setting the theme in JavaScript, though I’m sure there are other ways to do this too (maybe even JavaScript-free?).

[data-theme="theme-light"]
    --background: #ffffff
    --alpha: #333
    --beta: #222
    --gama: #aaa
    --delta: #5A85F3
    --epsilon: #ededed
    --zeta: #666

[data-theme="theme-dark"]
    --background: #141414
    --alpha: #aaa
    --beta: #eeeeee
    --gama: #474747
    --delta: #5A85F3
    --epsilon: #202020
    --zeta: #929292

var prefersDark = false;
function setDarkMode(isDark) {
    const theme = `theme-${isDark ? 'dark' : 'light'}`;
    document.querySelector('html').dataset.theme = theme;
    prefersDark = isDark;
    console.log(`Set ${theme}`);
}

// set the initial theme
const prefersDarkMatch = window.matchMedia('(prefers-color-scheme: dark)');
setDarkMode(prefersDarkMatch.matches);

// watch for changes to the user's dark mode configuration
prefersDarkMatch.addEventListener('change', (e) => setDarkMode(e.matches));

Having the setDarkMode function available is useful for development, allowing me to switch between the modes via console, and I added the prefersDark variable… just because, I guess. Maybe handy if I want to add a button to toggle dark mode?

In the end, despite picking the colours semi-randomly and not making an awful lot of adjustments, I’m pretty happy with how this (in my opinion) quick effort turned out! I’m particularly pleased with how the blog listings look:

Up next

There are still a lot of issues with dark mode - most noticeably the company logos I’m using that don’t have transparent backgrounds, but also a few contrast issues in code highlighting.

There also seems to be an issue with the tags page where posts from different collections do not get included that I definitely want to fix now that interaction with tags is more prominent.

I recently wrote a newsletter featuring a ludicrous number of footnotes, and at some point I want to get Tufte “sidenotes” here so that I can abuse footnotes in my blog posts as well. Sadly, I haven’t found a particularly elegant solution to this, so I’m putting it off for the time being.

And, of course, I’m hoping to do more blog-writing as well.

That’s all for now - feel free to highlight anything on this post if you have comments or questions!

Semantic line breaks

2021-02-18T00:00:00+00:00

As an organisation grows, it becomes increasingly important to record knowledge and processes. One popular approach is using a collection of Markdown files, tracked in Git, where changes can easily be proposed and discussed. Unfortunately, the readability and understandability of these changes is often quite poor, negating much of the benefits of using a version control system.

Consider what a change - or a “diff” - usually looks like:

- this line was removed
+ this line was added

How does this play with changes to documentation? In general, Markdown files are written with line breaks at some arbitrary character column (such as 80 characters), or are written with entire paragraphs on a single line. Both these approaches have significant issues:

Line-breaking at some arbitrary character column looks nice when viewed in a terminal or code editor, but the consistency of line widths is easily lost when making and suggesting edits. Restoring it means reflowing entire paragraphs, which produces incomprehensible or uninformative diffs that are difficult to review.
Writing entire paragraphs on a single line is reasonably readable nowadays due to most editors and viewers performing wrapping out-of-the-box, but it makes suggestions and diffs difficult to review, since every single change causes a diff on an entire paragraph.

In the example above, the diff is small and there is not too much going on, so it is easy to see what has changed. Consider the following text, where we want to replace incididunt with I am so hungry:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.

If the text was broken at a character column, the resulting diff (including reflowing the text) might look like:

- Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt
+ Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor I am so
- ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation
+ hungry ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud
- ullamco laboris nisi ut aliquip ex ea commodo consequat.
+ exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.

This can be rather incomprehensible. If the text was not broken at all, the diff would then look like:

- Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
+ Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor I am so hungry ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.

This is marginally better, but still quite difficult to review, especially because not all Git interfaces can show you the specific word that has changed (and even fewer can do so for very, very long lines, as is the case for paragraphs of many sentences).

To combat this, the idea of semantic line breaks has been floated. The general idea is to perform line breaks along semantic boundaries, instead of just along paragraphs. An approach suggested at sembr.org sums this up as:

When writing text with a compatible markup language, add a line break after each substantial unit of thought.

This particular specification goes on to describe how this works:

Many lightweight markup languages, including Markdown, reStructuredText, and AsciiDoc, join consecutive lines with a space. Conventional markup languages like HTML and XML exhibit a similar behaviour in particular contexts. This behaviour allows line breaks to be used as semantic delimiters, making prose easier to author, edit, and read in source — without affecting the rendered output. […] By inserting line breaks at semantic boundaries, writers, editors, and other collaborators can make source text easier to work with, without affecting how it’s seen by readers.

In my interpretation, a good semantic line break specification then ought to:

Make use of how most Markdown specifications ignore single new lines to still provide a good rendered Markdown experience.
Leverage modern line-wrapping in most viewers to maintain a good raw Markdown experience.
Maintain understandable diffs in Markdown documentation for a good reviewing experience.

I quite like this idea! Perhaps semantic line breaks could allow us to break this paragraph of text into smaller chunks, and make small diffs significantly more approachable, simpler to reason about, and easier to discuss.

Solving unreadable changes

sembr.org proposes a set of rules that would make content easier to manage and make changes to. Their website presents the following example:

All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.

Their recommendation is to change this to:

All human beings are born free and equal in dignity and rights.
They are endowed with reason and conscience
and should act towards one another in a spirit of brotherhood.

Recommendation is the crux of the problem here, and is a significant barrier to adoption. The sembr.org specification depends entirely on the writer to maintain the appropriate formatting, and it leaves the interpretation of what a “semantic boundary” is at all up in the air. Nine of the twelve requirements in this particular specification are MAY’s, SHOULD’s, and RECOMMEND’s! This is sure to lead to:

Inconsistent and difficult documents, thanks to so much of the specification being up for interpretation.
Contributors forgetting to add, or simply not wanting to go through the trouble of adding, the necessary line breaks.
Someone is going to be frustrated at someone else’s very short lines, and refuse to format appropriately. Alternatively, they might disagree with someone else’s line breaks, and cause unnecessary churn in diffs.

All of these problems pose significant barriers to widespread adoption, which is necessary for any semantic line break specification to be of any use.

A formatter for semantic line breaks

A similar problem arises with code standards: semicolons? Spaces or tabs? Left up to individuals, no standard will ever be truly consistent, especially in the face of the need to “just get the job done”. In code formatting, this has mostly been solved through automated tooling. Why bother arguing about semicolons if a program will just do it for you, and will even check if everything is consistent?

What if the same thing could happen for documentation source: a tool to automatically format your text? To accommodate this, I propose a simpler specification that still offers a small amount of customization:

A semantic boundary is defined to be the end of a sentence.
Allow multiple short sentences to be part of a single line, up to a character threshold.
After a character threshold, a semantic boundary should be followed by a line break.

A simpler set of rules opens the door to potential automation (a program would not need to make as many complicated decisions), and still achieves part of our original goal: each diff is now confined to the semantic boundaries it touches, and more accurately reflects the idea being changed.
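To make these three rules concrete, here is a minimal sketch of them in Go. It is not how Readable is implemented: the splitSentences and semanticBreak names are made up for illustration, and sentence detection is deliberately naive (any word ending in ".", "!" or "?" ends a sentence, so abbreviations will be split incorrectly).

```go
package main

import (
	"fmt"
	"strings"
)

// splitSentences naively splits a paragraph into sentences. A sentence ends
// at any word that ends in '.', '!' or '?'. This is an assumption made for
// the sketch, not a robust sentence detector.
func splitSentences(paragraph string) []string {
	var sentences []string
	var current strings.Builder
	for _, word := range strings.Fields(paragraph) {
		if current.Len() > 0 {
			current.WriteByte(' ')
		}
		current.WriteString(word)
		if strings.HasSuffix(word, ".") || strings.HasSuffix(word, "!") || strings.HasSuffix(word, "?") {
			sentences = append(sentences, current.String())
			current.Reset()
		}
	}
	// keep any trailing text that has no sentence terminator
	if current.Len() > 0 {
		sentences = append(sentences, current.String())
	}
	return sentences
}

// semanticBreak keeps appending sentences to the current line, and only
// breaks the line at a sentence boundary once the line has reached the
// threshold (measured in bytes, for simplicity).
func semanticBreak(paragraph string, threshold int) string {
	var lines []string
	line := ""
	for _, sentence := range splitSentences(paragraph) {
		if line == "" {
			line = sentence
		} else {
			line += " " + sentence
		}
		if len(line) >= threshold {
			lines = append(lines, line)
			line = ""
		}
	}
	if line != "" {
		lines = append(lines, line)
	}
	return strings.Join(lines, "\n")
}

func main() {
	text := "Short one. Another short one. This sentence is long enough to pass the threshold by itself. Done."
	fmt.Println(semanticBreak(text, 40))
}
```

Because line breaks only ever happen at sentence boundaries, editing a word inside one sentence can never reflow the lines before it, which is what keeps the resulting diffs small.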

Returning to the Lorem ipsum example, with this version of semantic line breaks, our change might look like:

- Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
+ Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor I am so hungry ut labore et dolore magna aliqua.
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.

In this diff, it is significantly clearer what idea has changed, as encapsulated by the sentence it belongs in. This makes it easier to understand the context of the change being made, reason about it, and open discussions regarding it.

I’ve taken a stab at creating just such a tool, Readable, which will add semantic line breaks to any document for you with a single command, for example readable fmt **/*.md.

It will also feature commands to preview changes, perform changes as you edit, and checks that can be run in continuous integration. So far it seems very promising, but there are a lot of edge cases to sort out and fix still.

Readable is being built in TypeScript with Deno, a handy new TypeScript and JavaScript runtime. Follow the project on GitHub!

Extending Docker images with sidecar services

2020-06-21T00:00:00+00:00

Many open-source services are distributed as Docker images, but sometimes you’ll want to extend the functionality slightly - whether it be adding your own endpoints, manipulating configuration of the service within the Docker image, or something along those lines.

In some cases, such as for manipulating configuration, most images will allow you to mount configuration within the container or use environment variables, so you can build a proper sidecar service to do whatever updates you want and restart the target container. The same goes for extending endpoints - a proper sidecar can serve you well. You can have one service manage a large number of containers, which is what I did for a project I worked on at RTrade, Nexus.

There’s a significant convenience factor to keeping your service as a single container however - it’s far easier to distribute and easier to deploy, and if you are trying to extend an off-the-shelf service like Grafana that lives within a large, multi-service deployment like Sourcegraph, adding additional services becomes quite a pain. Heck, even adding an additional port is something that must have additional configuration propagated across an entire fleet of services across various deployment methods.

This article goes over the approach I took to achieve the following without significantly changing the public interface of our Grafana image:

subscribe to core Sourcegraph configuration from another service
apply changes to the Grafana instance through API calls or configuration changes
report problems in the sidecar process

While I’ll generally refer to Grafana in this writeup, you can apply it to pretty much any service image out there. I also use Go here, but you can draw from the same concepts to leverage your language of choice as well.

⚠️ Update: Since the writing of this post, we have pivoted on the plan (sourcegraph#11452) and most of the work here no longer lives in our Grafana distribution, but is instead a part of our Prometheus distribution - see sourcegraph#11832 for the new implementation. You can explore the source code on Sourcegraph, and relevant documentation here.

Most of this article still applies though, but with Prometheus + Alertmanager instead of Grafana.

Wrapping the sidecar and the service
Implementing the wrapper
- Adding endpoints
- Restarting the service
Source code and pull requests
About Sourcegraph

Wrapping the sidecar and the service

In a nutshell, the primary change made to the Grafana image is an adjustment to the entrypoint script:

- exec "/run.sh"               # run the Grafana image's default entrypoint
+ exec "/bin/grafana-wrapper"  # run our sidecar program, implemented as a wrapper

I’ll go over the specifics of the wrapper in the next section, since I think it’ll help to understand how we’re extending the vanilla image. You’ll want to set up a Dockerfile that builds the program and copies it over to the final image, which should be based on the vanilla image:

FROM golang:latest AS builder

# ... build your sidecar

FROM grafana/grafana:latest AS final

# copy your compiled program from the builder into the final image
COPY --from=builder /go/bin/grafana-wrapper /bin/grafana-wrapper

ENTRYPOINT ["/entry.sh"]

The goal here is to start a wrapper program that will start up your sidecar and the actual service within the image you are trying to extend (grafana/grafana in this case).

Implementing the wrapper

Depending on what level of functionality you want to achieve, this program can be as simple as a server that makes API calls to the main service. For example:

sequenceDiagram
    participant Sidecar
    participant Service
    note right of Service: the program you are extending

    activate Sidecar
    Sidecar->>Service: cmd.Start
    activate Service

    loop stuff
        Sidecar->>Service: Requests
        Service->>Sidecar: Responses
    end

    Service->>Sidecar: cmd.Wait returns
    deactivate Service

    deactivate Sidecar

This can be achieved using the Go standard library’s os/exec package to run the main image entrypoint, start up the sidecar, and simply wait for the entrypoint to exit.

import (
    "errors"
    "os"
    "os/exec"
)

// newGrafanaRunCmd instantiates a new command to run grafana.
func newGrafanaRunCmd() *exec.Cmd {
    cmd := exec.Command("/run.sh")
    cmd.Env = os.Environ() // propagate env to grafana
    cmd.Stderr = os.Stderr
    cmd.Stdout = os.Stdout
    return cmd
}

func main() {
    grafanaErrs := make(chan error)
    go func() {
        grafanaErrs <- newGrafanaRunCmd().Run()
    }()

    go func() {
        // your sidecar
    }()

    // wait for grafana to exit
    err := <-grafanaErrs
    if err != nil {
        // propagate exit code outwards
        var exitErr *exec.ExitError
        if errors.As(err, &exitErr) {
            os.Exit(exitErr.ProcessState.ExitCode())
        }
        os.Exit(1)
    } else {
        os.Exit(0)
    }
}

Adding endpoints

What if both your sidecar and the extended service expose endpoints over the network? Sure, you could simply have the sidecar listen on a separate port, but that would involve adding another port to expose on your container, and adds another point of configuration that dependents need to be aware of before they can connect to your service.

My solution to this is to keep the same container “interface” by having a reverse proxy listen on the exposed port, which would handle forwarding requests to either the main service or the sidecar.

graph TB
    subgraph Container
        R{Router}
        Sidecar
        Service
        ReverseProxy
    end

    Dependent <-- $PORT --> R
    R <-- sidecarHandler --> Sidecar
    R <--> ReverseProxy
    ReverseProxy <-- internalServicePort --> Service

Again, the Go standard library comes to the rescue with the net/http/httputil package. We also use gorilla/mux for routing in this example, but you can choose any routing library that serves your needs.

import (
    "errors"
    "fmt"
    "net/http"
    "net/http/httputil"
    "os"

    "github.com/gorilla/mux"
)

func main() {
    // ... as before

    router := mux.NewRouter()

    // route specific paths to your sidecar's endpoints
    router.PathPrefix("/sidecar/api").Handler(sidecar.Handler())

    // if a request doesn't route to the sidecar, route to your main service
    router.PathPrefix("/").Handler(&httputil.ReverseProxy{
        // the Director of a ReverseProxy handles transforming requests and
        // sending them on to the correct location, in this case another port
        // in this container (our service's internal port)
        Director: func(req *http.Request) {
            req.URL.Scheme = "http"
            req.URL.Host = fmt.Sprintf(":%s", serviceInternalPort)
        },
    })

    go func() {
        // listen on our external port - the port that will be exposed by the
        // container - to handle routing
        err := http.ListenAndServe(fmt.Sprintf(":%s", exportPort), router)
        if err != nil && !errors.Is(err, http.ErrServerClosed) {
            os.Exit(1)
        }
        os.Exit(0)
    }()

    // ... as before
}
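To try the routing idea without gorilla/mux or a real Grafana, the following is a self-contained sketch that uses only the standard library. The stand-in "service" and "sidecar" handlers, the newRouter, fetch and demo helpers, and the /dashboards path are all hypothetical, and httputil.NewSingleHostReverseProxy takes the place of the hand-written Director above.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/httputil"
	"net/url"
)

// newRouter sends /sidecar/api/ requests to the sidecar handler and proxies
// everything else to the service listening at the given URL.
func newRouter(sidecar http.Handler, service *url.URL) http.Handler {
	mux := http.NewServeMux()
	mux.Handle("/sidecar/api/", sidecar)
	mux.Handle("/", httputil.NewSingleHostReverseProxy(service))
	return mux
}

// fetch performs a GET request and returns the response body as a string.
func fetch(rawURL string) (string, error) {
	resp, err := http.Get(rawURL)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

// demo starts a fake service and the router on ephemeral local ports, then
// requests one sidecar path and one service path through the router.
func demo() (fromSidecar, fromService string, err error) {
	// stand-in for the service being extended, on its "internal" port
	service := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "service:"+r.URL.Path)
	}))
	defer service.Close()

	serviceURL, err := url.Parse(service.URL)
	if err != nil {
		return "", "", err
	}

	// stand-in for the sidecar's own endpoints
	sidecar := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "sidecar:"+r.URL.Path)
	})

	// stand-in for the single port exposed by the container
	container := httptest.NewServer(newRouter(sidecar, serviceURL))
	defer container.Close()

	if fromSidecar, err = fetch(container.URL + "/sidecar/api/status"); err != nil {
		return "", "", err
	}
	fromService, err = fetch(container.URL + "/dashboards")
	return fromSidecar, fromService, err
}

func main() {
	fromSidecar, fromService, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println(fromSidecar)
	fmt.Println(fromService)
}
```

Dependents only ever talk to the one exposed port, so whether a path is answered by the sidecar or by the wrapped service stays an internal detail of the container.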

Restarting the service

In my case, I eventually had to add restart capabilities, since some configuration changes required the service to be restarted.

Simply restarting the container was not an option, since it would complicate how the configuration would persist, and would cause us to lose the advantage of having a single self-isolated container that required no external care.

Fortunately, exec.Cmd, once started, provides an *os.Process that we can use to stop an existing process. I introduced a controller that would expose functions through which the sidecar can stop and start the main service:

type grafanaController struct {
    mux  sync.Mutex
    proc *os.Process
}

Stopping is pretty straightforward - if the service is running, proc will be non-nil, and we can simply signal it to stop:

func (c *grafanaController) Stop() error {
    c.mux.Lock()
    defer c.mux.Unlock()

    if c.proc != nil {
        if err := c.proc.Signal(os.Interrupt); err != nil {
            return fmt.Errorf("failed to stop Grafana instance: %w", err)
        }
        _, _ = c.proc.Wait() // this can error for a variety of irrelevant reasons
        if err := c.proc.Release(); err != nil {
            c.log.Warn("failed to release process", "error", err)
        }
        c.proc = nil
    }
    return nil
}

Notice how this is starting to look a bit gnarly:

A failed proc.Wait() does not strictly indicate that the shutdown failed, but could also indicate that the process shut down immediately (before proc.Wait() could run). However, it is still important to wait, since a signal does not indicate the service has stopped completely.
A failed proc.Release() does not strictly indicate a fatal error, so we log and continue as if nothing has happened.

Starting the service is even less appealing - we can’t just start the service on a goroutine and ignore it, since we want to be aware of and log errors. However, not every error should be fatal, and the line is blurry.

func (c *grafanaController) RunServer() error {
    c.mux.Lock()
    defer c.mux.Unlock()

    // spin up grafana and track process
    c.log.Debug("starting Grafana server")
    cmd := newGrafanaRunCmd()
    if err := cmd.Start(); err != nil {
        return fmt.Errorf("failed to start Grafana: %w", err)
    }
    c.proc = cmd.Process

    // capture results from grafana process
    go func() {
        // cmd.Wait output:
        // * exits with status 0 => nil
        // * command fails to run or stopped => *exec.ExitError
        // * other IO error => error
        if err := cmd.Wait(); err != nil {
            var exitErr *exec.ExitError
            if errors.As(err, &exitErr) {
                exitCode := exitErr.ProcessState.ExitCode()
                // unfortunately grafana exits with code 1 on sigint
                if exitCode > 1 {
                    c.log.Crit("grafana exited with unexpected code", "exitcode", exitCode)
                    os.Exit(exitCode)
                }
                c.log.Info("grafana has stopped", "exitcode", exitCode)
                return
            }
            c.log.Warn("error waiting for grafana to stop", "error", err)
        }
    }()

    return nil
}

Just like errors in libraries, exit codes are often up to the jurisdiction of the developer, and in this case Grafana does not give us a useful indication of whether a process has stopped because of an intentional SIGINT, or if a fatal error occurred causing it to exit (and indicating that we should exit our controller).

You can add some additional management (e.g. a thread-safe flag or channel to indicate that a shutdown has been triggered intentionally, and only exit on code 1 if this flag is not set), but the complexity of what is meant to be a simple wrapper will quickly ramp up.
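As a rough idea of what that extra management could look like, here is a minimal sketch of such a flag. It is not part of the actual wrapper: the stopTracker type and its MarkIntentional and ShouldExit methods are hypothetical names, and it assumes (as observed above) that Grafana exits with code 1 on SIGINT.

```go
package main

import (
	"fmt"
	"sync"
)

// stopTracker remembers whether the sidecar itself asked the service to stop.
type stopTracker struct {
	mu          sync.Mutex
	intentional bool
}

// MarkIntentional records that the next exit of the service is expected,
// for example because the sidecar is about to restart it.
func (s *stopTracker) MarkIntentional() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.intentional = true
}

// ShouldExit reports whether the wrapper should exit for the given service
// exit code. Codes above 1 are always treated as fatal. Code 1 is only
// fatal if no intentional stop was recorded. The flag is reset on every
// call, so it covers exactly one exit of the service.
func (s *stopTracker) ShouldExit(exitCode int) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	wasIntentional := s.intentional
	s.intentional = false
	if exitCode == 1 {
		return !wasIntentional
	}
	return exitCode > 1
}

func main() {
	tracker := &stopTracker{}

	// the service died with code 1 and nobody asked it to stop
	fmt.Println("unexpected exit, should exit wrapper:", tracker.ShouldExit(1))

	// the sidecar requested a stop, then the service exited with code 1
	tracker.MarkIntentional()
	fmt.Println("requested stop, should exit wrapper:", tracker.ShouldExit(1))
}
```

In the controller, Stop would call MarkIntentional before signalling the process, and the goroutine that waits on cmd.Wait would ask ShouldExit instead of checking exitCode > 1 directly. Even this small amount of state has to be reasoned about alongside the existing mutex, which illustrates how quickly the complexity grows.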

At the time of writing I’m not sure that any additional handling is required for a reasonable experience, but I’ll be keeping an eye on how this code behaves.

Note that this also means we can no longer just block on the main program until the service exits, since it can exit (intentionally) at any time - instead, we must depend on an external SIGINT to tell us when to stop:

import (
    "os"
    "os/signal"
)

func main() {
    // ... mostly as before

    c := make(chan os.Signal, 1)
    signal.Notify(c, os.Interrupt)
    <-c
    if err := grafana.Stop(); err != nil {
        log.Warn("failed to stop Grafana server", "error", err)
    }
}

Source code and pull requests

And that’s it for a rudimentary sidecar service that allows you to continue treating a service container as a completely isolated unit!

Some relevant pull requests implementing these features:

sourcegraph#11427 - I ended up reverting this due to bugs in certain environments and adding it back in sourcegraph#11483, but both PRs include relevant discussions. These PRs implement a basic sidecar without start and restart capabilities.
sourcegraph#11554 adds the ability for the sidecar to start and restart the main service.

Note that most of the above work has been superseded by a pivot to Prometheus (see the update at the start of this post). Following the pivot, a lot of other work was enabled by the addition of this sidecar:

sourcegraph#12010 (implementation: sourcegraph#12491) proposed a mechanism for denoting ownership in our monitoring and routing alerts appropriately.
sourcegraph#17602 demonstrated potential summary capabilities a sidecar can export.
sourcegraph#17014 and sourcegraph#17034 add timestamped links to relevant Grafana panels to alert messages.

About Sourcegraph

Learn more about Sourcegraph here.
