user: implement user factory #106
Conversation
lzrd left a comment
I'm currently testing against this PR and noticed one minor doc vs code issue.
jclulow left a comment
I've started taking a look at this, and have left some thoughts on what I've seen so far. I think it would help to have more of a complete picture of how this will get deployed and configured for hubris CI as well, when evaluating it all.
/*
 * Install the agent binary with the control program name in a location in
 * the default PATH so that job programs can find it.
 */
let cprog = format!("/usr/bin/{CONTROL_PROGRAM}");
I don't want to move the control program location for environments where it has already existed in /usr/bin. That's part of what this abstraction is about:
Lines 42 to 98 in c7805b4
/*
 * Ubuntu 18.04 had a genuine pre-war separate /bin directory!
 */
let binmd = std::fs::symlink_metadata("/bin")?;
if binmd.is_dir() {
    std::os::unix::fs::symlink(
        format!("../usr/bin/{CONTROL_PROGRAM}"),
        format!("/bin/{CONTROL_PROGRAM}"),
    )?;
}
<!DOCTYPE service_bundle SYSTEM '/usr/share/lib/xml/dtd/service_bundle.dtd.1'>
<service_bundle name='buildomat-worker' type='manifest'>
  <service name='site/buildomat/factory-user-worker' type='service' version='0'>
    <exec_method name='start' type='method' timeout_seconds='60' exec='{{exec}}' />
We ought to use a method context here that constrains the process to the unprivileged build user for the instance, so that it doesn't start out running as root -- like buildomat/factory/hubris/smf/hubris.xml, lines 12 to 14 in c7805b4.
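The referenced hubris.xml lines aren't reproduced above; as a hedged sketch based on smf_method(5) -- the 'build' user and group names are placeholders, not from this PR -- such a method context might look like:

<exec_method name='start' type='method' timeout_seconds='60' exec='{{exec}}'>
  <method_context>
    <!--
      'build' stands in for the per-instance unprivileged user; the
      privileges attribute anticipates the points below about proc_chroot
      and proc_info.
    -->
    <method_credential user='build' group='build'
      privileges='basic,!proc_info,proc_chroot' />
  </method_context>
</exec_method>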
In order for the chroot(2) to work we'll need to grant the process the proc_chroot privilege (not in the basic set). Then we'll need to drop that privilege as soon as the chroot() is done, using setppriv(2), prior to doing anything else so that when we then download and run the agent binary it can't chroot again.
We probably also want to remove proc_info (which is part of the basic set) so that you can't see other processes on the system that belong to other users/jobs in, say, ps(1) output. There might be other privileges that it makes sense to chuck out here, but that's the one that comes to mind immediately.
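As a hedged sketch of that sequence -- assuming the usual illumos privileges(7) interfaces, hand-declared here from setppriv(2) and priv_str_to_set(3C) since the libc crate may not expose them -- the post-chroot privilege drop might look like:

use std::ffi::CString;
use std::io::{Error, Result};

/* Opaque privilege set; allocated and freed by libc. */
#[repr(C)]
struct PrivSet {
    _opaque: [u8; 0],
}

extern "C" {
    fn priv_str_to_set(buf: *const libc::c_char, sep: *const libc::c_char,
        endptr: *mut *const libc::c_char) -> *mut PrivSet;
    fn setppriv(op: libc::c_int, which: *const libc::c_char,
        pset: *const PrivSet) -> libc::c_int;
    fn priv_freeset(sp: *mut PrivSet);
}

/* PRIV_OFF from <sys/priv.h>; "Permitted" names the permitted set. */
const PRIV_OFF: libc::c_int = 1;

fn chroot_and_drop(root: &str) -> Result<()> {
    let root = CString::new(root).unwrap();
    if unsafe { libc::chroot(root.as_ptr()) } != 0 {
        return Err(Error::last_os_error());
    }
    if unsafe { libc::chdir(c"/".as_ptr()) } != 0 {
        return Err(Error::last_os_error());
    }

    /*
     * Remove proc_chroot (so the agent can't chroot again) and proc_info
     * (so it can't inspect other users' processes).  Dropping a privilege
     * from the permitted set also removes it from the effective set.
     */
    let set = unsafe {
        priv_str_to_set(c"proc_chroot,proc_info".as_ptr(), c",".as_ptr(),
            std::ptr::null_mut())
    };
    if set.is_null() {
        return Err(Error::other("priv_str_to_set failed"));
    }
    let r = unsafe { setppriv(PRIV_OFF, c"Permitted".as_ptr(), set) };
    unsafe { priv_freeset(set) };
    if r != 0 {
        return Err(Error::last_os_error());
    }
    Ok(())
}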
Because this factory intends to support multiple concurrent jobs on the machine, we should also look at putting each build user in a separate project(5), and then setting some resource_controls(7) on those projects to prevent one job from having too much of an impact on other jobs that are running concurrently. We might also want to look at the FSS(4) scheduler, which can provide some amount of scheduler fairness at the project rather than process level.
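For concreteness, a hypothetical /etc/project entry along these lines (the project name, id, user, and rctl values are all illustrative) would cap a worker's LWP count and give it an FSS share:

buildomat-worker-0:4242:buildomat worker 0:build0::project.max-lwps=(priv,2000,deny);project.cpu-shares=(priv,10,none)

With FSS in use, project.cpu-shares is what lets the scheduler arbitrate between concurrent jobs at the project level rather than per process.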
fn root_dir(worker: WorkerId) -> PathBuf {
    Path::new("/var/run/buildomat/worker-roots").join(worker.to_string())
}
I think we ought to create a two-tier structure here (sketched below):
- a top level, /var/run/buildomat/worker/WORKER_ID, which would be owned by the (unprivileged) user and group for the worker, and mode 0700, so that it's only visible and traversable to the specific worker
- another directory one level down, e.g., /var/run/buildomat/worker/WORKER_ID/root, which could then be owned root:root and mode 0755 like the real root directory
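A minimal sketch of that layout, assuming std's unix extensions and hypothetical uid/gid arguments for the worker user:

use std::os::unix::fs::{chown, PermissionsExt};
use std::path::{Path, PathBuf};

fn create_root_dirs(worker: &str, uid: u32, gid: u32)
    -> std::io::Result<PathBuf>
{
    /*
     * Top level: owned by the unprivileged worker user, mode 0700, so
     * only that worker can see or traverse it.
     */
    let top = Path::new("/var/run/buildomat/worker").join(worker);
    std::fs::create_dir_all(&top)?;
    chown(&top, Some(uid), Some(gid))?;
    std::fs::set_permissions(&top, std::fs::Permissions::from_mode(0o700))?;

    /*
     * One level down: the worker's root, owned root:root and mode 0755
     * like the real root directory.
     */
    let root = top.join("root");
    std::fs::create_dir(&root)?;
    chown(&root, Some(0), Some(0))?;
    std::fs::set_permissions(&root, std::fs::Permissions::from_mode(0o755))?;

    Ok(root)
}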
let available_targets = c
    .config
    .slots
    .iter()
    .filter(|(name, _)| !used_slots.contains(name.as_str()))
    .map(|(_, slot)| slot.target.clone())
    /*
     * Deduplicate the targets by first collecting into a HashSet.
     */
    .collect::<HashSet<_>>()
    .into_iter()
    .collect::<Vec<_>>();
When determining which targets are available, I think we need to be able to specify some way to check the health of each configured slot. This is a piece that I had not yet completed for the hubris factory, but I think is relatively critical: we need to be able to check for the presence of the expected set of USB devices (debug probes, serial ports, etc) prior to taking a lease from the server. Otherwise, it seems likely that some of the time we'll have broken slots that absorb and then fail jobs, especially when we have more than one slot on a system with different probes.
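One possible shape for such a check, sketched under the assumption that each slot can enumerate the device paths it needs (the Slot fields here are hypothetical, not this PR's config):

use std::collections::HashSet;
use std::path::Path;

struct Slot {
    target: String,
    /* e.g., probe device nodes, serial ports: "/dev/term/a", ... */
    required_devices: Vec<String>,
}

fn slot_is_healthy(slot: &Slot) -> bool {
    slot.required_devices.iter().all(|d| Path::new(d).exists())
}

fn available_targets<'a>(slots: impl Iterator<Item = &'a Slot>)
    -> Vec<String>
{
    slots
        .filter(|s| slot_is_healthy(s))
        .map(|s| s.target.clone())
        .collect::<HashSet<_>>()
        .into_iter()
        .collect()
}

Filtering before the dedup step means a slot with a missing probe simply never advertises its target, so no lease is taken for it.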
I was planning on deferring health checking to a future PR: it doesn't strictly block deploying an MVP of the Hubris hardware CI, and there are some alternate ideas I have on how to possibly implement this. Would it be ok to defer health checking to a future PR?
The alternate idea I had was to delegate the health checking to the job itself, adding a bmat worker mark-broken -m "message" command that marks the worker as failed and puts it on hold. The hold would both alert operators (once I implement monitoring for held workers) and keep the slot reserved, preventing other jobs from starting on it until the hold is released.
It would be ok for user-factory to take on the responsibility of assigning a list of system resources as defined in some pool and required by some slot. The workaround I'm using right now is to set the devices up as owned by a group. See the note elsewhere about additional groups not being set in the current commit.
Resources include (all optional depending on the testbed): SP probe, RoT probe, USB to serial device, power control, IPv6 network access to an SP. We could add logic probes or other devices as well in certain cases.
The workaround I'm using right now is to set the devices up as owned by a group.
That's the core of my design for the user factory. For Hubris CI those resources are required, yes, but other uses of the factory in the future might need different resources and I kinda don't want to keep expanding the set of devices the factory understands.
#[serde(default)]
pub(crate) add_to_groups: Vec<String>,
#[serde(default)]
pub(crate) env: HashMap<String, String>,
Do you have an example configuration that includes all the environment variables you'd be specifying through this mechanism?
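For illustration only (this is not from the PR), a slot entry exercising both fields might look something like the following, with hypothetical variable names standing in for whatever the jobs end up needing:

[slots.gimlet0]
target = "hubris-gimlet"
add_to_groups = ["staff"]

[slots.gimlet0.env]
# Hypothetical examples: point the job at this slot's probe and console.
HUMILITY_PROBE = "usb:0483:374e:000012345678"
BUILDOMAT_SP_CONSOLE = "/dev/term/a"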
Force-pushed a9266cd to f39679b
Force-pushed 2683833 to f2bcec5
Force-pushed f39679b to 4dc2397
Group id issue for USB device ownership: with a slot's add_to_groups = ['staff'], the ephemeral worker process gets EACCES on the group-owned device.
Fix: a pre_exec hook calling libc::initgroups(user, primary_gid) just before exec in …
Workaround: install everything at world-traversable system paths (/opt/...) so workers don't …
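A hedged sketch of that fix, assuming the factory spawns workers via std::process::Command and that the libc crate exposes initgroups(3C) on the target. The supplementary groups must be set inside pre_exec, while the child is still root and before the setuid, or initgroups will fail with EPERM:

use std::ffi::CString;
use std::io::{Error, Result};
use std::os::unix::process::CommandExt;
use std::process::{Child, Command};

fn spawn_as_worker(program: &str, user: &str, uid: u32, gid: u32)
    -> Result<Child>
{
    let name = CString::new(user).map_err(Error::other)?;
    let mut cmd = Command::new(program);
    unsafe {
        cmd.pre_exec(move || {
            /*
             * Still root here.  initgroups(3C) loads the supplementary
             * groups (e.g., 'staff') from the group database, so
             * add_to_groups actually takes effect; then drop to the
             * worker's gid and uid.
             */
            if libc::initgroups(name.as_ptr(), gid) != 0 {
                return Err(Error::last_os_error());
            }
            if libc::setgid(gid) != 0 {
                return Err(Error::last_os_error());
            }
            if libc::setuid(uid) != 0 {
                return Err(Error::last_os_error());
            }
            Ok(())
        });
    }
    cmd.spawn()
}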
Co-Authored-By: Joshua M. Clulow <jmc@oxide.computer>
Force-pushed 4dc2397 to 290b367
This PR implements a new factory for buildomat, which runs jobs as ephemeral users on the same host system that runs the factory. Documentation on how to use the factory is available in the factory README.
I was careful during the implementation of the factory to make sure it will always attempt to clean up after itself (never releasing a slot until the cleanup stage finishes) and that it alerts the operator when something goes wrong (by failing the worker, which triggers a hold on it; I plan to add monitoring for held workers in the future).
This PR also makes multiple changes to the agent installation to support this, each in its own commit. I can move those to a single separate PR or multiple separate PRs if you'd prefer.
The implementation of the factory was based on @jclulow's 2024 work on a work-in-progress hubris factory.