At OSCI we’re looking at the container world to help us host services for our communities in an easier way. We’ve been using VMs via libvirt for most of our workload and that works well, but some features embedded in the container workflow are really interesting. I especially like that updates are built from scratch, with no leftovers from a previous deployment, and that the readiness/liveness checks are more proactive than traditional monitoring.

Containers introduce new ways to run applications and the current trend is to run one process per container. Unfortunately that’s simply not possible for most workloads because existing software is not architected to work that way. Even running a single binary often results in forks to drop capabilities, or multiple forks to spawn workers.

Moreover not everyone agrees on this model; some now consider running a service manager as PID 1 to be part of the UNIX API. It’s true that systemd nowadays does more than just reaping zombie processes. The way a service needs to be spawned is clearly defined in service files and there’s no need to reinvent the wheel.

Anyway, Jon Trossbach, our former intern, worked on containerization of postfix and he concluded that a major overhaul of the software design would need to happen to adapt to this new model. Even if you cannot enjoy all the benefits of the container model you may still wish to use its workflow and some of its features. That’s why I’ve been experimenting with our OpenShift Dedicated account to make this use case functional.

Initial tribulations

I decided to create a demo site for LDAPWalker, a shell-like CLI for LDAP operations. Thanks to ttyd I am able to display an interactive shell with the application. To make the demo more enjoyable I decided to host a writable OpenLDAP instance so you could try almost all commands.

The demo uses Debian images, but the official one does not integrate systemd, so I initially chose John Goerzen’s minimal image for that purpose.

Even with the simplest Dockerfile the container failed to start and there was no output. I was puzzled, but after some digging I discovered the tty option in the container parameters of my DeploymentConfig. The configuration looked like this:

apiVersion: v1
kind: DeploymentConfig
metadata:
  name: ldap-server
  namespace: prod-ldapwalker-demo-duckcorp-org
spec:
  strategy:
    type: Recreate
  replicas: 1
  selector:
    name: ldap-server-container
  template:
    metadata:
      creationTimestamp: null
      labels:
        name: ldap-server-container
    spec:
      containers:
        - name: ldap-server
          ports:
            - containerPort: 389
              protocol: TCP
          tty: true
          image: ' '
  triggers:
    - type: ImageChange
      imageChangeParams:
        automatic: true
        containerNames:
          - ldap-server
        from:
          kind: ImageStreamTag
          namespace: prod-ldapwalker-demo-duckcorp-org
          name: 'ldap-server:latest'
    - type: ConfigChange

Now I could see the output:

Starting systemd
Failed to mount tmpfs at /run: Operation not permitted
Failed to mount tmpfs at /run/lock: Operation not permitted
[!!!!!!] Failed to mount API filesystems.
Exiting PID 1...

Setup for systemd, do I need cgroupsv2?

I started looking at mounting the necessary tmpfs and that was easily done, but it was not sufficient. systemd needs to write its own tree of cgroups and that was not possible.
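
For illustration, one way to provide tmpfs mounts in a pod is memory-backed emptyDir volumes. This is only a sketch: the mount paths are taken from the error messages above, the volume names are hypothetical, and the exact mounts used in this experiment may have differed.

```yaml
# Fragment of a pod template (spec.template.spec in a DeploymentConfig).
# Volume names "run" and "run-lock" are hypothetical.
# An emptyDir with medium: Memory is backed by tmpfs.
containers:
  - name: ldap-server
    volumeMounts:
      - name: run
        mountPath: /run
      - name: run-lock
        mountPath: /run/lock
volumes:
  - name: run
    emptyDir:
      medium: Memory
  - name: run-lock
    emptyDir:
      medium: Memory
```

As noted, this only removes the mount errors; it does nothing for the cgroup tree systemd wants to write.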

My initial reading suggested mounting /sys/fs/cgroup read-only from the host but, probably for security reasons, that’s not something OpenShift was letting me do.

The more I dug, the more I was persuaded I needed cgroupsv2, since it brings a lot of improvements and modern OSes, like the image I chose, are transitioning to a cgroupsv2-only setup. I looked at Fraser Tweedale’s work about cgroupv2 on OpenShift and wanted to follow the steps. Unfortunately, to update the MachineConfig in your cluster you need more permissions than a cluster admin holds. I reached out to the support team (like any client would) and asked for help. Making these changes to customize a cluster was not possible and they were sure my problem could be solved with cgroupsv1.

Discussing the problem with the support team, I learned there is a heuristic in the cri-o runtime to detect systemd and set up the mounts for it, but that did not work in my case because in John Goerzen’s image the entrypoint is neither /sbin/init nor a path that resolves to a path containing “systemd” (I could not find this documented anywhere, but the code speaks for itself).

I switched to another image (sources), which only implements minor changes over the official Debian image to add systemd, and set the ENTRYPOINT to “/lib/systemd/systemd”. Since OpenShift uses rootless containers by default I had to add the --system parameter to ensure systemd starts as a system manager.

I tried running a test image with oc run pod-systemd -ti --rm --image-pull-policy=Always --image=<test-image> --restart=Never and that worked fine \o/.
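
For readers who prefer manifests, the oc run test above corresponds roughly to the Pod below. This is a sketch derived from the command-line flags, not a file from the actual setup; `<test-image>` stays a placeholder as in the command, and the container name is an assumption.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-systemd
spec:
  restartPolicy: Never          # --restart=Never
  containers:
    - name: pod-systemd         # assumed: same name as the pod
      image: <test-image>       # placeholder, as in the command above
      imagePullPolicy: Always   # --image-pull-policy=Always
      stdin: true               # -i
      tty: true                 # -t
```

The --rm flag has no manifest equivalent; it only tells oc to delete the pod once the session ends.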

Why is it still not working?

Back in my DeploymentConfig setup I made all the adaptations and… it was a total failure. After some digging I found out that these containers were not run using the same security policy (SCC) as my test pod. In fact you have to switch to a rootful container to be able to run systemd, and that was what was happening in my tests. Because I’m a cluster admin I was by default given more leeway in the containers I ran, but our deployment is done using a specific account which is not as powerful (we use Ansible to generate the YAML configuration and the API to push it).
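
One way to see which SCC admitted a given pod is the openshift.io/scc annotation that OpenShift sets on it. The fragment below sketches what that part of a pod’s YAML looks like; the pod name and the value shown are illustrative only, not output captured from this cluster.

```yaml
# Fragment of a pod as returned by the API (illustrative values).
metadata:
  name: ldap-server-1-abcde     # hypothetical pod name
  annotations:
    openshift.io/scc: restricted  # SCC the pod was admitted under
```

Comparing this annotation between the test pod and a pod created by the deployment makes the difference in policy visible.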

In the default OpenShift configuration there is already a role for this use case called system:openshift:scc:anyuid. I just followed the OpenShift documentation and created a service account as well as a role binding:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: ldapwalker
---
apiVersion: v1
kind: RoleBinding
metadata:
  name: sa-to-scc-anyuid
  namespace: prod-ldapwalker-demo-duckcorp-org
subjects:
  - kind: ServiceAccount
    name: ldapwalker
roleRef:
  kind: ClusterRole
  name: 'system:openshift:scc:anyuid'
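
The binding only has an effect on pods that actually run under that service account. The post does not show that part, so the fragment below is an assumption about how the DeploymentConfig would reference it, using the standard serviceAccountName field of the pod spec.

```yaml
# Fragment of the DeploymentConfig (assumed, not shown in the original).
spec:
  template:
    spec:
      serviceAccountName: ldapwalker  # service account bound to the anyuid SCC role
      containers:
        - name: ldap-server
          tty: true
```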

Now my deployment is working fine and the service is operational!