I currently run my cloud hosts in a manual-intervention-required configuration for storage. Even the VPNs can’t come up until someone connects and provides a key to the hypervisor to unlock storage and start services.[1]
This is a configuration I intend to continue when setting up redundant community hosting. But I’d like to design something that is more autonomous and reliable than the current plan of copying and pasting the key into a terminal. Among other goals, I’d like to support easy sharing between admins with individual unlock keys, and single-click automation when unlock is authorized.
I can imagine building a (non-cloud) keystore that I trust for this task. I feel safe enough giving my cloud systems an SSH key that allows a connection to that keystore; the client isn’t trusted, but the channel is low-noise[2] and relatively safe from eavesdropping even if that private key is compromised.
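As a rough sketch of what the keystore end of that SSH channel might look like, assume the cloud system's public key is installed with an OpenSSH forced command, something like `restrict,command="/usr/local/bin/keystore-request" ssh-ed25519 AAAA...` in `authorized_keys`. The script path, the `unlock <host-id>` request grammar, and the host-id pattern are all hypothetical; the only real mechanism relied on is that sshd passes whatever the client asked to run in the `SSH_ORIGINAL_COMMAND` environment variable.

```python
import os
import re
import shlex

# Hypothetical request grammar: "unlock <host-id>"; nothing else is accepted.
HOST_ID = re.compile(r"[a-z0-9][a-z0-9-]{0,62}")


def parse_request(original_command):
    """Return the host id from an 'unlock <host-id>' request, or None."""
    try:
        words = shlex.split(original_command or "")
    except ValueError:  # unbalanced quotes and similar malformed input
        return None
    if len(words) != 2 or words[0] != "unlock":
        return None
    return words[1] if HOST_ID.fullmatch(words[1]) else None


if __name__ == "__main__":
    # sshd runs this script instead of the client's command and exposes
    # the client's requested command in SSH_ORIGINAL_COMMAND.
    host = parse_request(os.environ.get("SSH_ORIGINAL_COMMAND"))
    print(f"queued unlock request for {host}" if host else "rejected")
```

Because the untrusted client can only ever reach this one parser, anything that is not a well-formed request is dropped before it touches the rest of the keystore.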
I imagine service keys are handled by some system that is itself protected by this storage lock; this keystore would not manage service-level keys. For example, TLS certs might be generated with domain API keys which are protected in storage, but neither the API key nor the certificates themselves would be managed by this keystore.
The keystore would manage all hypervisor storage keys and their maintenance, including regular re-keying of (currently unlocked and therefore trusted) systems, rotation-aware backups of the underlying keys that can be unlocked by administrators even when the keystore service is down, and notifications when automatically managed keys are not behaving as expected.
One way I imagine unlocking might work is an interactive approval. On boot the cloud system connects to the keystore, requests access, and waits for someone to log in and click approve. That administrator provides a key that unlocks the relevant keystore, the keystore forwards the storage key to the waiting cloud systems, and then it forgets both that key and the administrator’s unlock key. I expect administrators would use a keychain tool to manage their individual unlock credentials.
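The approval flow above could be modelled in memory roughly like this. It is only a sketch under stated assumptions: the `Keystore` class, its method names, and the per-(host, admin) wrapped-key layout are hypothetical, and `toy_unwrap` is a stand-in XOR that is not real key wrapping and would need replacing with a vetted authenticated cipher.

```python
from dataclasses import dataclass, field


def toy_unwrap(wrapped: bytes, admin_key: bytes) -> bytes:
    """Placeholder only: XOR is NOT real key wrapping; use a vetted AEAD."""
    return bytes(w ^ k for w, k in zip(wrapped, admin_key))


@dataclass
class Keystore:
    # (host id, admin name) -> storage key wrapped under that admin's key
    wrapped: dict
    pending: set = field(default_factory=set)

    def request_unlock(self, host: str) -> None:
        """Called when a booting host connects; it then waits for approval."""
        if not any(h == host for h, _ in self.wrapped):
            raise KeyError(host)
        self.pending.add(host)

    def approve(self, host: str, admin: str, admin_key: bytes) -> bytes:
        """An administrator approves; return the storage key for forwarding."""
        if host not in self.pending:
            raise PermissionError(f"no pending request from {host}")
        self.pending.discard(host)
        # Neither admin_key nor the unwrapped key is stored on self.
        return toy_unwrap(self.wrapped[(host, admin)], admin_key)
```

Note that "forgetting" here only means the keystore keeps no reference; Python gives no guarantee that the bytes are wiped from memory, so a real implementation would need more care on that point.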
I can imagine a pre-authorized reboot mode, in which an administrator might pre-unlock storage in anticipation of a reboot. The keystore would temporarily cache an unlock key and automatically accept the next unlock request (potentially applying other policy/challenges)[3] to minimize downtime and allow asynchronous reboots.
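A minimal sketch of that temporary cache, assuming a single-use entry per host and the 90-second hold mentioned in the footnotes. The `PreAuthCache` name and its methods are hypothetical, and the clock is injectable only so that expiry can be exercised without waiting.

```python
import time


class PreAuthCache:
    """Hold a pre-authorized unlock key briefly and release it at most once."""

    def __init__(self, ttl_seconds=90.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # host id -> (expiry time, key)

    def pre_authorize(self, host, key):
        """An administrator pre-unlocks storage ahead of a planned reboot."""
        self._entries[host] = (self._clock() + self._ttl, key)

    def claim(self, host):
        """Return the key if unexpired, else None; the entry is always dropped."""
        entry = self._entries.pop(host, None)
        if entry is None:
            return None
        expires_at, key = entry
        return key if self._clock() < expires_at else None
```

Popping the entry before checking its expiry means a late or repeated request never finds a lingering key, so a missed window simply falls back to interactive approval.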
For that matter, I can imagine a permissive mode, in which non-production systems just get unlocked. Obviously such systems shouldn’t have access to private data[4], but for testing and development the keystore could just allow access from anyone who (authorized or otherwise) possesses the right SSH private key.
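Taken together, the three modes amount to a small policy decision per unlock request. The sketch below is one way that might be written; the `Mode` enum, the `decide` function, and the `holds_user_data` flag are hypothetical names, and the rule that user-data keys never qualify for permissive mode follows the footnote on that point.

```python
from enum import Enum


class Mode(Enum):
    INTERACTIVE = "interactive"
    PRE_AUTHORIZED = "pre-authorized"
    PERMISSIVE = "permissive"


def decide(mode, holds_user_data, has_cached_key):
    """Return 'release' to hand over the key now, or 'wait' for an admin."""
    # Permissive mode never applies to keys that protect user data.
    if mode is Mode.PERMISSIVE and not holds_user_data:
        return "release"
    # Pre-authorized mode only helps if an admin cached a key in time.
    if mode is Mode.PRE_AUTHORIZED and has_cached_key:
        return "release"
    # Everything else falls back to interactive approval.
    return "wait"
```

Defaulting to "wait" means a misconfigured or unknown combination degrades to the manual process rather than to disclosure.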
I have done zero design or coding work on what a keystore like this is actually built from, though I do have experience with relevant tools. It might be much harder than I’d like, but I feel fairly confident I can build something that feels safe enough for me: something that would at least require targeted attacks against me individually to compromise.
- [1] This has felt super relevant lately because my VM host is silently rebooting my hypervisor every day or two. They have promised that new hardware is coming and I will be transferred to it. In the meantime I am frequently offline until I notice the reboot and unlock the system. This is one reason I want redundant hosts.
- [2] A targeted attack to steal the SSH key would be a prerequisite to attacks via the keystore SSH channel (which would itself be protected via a forced local command).
- [3] There is lots of room here for additional challenges that might reduce the risk of unauthorized key disclosure. Time limits are an obvious one, e.g. temporary keys are only held for 90 seconds. Others might involve an ephemeral key rotation: before rebooting, the (trusted, post-unlock) cloud system could use the (administrator-unlocked) default unlock key to set an ephemeral key, to be used only during the next reboot and then revoked. The keystore could then provide that single-use key during the next reboot rather than the default key (and the next default-key unlock could remove any expired ephemeral keys without needing to remember them).
- [4] Keys for user data stores would not be eligible for permissive mode, to avoid disclosure via bad policy.