Why is that an issue? Training the tokenizer seems much more straightforward than training the model, since it's based on the statistics of the input data. I guess it may take a while for massive datasets, but is calculating the frequencies really infeasible at larger scale?
I've trained tokenizers on medium-sized datasets (5+ GB of text, though that could be considered small or large depending on who you ask) and have always found training quite fast. As in, it takes a couple of minutes.
Maybe if we're talking terabytes it might not scale as well, but so far in my experience training tokenizers has never been an issue. It's training models that takes ages.
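For a sense of scale, training is essentially frequency counting plus learning merge rules. A minimal sketch with the Hugging Face `tokenizers` library (the corpus path and vocab size here are placeholder assumptions, not anything specific) is only a few lines and typically finishes in minutes on a few GB of text:

    # Minimal BPE training sketch using the Hugging Face `tokenizers` library.
    # "corpus.txt" and the vocab size are placeholder assumptions.
    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.pre_tokenizers import Whitespace
    from tokenizers.trainers import BpeTrainer

    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()

    trainer = BpeTrainer(vocab_size=32000, special_tokens=["[UNK]"])
    tokenizer.train(files=["corpus.txt"], trainer=trainer)  # count frequencies, learn merges
    tokenizer.save("tokenizer.json")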
There were days when you needed to enter developer mode to turn on PIN unlock, and Bluetooth device unlock came later. I remember the pain of entering my very long password on every boot. Horrible UX.
I interacted with a recent (a couple of months old) Chrome OS Flex install on a regular old PC laptop belonging to a senior I know, whom I was helping set up with a basic computer, and PIN unlock didn't work on first boot. Entering a password, especially a strong one, was a huge problem for him. I set up a PIN based on his birthday, which never worked on a fresh boot: it only worked if you logged out of the account (but didn't turn off the computer). If you start the machine fresh, you can't use the PIN at all. I ended up setting his Google password to his birthday with dots and some letters, e.g. Qw25.12.1935, which Google allowed and which worked for him. He enters his password letter by letter, and for him it's worse UX than before, when he had Windows XP that just booted and had Chrome installed (outdated, with no option to upgrade). But I convinced him this new way of doing things is better. At least it boots almost instantly, which he likes.
And on top of that, the built-in Bluetooth adapter on that very laptop won't connect to his Android smartphone. The Bluetooth module itself works, but for some reason not with the software. Brief googling suggested it's easier to buy a USB Bluetooth module and try that. Which I did, but I haven't tested it yet, as he lives quite far away from me. For now he uses the laptop somehow.
Systemd-homed is portable.
Still, "Reprovision" the broken userspace for the user.
Local k8s like MicroShift that does container-selinux the way RHEL / Fedora do, with GNOME and Waydroid, would be cool to have for the kids.
Podman Desktop (roughly the Docker Desktop equivalent) does k8s now.
K8s defaults to blocking containers that run as root now, and there's no mounting the Docker socket or running --privileged w/ k8s either. Gitea + DroneCI/act/ci_runner w/ rootless containers. gVisor is considered good enough for shared server workloads.
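For reference, a minimal sketch of a pod that passes that kind of non-root enforcement, using the official `kubernetes` Python client. The names, image, and UID are placeholder assumptions, and this assumes a cluster enforcing something like the restricted Pod Security Standard:

    # Sketch: pod spec that satisfies non-root enforcement.
    # Names, image, and UID are placeholders, not a real workload.
    from kubernetes import client, config

    config.load_kube_config()  # uses your local kubeconfig

    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="rootless-demo"),
        spec=client.V1PodSpec(
            security_context=client.V1PodSecurityContext(
                run_as_non_root=True,   # admission rejects root containers
                run_as_user=1000,
            ),
            containers=[
                client.V1Container(
                    name="ci-step",
                    image="docker.io/library/alpine:3.19",
                    command=["sleep", "3600"],
                    security_context=client.V1SecurityContext(
                        privileged=False,                  # no --privileged equivalent
                        allow_privilege_escalation=False,
                    ),
                )
            ],
        ),
    )

    client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)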
Repo2docker + caching is probably close to "kid-proof" or "reproducible".
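A sketch of what that looks like driven from Python, assuming the repo2docker CLI is installed (the repo URL and image name are placeholders; rebuilding the same repo is what reuses Docker's layer cache):

    # Sketch: prebuild a cached, reproducible image with repo2docker.
    # The repo URL and image name are placeholder assumptions.
    import subprocess

    repo = "https://github.com/example/class-repo"  # placeholder
    subprocess.run(
        [
            "jupyter-repo2docker",
            "--no-run",                       # build only, don't launch Jupyter
            "--image-name", "class-env:latest",
            repo,
        ],
        check=True,
    )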
VS Code has "devcontainer.json".
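A minimal one, written out from Python purely for illustration (the name, image, and port are placeholder assumptions; the full schema is in the Dev Containers spec):

    # Sketch: write a minimal .devcontainer/devcontainer.json.
    # Name, image, and port are placeholder assumptions.
    import json
    from pathlib import Path

    devcontainer = {
        "name": "kids-env",                                       # placeholder
        "image": "mcr.microsoft.com/devcontainers/python:3.11",   # placeholder
        "forwardPorts": [8888],
    }

    Path(".devcontainer").mkdir(exist_ok=True)
    Path(".devcontainer/devcontainer.json").write_text(json.dumps(devcontainer, indent=2))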
The SciPy stacks ( https://jupyter-docker-stacks.readthedocs.io/en/latest/using... ) and Kaggle/docker-python (Google) take how many GB to run locally? And that's for users under 13, for whom we can't afford cloud shells with SSH (Colab with SSH, JupyterHub (TLJH w/ k8s)) either.