They are just trying to find a way to plausibly declare successful removal of copyrighted and/or illegal material without discarding weights.
GPT-4 class models reportedly cost $10-100m to train, and that's too much to throw away over Harry Potter or Russian child porn scrapes that the model could later reproduce verbatim, despite them representing <0.1ppb or whatever minuscule part of the dataset.