
You seem to be ignoring the potential to use this to improve the performance of LLMs. If you can unlearn wrong answers, you can check the model's outputs for correctness with any scoring mechanism, instead of scoring for token-for-token similarity to a prescribed answer.
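A minimal sketch of the idea, with everything hypothetical: a toy correctness scorer stands in for an arbitrary verifier, and wrong answers get a negative training weight to mark them for unlearning (e.g. a gradient-ascent step), rather than being compared token by token against a reference answer.

```python
# Hypothetical sketch: score sampled answers for correctness, then flag
# wrong ones for "unlearning" via a negative loss weight. The scorer and
# the arithmetic questions are stand-ins for any real verifier.

def score_answer(question: str, answer: str) -> float:
    """Toy correctness check: evaluate the arithmetic question directly.
    In practice this could be unit tests, a checker model, etc."""
    expected = str(eval(question))  # assumed: questions are arithmetic
    return 1.0 if answer.strip() == expected else 0.0

def loss_weights(samples):
    """Map each (question, answer) pair to a training weight:
    +1.0 reinforces correct answers, -1.0 marks wrong ones for unlearning."""
    return [1.0 if score_answer(q, a) == 1.0 else -1.0 for q, a in samples]

samples = [("2+2", "4"), ("3*3", "10"), ("5-1", "4")]
print(loss_weights(samples))  # [1.0, -1.0, 1.0]
```

The point is that the scorer only has to decide right vs. wrong, so any verifier works; it never needs the exact token sequence of a prescribed answer.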

