One approach being considered is "AI Safety Via Debate"[0], which hopes to prevent deception by carefully constructing games in which a superhuman agent's best strategy is honesty. Note that this is the goal; much work to be done!
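For intuition, here is a toy sketch of the core idea as I understand it (my own contrived example, not the actual protocol from the paper): the judge can only afford to verify one small fact, but to defend a false answer the liar has to misstate some fact, and the honest debater can always point the judge at exactly that one.

```python
# Toy illustration (mine, not the protocol from the paper): the judge can only
# verify ONE primitive fact, but to defend a false total the liar must misstate
# at least one addend, and the honest debater can point the judge at exactly
# that addend. So the lying strategy always loses the final check.

FACTS = {"a": 2, "b": 3, "c": 5}           # ground truth, expensive to check in full
TRUE_TOTAL = sum(FACTS.values())           # question under debate: "what is a+b+c?"

def judge_checks(fact: str, claimed_value: int) -> bool:
    """Judge verifies a single primitive fact against ground truth."""
    return FACTS[fact] == claimed_value

def play(false_total: int) -> str:
    # Liar must produce a breakdown consistent with the false total.
    liar_breakdown = dict(FACTS)
    liar_breakdown["c"] += false_total - TRUE_TOTAL
    # Honest debater names the first addend that disagrees with ground truth;
    # the judge checks only that one.
    disputed = next(k for k in FACTS if liar_breakdown[k] != FACTS[k])
    return "honest wins" if not judge_checks(disputed, liar_breakdown[disputed]) else "liar wins"

print(play(false_total=12))   # -> honest wins
```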
Forget AIs - we need this for humans, in the design of legal and administrative systems.
I have wondered whether incentive-based design could be formalized into a workable field, one that ensures even a complete sociopath would find acting in a beneficial way to be their best option.
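That formalized field more or less exists under the name mechanism design. The textbook result is the Vickrey (second-price) auction, where even a purely self-interested bidder does best by reporting their true value. A tiny sketch (my own toy code, just for illustration):

```python
# Toy Vickrey (second-price) auction: the highest bidder wins but pays the
# SECOND-highest bid, so reporting your true value is a dominant strategy
# even for a purely selfish bidder.

def vickrey(bids: dict[str, float]) -> tuple[str, float]:
    """Return (winner, price). `bids` maps bidder name -> reported bid."""
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner = ranked[0][0]
    price = ranked[1][1]          # winner pays the second-highest bid
    return winner, price

def utility(true_value: float, my_bid: float, rival_bid: float) -> float:
    """Payoff to a bidder whose true value is `true_value`."""
    winner, price = vickrey({"me": my_bid, "rival": rival_bid})
    return true_value - price if winner == "me" else 0.0

# My true value is 10. Against any rival bid, shading or inflating my bid
# never beats simply bidding 10.
for rival in (6.0, 9.0, 12.0):
    print(rival,
          utility(10.0, 10.0, rival),   # honest
          utility(10.0, 7.0, rival),    # shaded
          utility(10.0, 14.0, rival))   # inflated
```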
Do we know game theory well enough to structure such games with no theoretical way for an AI to sneak out? I doubt it, but even if we did, funny things start happening when theory meets practice. I recall the example of quantum entanglement, which (I read) enables communication that cannot be spied upon without the intended parties knowing. Except (I also read) it was attacked at the interface between the quantum and classical domains. The world is complex, and a superhuman AI is by definition better equipped to find loopholes than humans are.
Unfortunately, being dishonest or evil is just one example. Arguably the AI could develop new classes of deviancy, abuse, or maladaptation that we haven't conceptualized yet. If we supersize the ability, surely we supersize the problems.
It leads to a scary question: what does a superhuman AI really want?
To be fair, an HFT agent technically counts as superhuman AI. Wanting isn't something that applies to actual AI yet, and there is no special sauce that indicates advancement beyond neuron scale. Barring explicit directives, and assuming it is "grown", what it wants can be utterly peripheral to rationality and is likely based on what it is taught, intentionally or not. Look at how society preaches honesty from a young age and then starts teaching lying again by rewarding it. The real lesson is the Spartan one about stealing: don't get caught. That may not be intended, but it is the result.
> which hopes to prevent deception by carefully constructing games in which a superhuman agent's best strategy is honesty
I'd be very hesitant to assume that an agent cannot learn under which circumstances it should be honest to gain a benefit without putting any innate value on honesty. A human agent is more than capable of reasoning like that, let alone a superhuman one.
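To make that concrete (with made-up numbers of my own): an agent that places no value on honesty just compares expected payoffs, so any game we construct only buys honest behavior while detection probability and penalties stay high enough.

```python
# Back-of-the-envelope sketch (my own numbers): an agent with no innate value
# on honesty lies exactly when  gain_from_lie > p_caught * penalty.

def best_move(gain_from_lie: float, p_caught: float, penalty: float) -> str:
    expected_gain_from_lying = gain_from_lie - p_caught * penalty
    return "lie" if expected_gain_from_lying > 0 else "be honest"

print(best_move(gain_from_lie=5, p_caught=0.9, penalty=10))   # -> be honest
print(best_move(gain_from_lie=5, p_caught=0.1, penalty=10))   # -> lie
```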
[0] https://arxiv.org/abs/1805.00899