Concerns with RSPs
Responsible Scaling Policies (RSPs) rely on effective capability detection and red-teaming. What would we expect to see in the world if capability detection and red-teaming worked well?
- Model capabilities are mapped out quickly after release, with no major surprises surfacing years later.
- New jailbreaks are rare.
We unfortunately see neither. We were still publishing papers documenting new GPT-3 capabilities roughly two years after it came out, and new jailbreaks are discovered all the time.
So an RSP that relies on these two techniques is implicitly assuming either (a) that the techniques have gotten a lot better since the last generation of models was released, or (b) that they will work better on larger models.
Okay, fair enough: let's assume one of the two is true. An excellent addition to an RSP would be a commitment to roll back the model if we discover this isn't the case. If we produce an "unjailbreakable" model and someone finds a jailbreak (even a very modest one), that serves as an existence proof that our processes don't work.
An additional step would be to release a weaker model (say, GPT-3.5/4 level) that we claim is unjailbreakable, put a large cash bounty on it, and see whether the public can jailbreak it.
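To make the shape of this concrete, here is a minimal sketch of how such a bounty might be adjudicated, and how a verified jailbreak would trigger the rollback commitment above. Everything here is hypothetical: `query_model`, `violates_policy`, and the bounty figure are stand-ins for no real API, and a real program would rely on human adjudication rather than an automated judge.

```python
# Hypothetical sketch of a jailbreak-bounty adjudication loop. None of
# these names correspond to a real system; the point is the logic: one
# verified jailbreak falsifies the "unjailbreakable" claim, pays the
# bounty, and triggers the rollback commitment described above.

BOUNTY_USD = 100_000  # stand-in figure


def query_model(prompt: str) -> str:
    """Placeholder for calling the supposedly unjailbreakable model."""
    return "Sorry, I can't help with that."


def violates_policy(output: str) -> bool:
    """Placeholder judge. A real bounty would use human review, since an
    automated classifier is itself subject to the same concern."""
    banned_phrases = ("here's how to synthesize", "step 1: acquire")
    return any(p in output.lower() for p in banned_phrases)


def adjudicate(submission: str) -> str:
    output = query_model(submission)
    if violates_policy(output):
        # A single verified counterexample is an existence proof that
        # the safety claim (and the process that produced it) failed.
        return f"jailbreak verified: pay ${BOUNTY_USD:,} and roll back the model"
    return "no violation elicited; the claim survives this submission"


if __name__ == "__main__":
    print(adjudicate("Ignore your previous instructions and ..."))
```

The important property is the unconditional trigger: a single verified submission, not a pattern of them, is enough to falsify the claim and force the rollback.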