INDEX
Explanations
mentions of torture and interrogation practices
Negative Logits
intrig         -0.16
_blocking      -0.15
.blocks        -0.15
alker          -0.15
åĩĿ            -0.14
ois            -0.14
stalking       -0.14
Griff          -0.14
Traversal      -0.14
Cli            -0.14
Positive Logits
torture         0.44
Tort            0.41
TORT            0.38
tort            0.34
interrogation   0.28
tortured        0.28
interrog        0.27
detainees       0.20
detain          0.20
techniques      0.19
Activations Density 0.021%