INDEX
Explanations
recommendations for safety and prevention measures
Negative Logits
ego            -0.15
anz            -0.15
scal           -0.14
porto          -0.14
erland         -0.14
коз            -0.14
PixelFormat    -0.14
maal           -0.14
Tunnel         -0.14
ancybox        -0.14
Positive Logits
never          0.20
avoided        0.20
avoid          0.19
Avoid          0.19
NEVER          0.19
avoiding       0.19
familiar       0.18
avoidance      0.17
avoids         0.17
always         0.17
Activations Density: 0.078%