INDEX
Explanations
references to safety and danger
Negative Logits
omu      -0.16
HAM      -0.15
wheel    -0.15
¶Į       -0.15
ETCH     -0.15
ham      -0.15
격       -0.14
İM       -0.14
Rotor    -0.14
RATE     -0.13
Positive Logits
safety       0.25
unsafe       0.24
Safety       0.23
unsafe       0.23
risks        0.22
afety        0.21
safer        0.21
risky        0.21
Safety       0.21
dangerous    0.21
Activations Density 0.083%