INDEX
Explanations
unsafe or dangerous situations or conditions
references to safety and conditions deemed unsafe
Negative Logits
frey      -0.86
braska    -0.84
orah      -0.83
yss       -0.82
issance   -0.81
ership    -0.80
hunt      -0.80
gdala     -0.79
cence     -0.78
anche     -0.77
Positive Logits
unsafe      1.10
safe        0.73
adolesc     0.73
Ez          0.68
hazardous   0.68
Msg         0.64
Safe        0.64
paste       0.63
NESS        0.63
risky       0.62
Activations Density: 0.007%