INDEX
Explanations
references to danger or hazardous situations
Negative Logits
enance    -0.16
noqa      -0.16
erez      -0.16
å±        -0.15
arity     -0.15
CALE      -0.15
NEY       -0.15
éϵ        -0.14
ting      -0.14
verts     -0.14
Positive Logits
-danger    0.20
dangerous  0.17
unsafe     0.16
ously      0.16
danger     0.15
vous       0.15
unsafe     0.15
dangers    0.14
åı£        0.14
Dangerous  0.14
Activations Density 0.037%
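
The density figure can be read as the fraction of sampled token positions on which the feature fires. The sketch below illustrates that reading under stated assumptions: the name `activation_density`, the zero threshold, and the toy activation list are all hypothetical and are not taken from the dashboard, which may compute its statistic differently.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of token positions where the
    feature is active (strictly above the threshold).
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 10,000 token positions, 4 of which activate the feature.
acts = [0.0] * 10_000
for i in (12, 345, 6789, 9001):
    acts[i] = 1.5

density = activation_density(acts)
print(f"{density:.3%}")
```

A density this low would mean the feature is sparse: it fires on only a handful of positions per ten thousand tokens, which fits a feature tied to a narrow topic such as danger.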