Explanations
phrases related to danger or risk
Negative Logits
enance    -0.19
CALE      -0.18
trá»Ŀi    -0.17
arity     -0.15
noqa      -0.15
NEY       -0.15
å±        -0.15
tle       -0.15
boom      -0.14
verts     -0.14
Positive Logits
-danger      0.20
ously        0.18
dangerous    0.17
unsafe       0.17
unsafe       0.16
dangers      0.15
danger       0.15
ä¸Ķ          0.15
ous          0.14
Dangerous    0.14
Activations Density 0.031%