INDEX
Explanations
phrases related to threats or dangers
Negative Logits
rint        -0.15
ehler       -0.15
amin        -0.15
Worm        -0.15
iao         -0.15
arent       -0.15
agues       -0.14
.gstatic    -0.14
olest       -0.14
ledon       -0.14
Positive Logits
dangers     0.19
danger      0.18
baar        0.16
çĬ¶         0.15
lug         0.15
elm         0.15
132         0.15
ources      0.15
ably        0.14
Danger      0.14
Activations Density: 0.029%