Explanations
words related to danger or threats
phrases related to various forms of risk or danger
Negative Logits
elf            -0.65
MQ             -0.64
actionGroup    -0.61
ovy            -0.61
zos            -0.60
chet           -0.60
olon           -0.60
ILA            -0.58
oun            -0.58
olver          -0.56
Positive Logits
financially    0.88
of             0.78
angering       0.77
because        0.73
bh             0.71
lest           0.69
posed          0.69
unless         0.66
endanger       0.65
from           0.65
Activations Density: 0.065%