Explanations
terms related to negative events or actions
Negative Logits
ellen        -0.64
eatures      -0.63
Different    -0.61
ptin         -0.60
cription     -0.60
cylinders    -0.60
adjusting    -0.59
meric        -0.59
ynthesis     -0.57
ready        -0.57
Positive Logits
embarrassment    1.15
endanger         1.08
harm             1.04
angering         1.04
jeopard          1.03
inconvenience    0.97
fate             0.96
havoc            0.95
tragedies        0.95
consequences     0.93
Activations Density: 0.538%