Explanations
words associated with negative conditions or outcomes
phrases related to negative evaluations or descriptions of situations
Negative Logits
shutter    -0.65
thouse     -0.64
zees       -0.62
Rouge      -0.61
crumble    -0.61
Sharing    -0.60
lots       -0.60
rapists    -0.58
Closed     -0.58
Delete     -0.58
Positive Logits
gotten     1.34
fitting    1.22
treatment  1.19
fortune    1.12
intent     1.06
founded    1.06
informed   1.05
defined    1.05
effects    1.02
equipped   1.02
Activations Density: 0.043%