INDEX
Explanations
words related to negative attributes or consequences
negative phrases related to health or unfavorable conditions
Negative Logits
Sharing     -0.66
rings       -0.63
Shooter     -0.62
Nope        -0.61
endlessly   -0.61
illon       -0.60
Rouge       -0.60
£ı          -0.59
Noir        -0.59
ulhu        -0.58
Positive Logits
gotten      1.27
fitting     1.15
equipped    1.13
founded     1.10
defined     1.10
treatment   1.06
informed    1.04
fortune     0.98
intent      0.96
treated     0.96
Activations Density: 0.028%