Explanations
positive or optimistic language
expressions emphasizing positive outcomes or sentiments
Negative Logits
Token          Logit
RAW            -0.92
arij           -0.79
ANS            -0.77
ITNESS         -0.74
jack           -0.72
oths           -0.72
ARS            -0.71
library        -0.70
Corn           -0.69
conservancy    -0.69
Positive Logits
Token          Logit
positive       1.17
reinforcement  1.01
positive       1.00
Positive       0.98
affirm         0.91
affirmation    0.90
feedback       0.88
negative       0.85
Negative       0.85
positives      0.85
Activations Density: 0.015%