INDEX
Explanations
phrases related to positivity and favorable outcomes
references to positive outcomes or sentiments
Negative Logits
Token      Logit
oths       -0.85
ngth       -0.82
atum       -0.81
puter      -0.81
ptin       -0.79
appings    -0.76
adr        -0.73
alian      -0.72
hid        -0.72
neys       -0.72
Positive Logits
Token          Logit
reinforcement  1.02
outlook        0.96
outcome        0.95
vib            0.94
feedback       0.94
affirm         0.89
affirmation    0.86
appraisal      0.85
outcomes       0.84
attitude       0.84
Activations Density: 0.041%