INDEX
Explanations
phrases or words that convey positive attributes or actions
terminology related to positive outcomes or effects
Negative Logits
appings        -0.81
ptin           -0.80
oths           -0.79
puter          -0.79
adr            -0.78
conservancy    -0.76
neys           -0.74
ngth           -0.73
RAW            -0.73
arers          -0.72
Positive Logits
reinforcement  1.05
affirm         0.92
affirmation    0.90
feedback       0.84
outcome        0.83
vib            0.83
positive       0.80
portrayal      0.79
outlook        0.77
appraisal      0.76
Activations Density: 0.028%