INDEX
Explanations
positive or negative evaluative words
positive and negative evaluations or judgments about situations
Negative Logits
aneers      -0.76
ngth        -0.76
arij        -0.76
ividual     -0.71
assemb      -0.69
iries       -0.68
velop       -0.67
rive        -0.67
mop         -0.65
icipated    -0.64
Positive Logits
considering  1.09
because      0.95
ðŁĻĤ         0.79
!            0.79
eh           0.78
reasoning    0.77
soType       0.75
news         0.75
because      0.75
advice       0.74
Activations Density 0.160%
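
The `ðŁĻĤ` entry in the positive-logits list looks like a byte-level BPE token string rather than ordinary text. The page does not name the tokenizer, so the following is only a sketch under the assumption that it uses the GPT-2 style byte-to-character table; `bytes_to_unicode` and `decode_token` are illustrative helper names, not part of this page or of any specific library.

```python
def bytes_to_unicode():
    """Byte -> printable-character table in the GPT-2 byte-level BPE style.

    Assumption: the tokenizer behind this dashboard uses this mapping.
    """
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    offset = 0
    # Bytes without a printable form are shifted to code points 256 and up.
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + offset)
            offset += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


# Inverse table: displayed character -> original byte value.
UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_token(token: str) -> str:
    """Turn a displayed token string back into the text it encodes."""
    raw = bytes(UNICODE_TO_BYTE[ch] for ch in token)
    # A single token may hold only part of a UTF-8 sequence, so do not raise.
    return raw.decode("utf-8", errors="replace")


print(decode_token("ðŁĻĤ"))  # prints 🙂 (U+1F642) under the stated assumption
```

Under that assumption the token is the slightly-smiling-face emoji. Plain ASCII entries such as `because` decode to themselves, so only the non-ASCII entries need this step.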