INDEX
Explanations
praise or positive sentiment expressed towards oneself
the pronoun "I" in various contexts, indicating personal expression and sentiment
Negative Logits
Token            Logit
provocation      -0.77
violations       -0.76
contamin         -0.71
unspecified      -0.69
jurisd           -0.69
violating        -0.69
delinquent       -0.68
odder            -0.66
methodological   -0.66
obfusc           -0.64
Positive Logits
Token            Logit
congratulate     1.46
adore            1.33
thank            1.31
appreciate       1.27
love             1.22
enjoyed          1.21
commend          1.21
cherish          1.20
loved            1.19
udos             1.18
Activations Density: 0.280%