INDEX
Explanations
phrases related to personal experiences, opinions, and actions in various situations
Negative Logits
selves          -0.87
unison          -0.82
hub             -0.79
respective      -0.69
respectively    -0.63
merce           -0.58
Authors         -0.58
ourselves       -0.57
mination        -0.57
Helpful         -0.57
Positive Logits
himself         1.77
Himself         1.19
his             1.15
herself         1.01
charisma        0.80
personally      0.80
subordinates    0.79
persona         0.76
wife            0.76
His             0.74
Activations Density: 4.563%