Explanations
words related to unity and collective action
Negative Logits
himself      -0.93
herself      -0.78
プ           -0.60
lucrative    -0.57
reportedly   -0.57
disliked     -0.55
frowned      -0.55
revoked      -0.54
his          -0.53
infuri       -0.52
Positive Logits
ourselves    1.83
our          1.71
Our          1.70
Our          1.64
we           1.58
We           1.55
We           1.54
OUR          1.53
we           1.44
ours         1.32
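The two logit lists above rank vocabulary tokens by how strongly this feature pushes their logits up or down. A minimal sketch of how such lists can be produced, assuming (as is common for sparse-autoencoder dashboards, though not stated on this page) that each score is the dot product of the feature's decoder vector with a token's unembedding vector. The names `decoder_vec` and `unembed` are hypothetical, and the toy numbers are invented for illustration, not taken from the lists above.

```python
from typing import Dict, List, Tuple

def logit_contributions(decoder_vec: List[float],
                        unembed: Dict[str, List[float]]) -> Dict[str, float]:
    """Score each token by the dot product of the feature direction
    with that token's (hypothetical) unembedding vector."""
    return {tok: sum(d * u for d, u in zip(decoder_vec, vec))
            for tok, vec in unembed.items()}

def top_and_bottom(scores: Dict[str, float],
                   k: int) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Return the k highest-scoring and k lowest-scoring (token, score) pairs."""
    positive = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
    negative = sorted(scores.items(), key=lambda kv: kv[1])[:k]
    return positive, negative

# Toy 2-dimensional example with made-up values.
decoder_vec = [1.0, -0.5]
unembed = {
    "our": [1.5, -0.2],
    "we": [1.2, 0.0],
    "himself": [-0.6, 0.7],
    "herself": [-0.5, 0.6],
}
scores = logit_contributions(decoder_vec, unembed)
positive, negative = top_and_bottom(scores, k=2)
print("positive:", positive)
print("negative:", negative)
```

In a real model the same ranking would run over the full vocabulary, so sorting the whole score table is simple but not the cheapest option; a partial selection such as `heapq.nlargest` would avoid the full sort.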
Activation density: 0.937%
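Activation density is usually reported as the share of sampled tokens on which a feature's activation is above zero, so 0.937% would mean the feature fires on roughly 9 to 10 tokens per thousand. A minimal sketch, assuming that definition with a threshold of 0; the page does not state its threshold or sample size, and the sample below is hypothetical.

```python
from typing import Sequence

def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the percentage of activations strictly above `threshold`."""
    if not activations:
        raise ValueError("need at least one activation")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)

# Hypothetical sample: 10,000 tokens, 94 of which give a positive activation.
sample = [0.0] * 9906 + [0.8] * 94
print(f"{activation_density(sample):.2f}%")  # 0.94%
```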