Explanations
references to individuals and groups in a variety of contexts
Negative Logits
Token       Logit
itself      -0.18
taire       -0.18
cliffe      -0.18
onaut       -0.17
imary       -0.16
liš         -0.15
ress        -0.15
resse       -0.15
arah        -0.15
ï           -0.15
Positive Logits
Token       Logit
/us         0.35
/her        0.34
zelf        0.28
-même       0.24
SELF        0.23
/th         0.23
atically    0.22
self        0.21
/we         0.21
etics       0.19
Activations Density: 0.150%