Explanations
names and titles mentioning specific individuals or institutions, particularly focusing on surnames
words related to names or identities
Negative Logits
steep            -0.57
reality          -0.56
conservative     -0.56
CODE             -0.55
scra             -0.54
FUL              -0.54
broadcasters     -0.54
conservatives    -0.54
PLAY             -0.53
compassionate    -0.53
Positive Logits
i                1.78
a                1.58
icz              1.52
o                1.49
e                1.33
aan              1.26
oa               1.24
ski              1.23
xual             1.22
ti               1.22
Activations Density: 0.298%