INDEX
Explanations
words related to specific names or identities
Negative Logits
oses         -0.76
istics       -0.64
astically    -0.59
ournal       -0.58
ANK          -0.58
ials         -0.57
iary         -0.56
astic        -0.56
______       -0.55
iates        -0.55
Positive Logits
lla          1.25
llan         1.17
lli          1.07
llers        1.05
lling        1.05
lda          1.04
ll           1.01
ller         1.00
tta          0.97
hart         0.96
Activations Density: 0.125%