Explanations
tokens that refer to individuals or personal identities
Negative Logits
Token          Logit
agar           -0.16
ts             -0.16
usted          -0.15
logan          -0.15
tee            -0.15
Individuals    -0.15
-ie            -0.15
tees           -0.14
iais           -0.14
gars           -0.14
Positive Logits
Token          Logit
nal            0.33
nel            0.31
ality          0.29
ified          0.27
ajes           0.25
aggi           0.25
ae             0.24
nell           0.24
nels           0.24
ifying         0.23
Activations Density: 0.007%