INDEX
Explanations
References to specific names or terms associated with identity, particularly personal or cultural identification.
Negative Logits
xeb      -0.17
ivan     -0.17
icy      -0.15
overn    -0.15
483      -0.15
winds    -0.14
aar      -0.14
ewn      -0.14
buie     -0.14
haps     -0.14
Positive Logits
Fi       0.18
patrick  0.16
ancial   0.16
tering   0.16
orent    0.15
gerald   0.15
afa      0.15
zelf     0.15
kers     0.15
OLL      0.14
Activations Density: 0.031%