Explanations
references to individuals and their identities
Negative Logits
spl        -0.15
kowski     -0.14
Spl        -0.14
itemprop   -0.14
opy        -0.14
_physical  -0.14
Py         -0.14
essor      -0.14
claims     -0.14
obil       -0.13
Positive Logits
fav        0.16
GLOSS      0.15
atz        0.15
ipers      0.14
323        0.14
oslav      0.14
ghan       0.13
anchors    0.13
trained    0.13
biz        0.13
Activation Density: 0.036%
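The density figure is easier to read with a concrete calculation. The sketch below assumes that activation density means the fraction of token positions on which the feature's activation is strictly positive. This is a common convention for sparse-autoencoder dashboards, but the dashboard itself does not state its definition. The helper names `activation_density` and `as_percent` are hypothetical, and the counts in the usage example are illustrative only, not taken from this feature's data.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Hypothetical helper: fraction of token positions whose activation exceeds `threshold`.

    Assumes density = (# tokens with activation > threshold) / (# tokens total).
    """
    if len(activations) == 0:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


def as_percent(fraction: float, digits: int = 3) -> str:
    """Format a fraction as a percentage string, e.g. 0.00036 -> '0.036%'."""
    return f"{fraction * 100:.{digits}f}%"


if __name__ == "__main__":
    # Illustrative only: 36 active positions out of 100,000 tokens.
    acts = [0.0] * 99_964 + [1.0] * 36
    density = activation_density(acts)
    print(as_percent(density))
```

Under this assumed definition, 0.036% works out to roughly 1 active token in every 2,800, which is why such a feature appears only rarely in sampled text.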