Explanations
names and references to specific individuals, particularly in the context of movies or public figures
Negative Logits
wand         -0.17
703          -0.15
zin          -0.14
Achilles     -0.14
ull          -0.14
surprised    -0.14
yang         -0.14
11           -0.14
pip          -0.14
afa          -0.14
Positive Logits
argin          0.17
emez           0.16
prites         0.16
sami           0.15
abor           0.15
ustos          0.15
bette          0.15
PROCUREMENT    0.15
ampa           0.14
sam            0.14
Activations Density: 0.051%