INDEX
Explanations
references to film, movies, and visual media
Negative Logits
all       -0.17
ito       -0.16
ald       -0.16
yor       -0.15
elling    -0.15
ell       -0.15
Ìī        -0.15
ellite    -0.15
ji        -0.14
elf       -0.14
Positive Logits
umin      0.21
abeth     0.20
ustr      0.19
houette   0.18
antro     0.18
inois     0.18
adelphia  0.17
aments    0.17
lá»ĩ      0.17
patrick   0.17
Activations Density 0.111%