Explanations
phrases that express comparisons or contrasts
Negative Logits
UNUSED    -0.17
Anc       -0.16
Orc       -0.15
ople      -0.15
croft     -0.14
ooky      -0.14
allee     -0.14
397       -0.14
rome      -0.13
osp       -0.13
Positive Logits
artment   0.15
ksen      0.15
aycast    0.14
æ¿        0.14
NGTH      0.14
lements   0.14
olib      0.14
igar      0.14
å°Ķ       0.14
olf       0.13
Activations Density 0.119%