Explanations
references to overarching concepts or themes
Negative Logits
zan       -0.17
ulse      -0.16
ëĭµ       -0.16
het       -0.15
jen       -0.15
ists      -0.15
light     -0.14
quelle    -0.14
coni      -0.14
ube       -0.14
Positive Logits
thing     0.30
heart     0.28
entire    0.25
-hearted  0.23
ench      0.21
thing     0.20
-sale     0.20
Thing     0.20
/part     0.19
meal      0.19
Activations Density 0.020%