INDEX
Explanations
phrases that refer to second-person pronouns or direct address
Negative Logits
resse   -0.16
arten   -0.15
asz     -0.14
↵↵      -0.14
lest    -0.14
erness  -0.13
peně    -0.13
-cart   -0.13
.pb     -0.13
tics    -0.13
Positive Logits
know    0.56
Know    0.46
know    0.42
Know    0.41
knows   0.41
KNOW    0.34
知道    0.31
-know   0.28
зна     0.28
知      0.27
Activations Density 0.079%