INDEX
Explanations
pronouns and references to personal perspectives in discussions
Negative Logits
apid      -0.15
abler     -0.15
OVE       -0.15
/repos    -0.15
erti      -0.14
adir      -0.14
άκ        -0.14
اخ        -0.14
Å¡tÄĽ     -0.14
apus      -0.13
Positive Logits
think     0.64
thinks    0.55
Think     0.52
think     0.52
Think     0.49
feel      0.43
THINK     0.42
believe   0.41
feels     0.40
认为      0.38
Activations Density: 0.297%
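
As a rough illustration of how a density figure like this can be read, the sketch below treats activation density as the fraction of sampled token positions on which the feature fires. This definition, the function name `activation_density`, the `threshold` parameter, and the toy data are all assumptions made for illustration; the page itself does not state how its figure was computed.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of sampled tokens on which the
    feature is active. The source page does not confirm this definition.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical toy data: 3 active positions among 1000 sampled tokens.
toy_activations = [0.0] * 997 + [1.2, 0.4, 2.5]
density = activation_density(toy_activations)
print(f"{density:.3%}")
```

Under this reading, a density of 0.297% would mean the feature is active on roughly 3 of every 1000 sampled tokens, which fits a feature tied to a narrow set of words such as "think", "feel", and "believe".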