Explanations
pronouns and verb phrases that indicate actions taken by people
Negative Logits
Token      Logit
ãĥ©ãĤ¹     -0.07
æIJ¬       -0.07
erner      -0.06
anj        -0.06
ç´         -0.06
lech       -0.06
ESC        -0.06
aus        -0.06
astle      -0.06
ÙĪØ§Ùĩ     -0.06
Positive Logits
Token      Logit
doing      0.11
done       0.10
Doing      0.09
Doing      0.09
best       0.09
doing      0.08
best       0.08
always     0.07
Done       0.07
did        0.07
Activations Density 0.011%
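
The page does not state how the density figure is computed. One common reading is the share of token positions on which the feature's activation is non-zero, expressed as a percentage. The sketch below illustrates that reading only; the function name `activation_density` and the `threshold` parameter are hypothetical and do not come from the page.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of positions whose activation exceeds `threshold`.

    Assumption: "density" means the fraction of token positions on which
    the feature fires. The source page does not confirm this definition.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    # Count positions where the feature is considered active.
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Illustrative data only: 11 active positions out of 100,000 tokens.
sample = [0.0] * 99_989 + [1.5] * 11
density = activation_density(sample)
print(f"{density:.3f}%")
```

Under this reading, a density of 0.011% would correspond to the feature firing on roughly 11 of every 100,000 tokens.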