INDEX
Explanations
references to political figures and their actions
Negative Logits
ุà¹Ī     -0.15
ulle    -0.15
ella    -0.15
Liv     -0.14
ope     -0.14
stack   -0.14
stack   -0.14
aroo    -0.14
Goods   -0.14
run     -0.13
Positive Logits
èħ      0.16
รร      0.15
/by     0.15
İli     0.15
duk     0.15
é¬      0.15
cov     0.14
Giang   0.14
ppers   0.14
CID     0.14
Activations Density 0.065%