Explanations
references to political figures and their actions
Negative Logits
Token           Logit
ATRIX           -0.18
ëŀľëĵľ          -0.17
diren           -0.17
cestor          -0.15
raquo           -0.15
addCriterion    -0.15
istring         -0.14
tá»ij           -0.14
restau          -0.14
OKIE            -0.14
Positive Logits
Token           Logit
ÃĤ              0.18
â               0.17
_               0.16
(               0.16
said            0.16
gu              0.15
[â̦            0.15
                0.15
,               0.15
â               0.15
Activations Density 0.038%