INDEX
Explanations
references to government officials and their actions
Negative Logits
apult          -0.18
fall           -0.18
bilt           -0.16
ovan           -0.15
ofilm          -0.15
/Typography    -0.15
lings          -0.15
PLY            -0.15
our            -0.15
illo           -0.15
Positive Logits
dom            0.28
844            0.20
dehyde         0.17
istra          0.17
/legal         0.17
ity            0.17
stration       0.16
ivec           0.16
ÙĤات           0.16
most           0.16
Activations Density: 0.012%