Explanations
references to political figures and their associated actions
Negative Logits
Token        Logit
mb           -0.15
argon        -0.15
alker        -0.14
avig         -0.14
instrument   -0.14
MB           -0.14
hollow       -0.14
andin        -0.14
ationale     -0.14
otions       -0.14
Positive Logits
Token        Logit
ti           0.21
sis          0.20
tain         0.20
si           0.19
ture         0.19
tures        0.19
tu           0.18
bil          0.18
ni           0.17
tle          0.17
Activations Density: 0.012%