INDEX
Explanations
phrases related to political figures or events
Negative Logits
Token     Logit
PB        -0.70
RELE      -0.69
sid       -0.68
WW        -0.68
DEC       -0.66
OW        -0.65
hig       -0.64
DIRECT    -0.63
ASP       -0.63
AUD       -0.63
Positive Logits
Token     Logit
antes     1.10
idation   1.03
opia      1.03
ansion    1.02
idy       1.01
orks      1.01
ois       1.00
gettable  1.00
olic      0.99
rium      0.99
Activations Density: 0.260%