Explanations
words related to political and economic issues
Negative Logits
bender   -0.78
puff     -0.71
cum      -0.70
tar      -0.69
icter    -0.66
conom    -0.66
wic      -0.65
wrap     -0.64
ussen    -0.64
more     -0.64
Positive Logits
selves         1.38
own            1.20
ancestors      1.02
beloved        1.00
selves         0.95
ourselves      0.93
asses          0.93
adversaries    0.92
collective     0.90
hearts         0.88
Activations Density: 0.315%