Explanations
references to liberalism and related ideologies
Negative Logits
antry   -0.18
jím     -0.18
anke    -0.17
νι      -0.17
orch    -0.16
ean     -0.16
alist   -0.15
onet    -0.15
iers    -0.15
ee      -0.15
Positive Logits
ised      0.33
ization   0.32
isation   0.31
ized      0.31
izing     0.29
ize       0.28
ising     0.27
ism       0.27
izes      0.27
ises      0.23
Activations Density 0.022%