Explanations
references to liberalism and its various forms and implications
Negative Logits
ersen       -0.15
ey          -0.15
plier       -0.15
uld         -0.15
ese         -0.15
ej          -0.15
iq          -0.14
alian       -0.14
iles        -0.14
oth         -0.14

Positive Logits
ised         0.18
ornings      0.16
/lib         0.15
onec         0.15
hift         0.15
ATA          0.14
ising        0.14
uyến         0.14
-leaning     0.14
entiful      0.14

Activations Density 0.008%