Explanations
mentions of different ideologies
references to various ideologies
Negative Logits
Token   Logit
de      -0.75
upon    -0.73
Mamm    -0.72
ells    -0.72
ten     -0.71
EVA     -0.70
shall   -0.69
upper   -0.67
hap     -0.67
teen    -0.67
Positive Logits
Token         Logit
ideology      1.09
indoctr       0.96
eering        0.92
theorist      0.89
guiActiveUn   0.89
ideologies    0.83
affiliation   0.83
underpin      0.81
ologue        0.80
ologies       0.80
Activations Density 0.014%
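
The density figure above can be read as the share of token positions on which the feature fires. A minimal sketch of that arithmetic follows, assuming density means the fraction of positions with a strictly positive activation. The function name activation_density and the sample data are hypothetical and do not come from the dashboard.

```python
def activation_density(activations):
    """Return the fraction of positions with a strictly positive activation.

    Assumption: "density" means (number of active positions) / (all positions).
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > 0)
    return active / len(activations)


# Hypothetical data: 14 active positions out of 100,000 tokens.
sample = [0.0] * 99_986 + [1.0] * 14
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.014% corresponds to roughly 14 active positions per 100,000 tokens.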