INDEX
Explanations
mentions of different ideologies and their related concepts
terms related to ideology and its various manifestations
Negative Logits
ells       -0.90
Mamm       -0.81
FACE       -0.74
rooms      -0.74
tub        -0.74
ibli       -0.73
backs      -0.72
theless    -0.71
teen       -0.71
ilet       -0.67
Positive Logits
ideology       0.99
indoctr        0.93
affiliation    0.89
eering         0.87
theorist       0.83
theoret        0.83
ide            0.82
guiActiveUn    0.81
ideologies     0.80
purity         0.79
Activations Density 0.017%
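
For readers unfamiliar with the density figure, here is a minimal sketch of how such a number could be computed. It assumes that density means the fraction of sampled token positions on which the feature's activation is above zero; the page itself does not state the definition. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and are not taken from this page.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means the share of positions where the feature fires.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical data: 10,000 token positions, 2 of which activate the feature.
sample = [0.0] * 9998 + [1.3, 0.4]
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this assumed definition, a density of 0.017% would mean the feature fires on roughly 17 of every 100,000 tokens.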