Explanations
linguistic expressions related to different ideologies and belief systems
references to various political ideologies
Negative Logits
de              -0.84
upon            -0.80
ten             -0.73
ells            -0.73
wolves          -0.72
bors            -0.70
teen            -0.69
agh             -0.68
pain            -0.68
Interstitial    -0.68
Positive Logits
ideology        1.19
guiActiveUn     0.99
indoctr         0.96
theorist        0.92
ideologies      0.91
affiliation     0.85
creed           0.84
ide             0.82
theoret         0.82
yip             0.80
Activations Density: 0.008%