Explanations
propaganda and narrative control
Negative Logits
Token           Logit
misuse          0.45
ڈیٹ             0.42
inappropriately 0.40
verme           0.40
indiscrimin     0.39
inm             0.38
不起            0.38
unpredictable   0.38
agon            0.38
inappropri      0.38
Positive Logits
Token           Logit
propaganda      1.25
Propaganda      0.97
宣传            0.96
propagand       0.96
narrative       0.94
propag          0.86
narrativa       0.84
narratives      0.82
Narrative       0.82
propagated      0.79
Activations Density: 0.036%