INDEX
Explanations
words related to propaganda
references to propaganda and related concepts
Negative Logits
ergy            -0.94
Course          -0.77
clud            -0.76
semble          -0.71
earth           -0.70
slice           -0.65
joining         -0.65
count           -0.65
20439           -0.65
Earth           -0.64
POSITIVE LOGITS
aganda          1.22
propaganda      1.09
disinformation  0.93
leaflets        0.90
abwe            0.86
suppression     0.84
guiActiveUn     0.83
propag          0.82
indoctr         0.82
ãĤ¸             0.81
Activations Density 0.010%