INDEX
Explanations
words related to propaganda and opinion-sharing
Negative Logits
reat     -0.68
ritic    -0.66
yip      -0.64
knife    -0.62
pperc    -0.62
atro     -0.62
atches   -0.62
thren    -0.60
FH       -0.60
rients   -0.58
Positive Logits
ocating  1.03
uding    1.02
usion    0.95
uring    0.94
sorts    0.90
ocated   0.87
kinds    0.81
ocation  0.81
usions   0.80
owing    0.79
Activations Density: 0.045%