Explanations
disinformation and misinformation
Negative Logits
北海道  0.54
сного  0.53
家族  0.51
homozyg  0.51
льного  0.51
recorrido  0.50
ገል  0.50
त्ति  0.48
夗  0.47
pregnancies  0.47
Positive Logits
disinformation  0.73
you  0.64
filtering  0.64
Propaganda  0.61
misinformation  0.61
we  0.59
propaganda  0.59
your  0.58
countermeasures  0.58
Markt  0.57
Activations Density: 0.093%