Explanations
political references and statements by politicians
Negative Logits
Token           Logit
ãĥ¥             -0.68
worldly         -0.67
Russ            -0.65
zn              -0.63
Interested      -0.61
ARCH            -0.61
ARY             -0.59
oir             -0.58
âĶĢâĶĢâĶĢâĶĢ    -0.57
',              -0.57
Positive Logits
Token           Logit
withdrew        0.82
joins           0.82
denies          0.80
was             0.74
wrote           0.74
appears         0.74
became          0.74
declined        0.73
teaches         0.73
opposes         0.72
Activations Density 0.943%