Explanations
mentions of political figures, especially negative references and critiques
occurrences of the word "and."
Negative Logits
inarily     -0.76
ridges      -0.68
note        -0.67
Contents    -0.65
culus       -0.65
floor       -0.65
physical    -0.64
different   -0.64
IED         -0.64
ENS         -0.63
Positive Logits
Associates      0.90
vice            0.83
others          0.77
ERSON           0.76
Sons            0.74
associates      0.73
assorted        0.71
consequently    0.69
other           0.69
then            0.68
Activations Density: 0.336%