INDEX
Explanations
phrases indicating a historical or factual context
Negative Logits
politics  -0.70
fo        -0.70
rene      -0.65
trap      -0.65
marg      -0.65
ben       -0.64
get       -0.63
Panda     -0.63
aiden     -0.62
pez       -0.62
Positive Logits
soever            1.27
we                0.70
they              0.69
xual              0.68
he                0.67
izens             0.66
=-=-=-=-=-=-=-=-  0.65
ippi              0.64
she               0.64
ordan             0.63
Activations Density 0.386%