INDEX
Explanations
phrases indicating assumptions or beliefs
phrases emphasizing assumptions or beliefs about societal issues
Negative Logits
guard     -0.78
ensed     -0.71
WER       -0.71
yna       -0.70
backer    -0.70
arthed    -0.69
inar      -0.69
eng       -0.69
hm        -0.68
AZ        -0.66
Positive Logits
somehow        0.90
someday        0.82
everyone       0.81
everything     0.79
they           0.76
rationality    0.75
anyone         0.74
these          0.72
abandoning     0.71
justifies      0.69
Activations Density 0.192%