INDEX
Explanations
phrases related to beliefs, opinions, and actions
phrases related to belief and social actions
Negative Logits
interstitial  -0.74
iven          -0.68
%.            -0.67
'.            -0.67
kson          -0.63
hyde          -0.61
zik           -0.59
'.            -0.58
AKING         -0.57
nown          -0.57
Positive Logits
pires         0.79
weren         0.74
aren          0.73
were          0.70
leground      0.67
are           0.65
hran          0.65
estern        0.65
alg           0.60
erate         0.60
Activations Density: 0.339%