INDEX
Explanations
instances of personal pronouns, particularly "I" and "we"
NEGATIVE LOGITS
Token     Logit
fol       -0.15
CC        -0.15
chin      -0.15
Loving    -0.14
-loving   -0.14
altern    -0.14
spray     -0.14
ervlet    -0.14
Duffy     -0.14
oplan     -0.14
POSITIVE LOGITS
Token     Logit
iag       0.16
iams      0.15
ENER      0.15
aits      0.15
uron      0.15
sop       0.14
XHR       0.14
mland     0.14
SENS      0.14
ieu       0.14
Activations Density 0.390%
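
A density figure like the one above is usually read as the fraction of token positions on which the feature fires at all. The sketch below illustrates that reading under stated assumptions: the function name `activation_density`, the zero threshold, and the sample values are all hypothetical and do not come from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: a feature counts as "active" when its activation is strictly
    greater than the threshold (zero by default).
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 1000 token positions, 4 of which activate the feature.
sample = [0.0] * 996 + [1.2, 0.3, 2.5, 0.7]
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.400%
```

Under this reading, a density of 0.390% would mean the feature fires on roughly 4 of every 1000 tokens.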