Explanations
references to personal experiences and emotional reactions
Negative Logits
ses        -0.27
bidden     -0.21
pired      -0.20
/or        -0.20
tempts     -0.18
pires      -0.18
cribed     -0.18
woke       -0.18
ductive    -0.17
quired     -0.17
Positive Logits
orem       0.51
oret       0.34
oretical   0.30
ories      0.26
semble     0.26
notated    0.25
/Set       0.23
grily      0.22
/Edit      0.22
/Sub       0.21
Activations Density: 5.273%