INDEX
Explanations
statements reporting feelings or experiences
Negative Logits
ysz        -0.17
loit       -0.15
iena       -0.15
aines      -0.15
odel       -0.14
rary       -0.14
edback     -0.14
idot       -0.14
thood      -0.14
oyer       -0.14
Positive Logits
anean      0.17
conf       0.16
My         0.16
CTS        0.15
Nu         0.14
blow       0.14
nou        0.14
Nu         0.14
pressures  0.14
nu         0.14
Activations Density: 0.186%