Explanations
statements related to disclosure or justification
discussions centered around honesty and revelation
Negative Logits
phalt      -0.76
rians      -0.70
atl        -0.70
croft      -0.68
agues      -0.67
rian       -0.66
erva       -0.66
iatrics    -0.65
enf        -0.65
onder      -0.64
Positive Logits
his        1.07
their      1.03
herself    0.93
her        0.89
owning     0.87
what       0.86
how        0.86
their      0.86
himself    0.85
quitting   0.85
Activations Density 0.401%