Explanations
instances of self-reflection and acknowledgment in personal experiences
Negative Logits
Token    Logit
en       -0.52
on       -0.51
(blank)  -0.50
th       -0.50
бок      -0.50
mb       -0.49
t        -0.49
<eos>    -0.49
h        -0.49
前       -0.48
Positive Logits
Token        Logit
admit        1.31
approve      1.05
myſelf       1.03
accept       1.03
acknowledge  1.01
recognise    1.00
Efq          1.00
itſelf       1.00
recognize    0.99
agree        0.96
Activations Density 0.129%