INDEX
Explanations
phrases indicating independence or self-sufficiency
Negative Logits
Own                  -0.18
own                  -0.18
theirs               -0.18
Own                  -0.17
own                  -0.17
Yours                -0.17
OWN                  -0.16
yours                -0.16
longleftrightarrow   -0.15
poc                  -0.14
Positive Logits
enor                 0.15
elsing               0.15
ocket                0.14
ubb                  0.14
orthand              0.14
chas                 0.14
endid                0.13
lys                  0.13
osa                  0.13
cha                  0.13
Activations Density: 0.045%