INDEX
Explanations
phrases indicating self-involvement or self-referential actions
phrases suggesting actions of self-identification or self-reference
Negative Logits
gap        -0.70
heny       -0.70
Feature    -0.67
grade      -0.65
lav        -0.63
jug        -0.59
ahan       -0.59
illary     -0.59
IPS        -0.59
culosis    -0.57
Positive Logits
pant       0.69
ortium     0.64
hunted     0.64
æµ         0.62
isner      0.62
ashamed    0.60
ãģĹ        0.60
disgust    0.60
peror      0.60
sanct      0.60
Activations Density 0.192%