INDEX
Explanations
instances of the pronoun "I" and related self-referential expressions
Negative Logits
iae       -0.15
acon      -0.15
preview   -0.14
itom      -0.14
orough    -0.14
uide      -0.13
117       -0.13
커        -0.13
iaz       -0.13
opi       -0.13
Positive Logits
suspect      0.33
sur          0.31
infer        0.30
suspects     0.28
assume       0.27
wonder       0.27
assumption   0.27
inference    0.26
assumes      0.26
ded          0.25
Activations Density 0.173%
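
The density figure above can be read as the share of token positions on which the feature fires at all. The sketch below is a minimal illustration under that assumption. The function name `activation_density`, the zero threshold, and the sample data are hypothetical and are not taken from the dashboard or the tool that produced it.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means the share of positions where the feature is
    active, i.e. strictly above `threshold` (zero by default).
    """
    if not activations:
        raise ValueError("need at least one activation value")
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical data: 1,000 token positions, of which 2 are active.
sample = [0.0] * 998 + [1.7, 0.4]
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.200%
```

Under this reading, a density of 0.173% would mean the feature is active on roughly 17 out of every 10,000 token positions.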