Explanations
references to causation and reasoning in statements
Negative Logits
ird           -0.15
ourselves     -0.13
opl           -0.13
yourself      -0.13
Yourself      -0.13
/her          -0.13
ITAL          -0.12
yourselves    -0.12
igu           -0.12
372           -0.12
Positive Logits
它            0.48
its           0.44
it            0.44
它们          0.42
оно           0.41
，它          0.40
they          0.34
Its           0.34
nó            0.33
Its           0.33
Activations Density: 0.477%