INDEX
Explanations
phrases related to reasoning or causality
Negative Logits
Token   Logit
wn      -0.74
lez     -0.74
ax      -0.70
robe    -0.70
agin    -0.69
mint    -0.67
nin     -0.66
Gas     -0.64
hal     -0.64
age     -0.64
Positive Logits
Token       Logit
they        1.09
there       0.91
nobody      0.90
it          0.87
THEY        0.85
otherwise   0.83
we          0.81
unlike      0.80
*/(         0.78
he          0.75
Activations Density: 1.159%