INDEX
Explanations
phrases indicating causation or reasoning, particularly with the word "because."
Negative Logits
Token   Logit
lez     -0.72
wn      -0.71
robe    -0.69
agin    -0.69
ax      -0.69
mint    -0.66
lem     -0.65
nin     -0.64
Gas     -0.64
yan     -0.63
Positive Logits
Token      Logit
they       1.04
nobody     0.90
there      0.89
it         0.84
unlike     0.82
otherwise  0.81
we         0.79
THEY       0.79
*/(        0.78
he         0.75
Activations Density: 0.561%