INDEX
Explanations
words or phrases related to reasons or causes
instances of the word "because" indicating causal relationships
Negative Logits
mint    -0.74
wn      -0.71
yan     -0.71
Gas     -0.70
alin    -0.70
agin    -0.69
lem     -0.69
ries    -0.67
ymph    -0.66
âĤ¬     -0.65
Positive Logits
they        0.86
*/(         0.79
ecause      0.69
otherwise   0.67
someone     0.67
THEY        0.66
akening     0.65
ority       0.65
mistakenly  0.64
there       0.64
Activations Density 0.080%