INDEX
Explanations
phrases indicating reasoning or causation
the word "because" in various contexts
Negative Logits
wn       -0.79
shaw     -0.76
ardon    -0.74
agin     -0.73
ns       -0.72
yan      -0.68
ery      -0.67
jet      -0.67
yr       -0.66
mint     -0.66
Positive Logits
*/(          0.78
they         0.73
anecd        0.64
proxies      0.64
ecause       0.63
there        0.63
nobody       0.62
frankly      0.61
we           0.60
mathematic   0.60
Activations Density: 0.068%