INDEX
Explanations
phrases indicating causality or conditionality
Negative Logits
Token      Logit
yna        -0.69
elve       -0.66
CVE        -0.65
pione      -0.63
Virgin     -0.63
inois      -0.63
uty        -0.63
isively    -0.62
atl        -0.61
atri       -0.61
Positive Logits
Token       Logit
they        1.33
THEY        1.10
something   1.08
someone     1.06
there       1.04
somebody    1.02
you         1.01
everything  0.98
it          0.94
theirs      0.91
Activations Density: 0.235%