INDEX
Explanations
repeated phrases indicating inclusivity or universality
Negative Logits
eworthy    -0.17
ulate      -0.17
midi       -0.17
side       -0.15
785        -0.15
ãi         -0.15
ioned      -0.14
atcher     -0.14
nel        -0.14
elines     -0.14
Positive Logits
ones       0.21
THING      0.19
/all       0.18
hone       0.18
thin       0.17
though     0.17
where      0.16
ong        0.16
theless    0.16
-other     0.16
Activations Density: 0.085%