INDEX
Explanations
conjunctions and phrases indicating connections between concepts
Negative Logits
himself       -0.91
herself       -0.88
herself       -0.80
Himself       -0.80
himself       -0.79
him           -0.73
themselves    -0.73
lui           -0.72
them          -0.71
and           -0.71
Positive Logits
there         1.52
it            1.49
although      1.25
they          1.15
the           1.14
its           1.12
while         1.10
this          1.05
when          0.95
these         0.92
Activations Density 0.683%