Explanations
logical connectors such as conjunctions and disjunctions
Negative Logits
ſelf        -0.86
]='\        -0.82
'},         -0.80
++          -0.78
/>";        -0.78
ſelves      -0.77
'],         -0.75
"])         -0.75
pleaſure    -0.73
`;          -0.73
Positive Logits
I             1.01
you           0.88
there         0.79
we            0.78
they          0.76
stuff         0.75
everything    0.72
it            0.71
I             0.71
and           0.69
Activations Density 0.274%