INDEX
Explanations
phrases prompting the reader to pay attention or consider something
Negative Logits
Token       Logit
lees        -0.87
oing        -0.80
sbm         -0.79
soever      -0.74
iar         -0.74
=~=~        -0.74
oided       -0.70
iere        -0.68
raged       -0.68
ittle       -0.66

Positive Logits
Token       Logit
WHY         1.11
something   1.08
what        1.05
why         1.04
how         0.99
causation   0.97
ourselves   0.96
basics      0.94
some        0.91
facts       0.90
Activations Density: 0.218%