INDEX
Explanations
phrases related to cause and effect or explanation
instances of the word "this."
Negative Logits
Pets      -0.70
Drops     -0.67
agi       -0.65
Papers    -0.64
Daniels   -0.64
Personal  -0.63
oots      -0.62
mates     -0.62
aws       -0.61
Fit       -0.61
Positive Logits
particular   0.96
latter       0.94
trope        0.88
phenomenon   0.82
article      0.80
newfound     0.79
arrangement  0.78
subset       0.78
invention    0.75
behaviour    0.74
Activations Density 0.270%