INDEX
Explanations
phrases indicating causality or logical reasoning
instances of the word "makes."
Negative Logits
Token    Logit
rist     -0.56
uish     -0.52
iped     -0.51
phy      -0.50
Agency   -0.50
Thirty   -0.50
orne     -0.49
imen     -0.49
vez      -0.49
wana     -0.49
Positive Logits
Token          Logit
makes          2.83
makes          2.32
Makes          2.16
gives          1.93
creates        1.90
helps          1.82
distinguishes  1.81
lends          1.76
brings         1.75
proves         1.73
Activations Density 0.026%