INDEX
Explanations
phrases expressing reasoning or cause and effect
phrases that introduce reasoning or justification
Negative Logits
ty          -0.65
MM          -0.61
kil         -0.60
exchanged   -0.60
mage        -0.60
BM          -0.60
MM          -0.59
woodland    -0.59
ute         -0.59
room        -0.59
Positive Logits
why         0.94
soever      0.88
forward     0.80
forth       0.76
why         0.71
Canaver     0.70
WHY         0.68
ioned       0.67
¿½          0.65
HAEL        0.64
Activations Density: 0.026%
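
To make the density figure concrete, here is a minimal sketch of how such a number could be computed. It assumes that activation density means the fraction of token positions on which the feature's activation is above zero; the page does not state its definition, so that is an assumption. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and synthetic, with the sample chosen only so that the result matches the 0.026% shown above.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of token positions where the
    feature fires at all. The dashboard does not confirm this definition.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Synthetic data (not taken from the dashboard): 26 active positions
# out of 100,000 tokens.
sample = [1.5] * 26 + [0.0] * 99_974
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.026% would mean the feature fires on roughly 26 of every 100,000 tokens, which would make it a sparse feature.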