Explanations
actions involving explanations or reasons
words associated with explaining or giving reasons
Negative Logits
luster    -0.81
assic     -0.69
dar       -0.67
rica      -0.67
ille      -0.66
sembly    -0.65
inates    -0.64
Pont      -0.63
itton     -0.63
oreal     -0.63
Positive Logits
why       1.75
WHY       1.47
how       1.36
why       1.34
how       1.04
HOW       0.97
Why       0.97
Why       0.91
what      0.90
away      0.83
Activations Density: 0.069%