Explanations
phrases related to providing explanations or justifications
multiple instances of the word "reasons" indicating various justifications or causes
Negative Logits
yss       -0.68
Winged    -0.68
puck      -0.67
franc     -0.66
Pixie     -0.66
needle    -0.64
ream      -0.63
Mous      -0.63
enged     -0.62
esc       -0.62
Positive Logits
why            0.91
reasons        0.89
cale           0.84
WHY            0.84
pointers       0.83
arguments      0.80
æĦ             0.78
justifying     0.78
why            0.77
explanations   0.75
Activations Density: 0.020%