INDEX
Explanations
terms signaling the need for explanation or justification
phrases related to explaining concepts or phenomena
Negative Logits
sembly    -0.79
ngth      -0.72
Ranked    -0.70
opers     -0.68
ches      -0.66
ille      -0.65
net       -0.63
inion     -0.63
illet     -0.62
estial    -0.61
Positive Logits
why             1.35
WHY             1.20
why             1.12
explanations    0.87
ĸļ              0.81
Origin          0.79
how             0.78
disapp          0.78
abl             0.73
away            0.72
Activations Density 0.058%