Explanations
explanatory statements
phrases related to explaining concepts or phenomena
Negative Logits
ngth          -0.82
ille          -0.75
emies         -0.75
kus           -0.72
sembly        -0.70
jab           -0.67
ontent        -0.66
Instruments   -0.66
ctors         -0.65
opers         -0.63
Positive Logits
why               1.65
why               1.33
WHY               1.31
how               0.96
discrepancies     0.92
Why               0.88
inconsistencies   0.87
Why               0.84
explanations      0.81
disapp            0.80
Activations Density 0.052%