Explanations
phrases related to explanations or reasoning
phrases that clarify reasons or justifications
Negative Logits
emies   -0.82
ille    -0.78
heit    -0.73
ontent  -0.72
jab     -0.72
nets    -0.72
ctors   -0.69
ngth    -0.68
ionics  -0.68
estial  -0.67
Positive Logits
why              1.70
why              1.31
WHY              1.21
discrepancies    0.98
how              0.96
Why              0.91
inconsistencies  0.89
Why              0.85
reluctance       0.84
variance         0.82
Activations Density: 0.114%