INDEX
Explanations
phrases related to providing explanations or reasons
phrases that indicate explanations or justifications
Negative Logits
ography     -0.77
emies       -0.76
ctors       -0.74
nets        -0.74
heit        -0.71
Dialogue    -0.71
jab         -0.69
dayName     -0.69
nown        -0.68
ograp       -0.67
Positive Logits
why                1.53
why                1.13
WHY                0.99
discrepancies      0.97
variance           0.91
reluctance         0.89
inconsistencies    0.84
discrep            0.82
how                0.80
disparities        0.79
Activations Density 0.126%