INDEX
Explanations
phrases related to excuses or justifications
Negative Logits
semble       -0.92
ropolitan    -0.76
erial        -0.75
opy          -0.75
ymph         -0.73
marks        -0.72
ropolis      -0.72
opers        -0.70
efully       -0.70
mark         -0.70
Positive Logits
excuse           1.08
justifying       1.04
WHY              1.00
excuses          0.96
explanations     0.96
explanation      0.94
rationale        0.93
why              0.91
explaining       0.89
justification    0.88
Activations Density 0.129%