INDEX
Explanations
keywords associated with providing explanations or justifications
phrases explaining motivations or justifications
Negative Logits
stocking    -0.70
annis       -0.70
puck        -0.68
KY          -0.67
ymph        -0.67
aeper       -0.66
helicop     -0.65
Roller      -0.65
avorite     -0.65
thus        -0.65
Positive Logits
why          1.10
WHY          0.98
why          0.88
abl          0.83
reason       0.80
rationale    0.77
usercontent  0.76
justifying   0.75
orial        0.75
="#          0.74
Activations Density 0.025%