INDEX
Explanations
mentions of motives or reasons behind actions
references to motives behind actions or events
Negative Logits
token        logit
semble       -0.94
opy          -0.84
thumbnails   -0.84
alus         -0.81
ropolis      -0.78
ummer        -0.77
ogun         -0.72
hap          -0.72
redd         -0.71
mark         -0.70
Positive Logits
token        logit
motives      1.15
motive       1.10
justifying   1.06
motivations  1.03
behind       1.00
rationale    0.99
motivation   0.94
why          0.87
justify      0.87
reasoning    0.82
Activations Density 0.070%
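
As a rough illustration of how a density figure like the one above could be read, the sketch below treats activation density as the percentage of sampled tokens on which the feature's activation is above zero. This definition, the function name `activation_density`, the `threshold` parameter, and the sample data are all assumptions made for illustration; the page itself does not state how the value is computed.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of sampled tokens on which the
    feature fires at all; the source page does not define it.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical sample: the feature fires on 7 of 10,000 tokens.
sample = [0.0] * 9993 + [1.2] * 7
density = activation_density(sample)
print(f"{density:.3f}%")
```

Under this reading, a density of 0.070% would correspond to the feature firing on roughly 7 of every 10,000 tokens, which fits a narrow feature tied to words such as "motives" and "rationale".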