Explanations
the cause or motive behind events or actions
language related to motives and causes
Negative Logits
abo          -0.78
owship       -0.73
gard         -0.71
buff         -0.70
udo          -0.67
NRS          -0.65
mun          -0.65
OWS          -0.64
paio         -0.64
byter        -0.63

Positive Logits
behind        1.18
why           1.03
motivating    0.99
underlying    0.97
motives       0.95
culprit       0.95
why           0.92
responsible   0.92
motive        0.92
motivations   0.90
Activations Density: 0.273%