INDEX
Explanations
words related to reasons or intentions behind actions
terms related to motives or reasons behind actions
Negative Logits
token        logit
thumbnails   -0.92
semble       -0.84
ropolis      -0.80
ummer        -0.73
mark         -0.72
hap          -0.70
enegger      -0.70
opy          -0.69
owship       -0.69
kson         -0.68
Positive Logits
token        logit
motives      1.07
justifying   1.02
motive       0.99
behind       0.96
motivations  0.95
rationale    0.91
why          0.84
motivation   0.82
WHY          0.78
reasoning    0.77
Activations Density: 0.040%