INDEX
Explanations
phrases related to explanation or reasoning
phrases that indicate attribution or parts of a whole
Negative Logits
soever      -0.85
rams        -0.79
ancies      -0.75
ittal       -0.75
teasp       -0.74
の魔        -0.72
estyles     -0.70
oons        -0.69
ゼウス      -0.68
erion       -0.66
Positive Logits
why             1.22
what            0.92
reason          0.89
explaining      0.83
why             0.79
being           0.79
WHY             0.78
me              0.77
understanding   0.76
overcoming      0.75
Activations Density 0.076%