INDEX
Explanations
reasons or explanations
expressions that indicate reasoning or justification
Negative Logits
ty           -0.70
rop          -0.62
ーク         -0.62
zman         -0.62
bow          -0.60
ス           -0.60
transm       -0.59
aith         -0.59
exchanged    -0.58
prom         -0.58
Positive Logits
Canaver      0.89
why          0.87
soever       0.85
forth        0.79
ratulations  0.69
forward      0.68
Emblem       0.65
terday       0.63
afa          0.62
we           0.61
Activations Density 0.041%
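
The density figure can be read as the share of token positions on which the feature fires at all. The sketch below is a minimal illustration under that assumption; the page itself does not define the metric, and the function name `activation_density`, the zero threshold, and the toy data are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of token positions where the
    feature is active; the threshold of 0.0 is illustrative only.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical toy data: 10,000 token positions, 4 of which activate.
toy = [0.0] * 9996 + [1.3, 0.2, 2.7, 0.9]
density = activation_density(toy)
print(f"{density:.3%}")
```

A density this low would mean the feature is silent on nearly every token and fires only in a narrow set of contexts.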