Explanations
reasons or explanations in a text
Negative Logits
semble       -0.74
ibaba        -0.67
ymph         -0.66
Roller       -0.64
ault         -0.63
chron        -0.63
Carbuncle    -0.61
rop          -0.61
izen         -0.60
transm       -0.60
Positive Logits
why           1.37
why           1.12
WHY           1.09
abl           1.02
Why           0.84
Why           0.82
justifying    0.80
Origin        0.77
cele          0.76
rationale     0.73
Activations Density: 1.551%