Explanations
explanations or reasoning in a text
Negative Logits
Token         Logit
icipated      -0.66
apers         -0.66
rift          -0.64
iership       -0.63
RAW           -0.63
actionDate    -0.62
shaw          -0.61
vez           -0.60
display       -0.60
ourse         -0.58
Positive Logits
Token         Logit
yeah           1.04
yeah           1.04
hhh            0.99
hhhh           0.95
prest          0.94
Yeah           0.93
kidding        0.92
pardon         0.89
mmm            0.88
yea            0.87
Activations Density: 0.648%