INDEX
Explanations
negations or expressions of disagreement
Negative Logits
Token     Logit
nor       -0.17
辺        -0.15
nor       -0.14
nackte    -0.14
346       -0.14
itti      -0.14
inee      -0.14
nack      -0.13
(utf      -0.13
atern     -0.13
Positive Logits
Token     Logit
sure      0.30
sure      0.24
Sure      0.23
Sure      0.21
gonna     0.20
exactly   0.19
gon       0.17
icias     0.17
tingham   0.16
SURE      0.16
Activations Density: 0.052%