INDEX
Explanations
negations or expressions of disagreement
Negative Logits
not        -0.18
não        -0.16
never      -0.15
nicht      -0.14
не         -0.14
no         -0.14
niet       -0.13
不得       -0.13
ummings    -0.13
uars       -0.13
Positive Logits
ched          0.27
necessarily   0.26
ori           0.25
tingham       0.25
anymore       0.24
yet           0.23
ching         0.22
ches          0.22
epad          0.22
oriously      0.22
Activations Density 0.265%
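
Token lists like the Negative Logits one above often come from byte-level BPE vocabularies, which display non-ASCII tokens as remapped bytes, so the Chinese negation 不得 can show up as ä¸įå¾Ĺ in a raw token dump. The sketch below shows one way to recover readable text from such a token. It assumes the model uses a GPT-2-style byte-to-unicode table, which this page does not confirm; `byte_decoder` and `decode_token` are hypothetical helper names, not part of any dashboard API.

```python
def byte_decoder():
    """Build the inverse of a GPT-2-style byte-to-unicode table (assumed here)."""
    # Bytes that are displayed as themselves: printable ASCII and most of Latin-1.
    printable = (
        list(range(0x21, 0x7F))
        + list(range(0xA1, 0xAD))
        + list(range(0xAE, 0x100))
    )
    table = {b: chr(b) for b in printable}
    # Every remaining byte is shifted to a code point starting at 256.
    shift = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + shift)
            shift += 1
    # Invert: display character -> raw byte value.
    return {ch: b for b, ch in table.items()}


_DECODER = byte_decoder()


def decode_token(token: str) -> str:
    """Map a displayed byte-level token back to UTF-8 text."""
    raw = bytes(_DECODER[ch] for ch in token)
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # The byte-level form from the token list, written with escapes for safety.
    garbled = "\u00e4\u00b8\u012f\u00e5\u00be\u0139"
    print(decode_token(garbled))
```

Plain ASCII tokens such as `not` pass through unchanged, so the helper can be applied to a whole token column. Tokens that end in the middle of a multi-byte character decode with replacement characters rather than raising an error.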