INDEX
Explanations
negations and phrases indicating exceptions or contrasts
Negative Logits
486         -0.18
ounder      -0.16
innacle     -0.16
meanwhile   -0.15
ale         -0.15
опас        -0.15
IMA         -0.14
addir       -0.14
imar        -0.14
esco        -0.14
Positive Logits
necessarily    0.17
çĹ             0.16
withstanding   0.16
chers          0.16
ANJI           0.15
ting           0.15
adena          0.14
rons           0.14
древ           0.14
tingham        0.14
Activations Density 0.047%