INDEX
Explanations
phrases indicating negation or denial of responsibility
Negative Logits
ruba	-0.17
جر	-0.17
oklyn	-0.16
uzey	-0.15
íĥģ	-0.14
ë°į	-0.14
.Apis	-0.14
reopen	-0.14
utral	-0.14
arel	-0.14
Positive Logits
still	0.18
obs	0.17
mon	0.16
acho	0.14
fault	0.14
ora	0.14
era	0.14
alt	0.14
amine	0.14
ainda	0.14
Activations Density 0.115%