Explanations
African American Vernacular English
Negative Logits
(           0.61
ում         0.57
Об          0.55
Су          0.55
João        0.55
Australia   0.53
Кар         0.52
doesn       0.52
Kết         0.52
Дру         0.52
Positive Logits
in              0.65
z               0.61
homem           0.52
ad              0.52
em              0.52
entrepreneur    0.50
b               0.49
c               0.49
f               0.49
mark            0.48
Activations Density: 0.002%