Explanations
Southern Poverty Law Center
Negative Logits
to    1.49
of    1.38
ش    1.28
(    1.20
ü    1.19
is    1.17
to    1.16
전    1.16
has    1.15
리    1.14
Positive Logits
’    0.98
:”    0.96
:    0.87
شرطونه    0.78
дора    0.78
ς    0.77
])));    0.76
تين    0.76
MiddleLine    0.75
вою    0.75
Activations Density: 0.001%