INDEX
Explanations
references to various national or ethnic identities
Negative Logits
Token     Logit
lessly    -0.19
PÅĻÃŃ     -0.17
lying     -0.15
´s        -0.15
ymax      -0.15
ptron     -0.15
evi       -0.15
panies    -0.15
ful       -0.14
itself    -0.14
Positive Logits
Token        Logit
who          0.31
who          0.26
'            0.26
whom         0.22
’            0.21
cape         0.19
-Americans   0.18
-American    0.18
que          0.18
ided         0.18
Activations Density 0.088%