INDEX
Explanations
names, particularly those associated with crime or scandals
Negative Logits
oog      -0.20
ovel     -0.19
ed       -0.18
iag      -0.18
iens     -0.16
ovsky    -0.16
ogie     -0.16
eses     -0.16
oton     -0.16
ogs      -0.15
Positive Logits
rr       0.26
ington   0.22
inger    0.22
r        0.21
era      0.20
amient   0.20
icks     0.20
ick      0.19
ort      0.19
itt      0.19
Activations Density: 0.022%