Explanations
references to specific individuals or groups in leadership or authoritative positions
Negative Logits
ingleton   -0.17
vail       -0.15
buch       -0.15
importe    -0.15
frei       -0.14
_patch     -0.14
ester      -0.14
apiro      -0.14
hari       -0.14
Làm        -0.14
Positive Logits
yles       0.19
Rob        0.15
232        0.14
anch       0.14
Cub        0.14
ibi        0.14
Bru        0.14
rium       0.14
on         0.14
boro       0.14
Activations Density 0.032%