Explanations
references to race-related violence and exploitation
Negative Logits
Token       Logit
373         -0.14
irts        -0.13
crets       -0.13
nst         -0.13
ennon       -0.12
uilder      -0.12
ublished    -0.12
少女        -0.12
rains       -0.12
raj         -0.12
Positive Logits
Token       Logit
expend      0.35
disposable  0.32
fodder      0.30
chatt       0.30
prey        0.28
pawn        0.27
cannon      0.27
targets     0.26
collateral  0.26
Disposable  0.25
Activations Density: 0.275%