Explanations
references to sexual exploitation and trafficking
Negative Logits
bjerg         -0.17
killer        -0.16
ettle         -0.16
noop          -0.14
nerg          -0.14
ULSE          -0.14
assassin      -0.14
setattr       -0.14
diet          -0.13
Leakage       -0.13
Positive Logits
trafficking   0.41
Traff         0.39
traff         0.39
sex           0.35
Tra           0.31
human         0.31
-tra          0.29
traf          0.28
exploitation  0.28
forced        0.28
Activations Density: 0.025%