INDEX
Explanations
mentions of "dirty" or morally questionable actions/concepts
references to "dirty" or unethical practices and behaviors
Negative Logits
Token         Logit
*/(           -1.20
istically     -0.82
itech         -0.80
XT            -0.78
aic           -0.77
isol          -0.76
HCR           -0.75
ãĥĦ           -0.74
izations      -0.73
uther         -0.73
Positive Logits
Token         Logit
laundry       1.23
tricks        1.08
linen         1.06
diapers       0.97
rotten        0.93
dirty         0.92
diaper        0.88
luc           0.86
trick         0.85
dishes        0.83
Activations Density 0.075%