INDEX
Explanations
references to drugging or related terms
references to drugging
Negative Logits
ray          -0.79
coloured     -0.76
road         -0.69
Pastebin     -0.68
rir          -0.68
ution        -0.68
Sabha        -0.68
rique        -0.67
aceutical    -0.66
pite         -0.65
Positive Logits
dru          1.14
squat        0.70
scaling      0.70
advertising  0.68
dup          0.68
raping       0.63
doub         0.63
submar       0.62
elig         0.62
dumping      0.61
Activations Density: 0.001%