INDEX
Explanations
mentions of drug-related terms
terms and phrases related to injury or self-harm
Negative Logits
Token                 Logit
BOOK                  -0.89
UAL                   -0.85
manship               -0.76
\\\\\\\\              -0.76
ãĤµãĥ¼ãĥĨãĤ£ãĥ¯ãĥ³    -0.73
soDeliveryDate        -0.73
hare                  -0.72
DragonMagazine        -0.69
ãģ®éŃĶ                -0.68
AU                    -0.68
Positive Logits
Token                 Logit
inite                 1.31
luence                1.22
rastructure           1.17
ractions              1.15
inity                 1.12
licted                1.11
acet                  1.08
amous                 1.08
erno                  1.07
raction               1.05
Activations Density: 0.008%
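
To put the density figure in concrete terms, here is a minimal sketch that converts it into an expected count of activating tokens. It assumes that "activations density" means the fraction of sampled tokens on which the feature has a nonzero activation; the page itself does not define the term, and the function name `expected_activations` is hypothetical. Under that assumption, 0.008% corresponds to roughly 80 activating tokens per million.

```python
def expected_activations(density_percent: float, n_tokens: int) -> float:
    """Expected number of tokens on which the feature fires.

    Assumption (not stated on the page): density is the percentage of
    tokens with a nonzero activation for this feature.
    """
    return density_percent / 100.0 * n_tokens


if __name__ == "__main__":
    # Density reported above: 0.008%
    count = expected_activations(0.008, 1_000_000)
    print(f"about {count:.0f} activating tokens per million")
```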