INDEX
Explanations
phrases related to problems, danger, harm, and injuries
terms related to safety concerns and injuries
Negative Logits
ku            -0.88
kt            -0.76
liam          -0.71
kus           -0.70
ãĤ©           -0.70
kh            -0.67
utm           -0.66
bal           -0.65
last          -0.65
Loft          -0.64
Positive Logits
nor            1.01
whatsoever     0.95
anymore        0.89
erno           0.71
anywhere       0.63
ocard          0.61
detectable     0.60
temptation     0.60
Spray          0.60
slightest      0.58
Activations Density 0.870%