Explanations
words relating to health risks and potential harm
terms related to the negative effects of substances or actions
Negative Logits
rollers   -0.69
ciples    -0.66
ahs       -0.65
zar       -0.63
doms      -0.62
ilde      -0.62
gres      -0.62
IDA       -0.61
quart     -0.60
elle      -0.59
Positive Logits
insofar    0.95
enough     0.93
owing      0.87
unless     0.86
compared   0.85
towards    0.80
because    0.79
deterrent  0.76
toward     0.75
against    0.74
Activations Density 0.245%