INDEX
Explanations
text related to actions involving daring, risk-taking, and potentially controversial behavior
phrases associated with negative actions or comments
Negative Logits
ndra           -0.79
ntil           -0.77
tions          -0.74
lich           -0.70
-+-+           -0.69
ambo           -0.68
etheless       -0.68
lished         -0.65
ategories      -0.63
vertisement    -0.63
Positive Logits
aback          1.09
seriously      1.07
cues           1.03
liberties      1.02
stride         0.91
Seriously      0.90
hostage        0.90
cue            0.89
plunge         0.89
precautions    0.89
Activations Density: 0.389%