INDEX
Explanations
expressions of personal preferences or dislikes
expressions of dislike or negative sentiment
Negative Logits
yrinth      -0.83
ensional    -0.83
achine      -0.81
aunder      -0.81
rontal      -0.80
monary      -0.79
minster     -0.78
alde        -0.77
igmatic     -0.77
estones     -0.76
Positive Logits
anymore     1.06
anybody     0.90
anyone      0.82
anything    0.81
bullies     0.79
undue       0.77
any         0.76
nor         0.76
surprises   0.74
cens        0.72
Activations Density: 0.084%