INDEX
Explanations
words related to negative feedback or disapproval
terms related to critique or disapproval
Negative Logits
cise        -0.70
frey        -0.66
tre         -0.64
eret        -0.63
ovember     -0.60
pared       -0.60
stocking    -0.60
tein        -0.60
coat        -0.59
ipeg        -0.59
Positive Logits
criticism    0.95
代           0.93
critic       0.91
critics      0.86
criticisms   0.86
leveled      0.83
arial        0.82
critiques    0.80
naires       0.77
龍契士       0.76
Activations Density 0.019%