INDEX
Explanations
words related to negative reputation or bad behavior
terms related to reputation and pollution
Negative Logits
Token     Logit
kell      -0.77
ockets    -0.68
joint     -0.66
stay      -0.63
bees      -0.63
heels     -0.63
ork       -0.62
immer     -0.62
appers    -0.62
chrom     -0.59
Positive Logits
Token       Logit
uted        1.38
uting       1.19
anamo       1.15
uality      0.94
utes        0.91
utations    0.83
ously       0.79
agate       0.75
iating      0.75
iously      0.71
Activations Density: 0.009%