INDEX
Explanations
insulting or critical phrases
words related to negative criticisms and societal issues
Negative Logits
Token        Logit
minster      -0.67
Newsletter   -0.66
Downs        -0.64
holm         -0.62
kaya         -0.61
Äĩ           -0.60
boro         -0.60
hover        -0.60
rav          -0.60
wake         -0.59
Positive Logits
Token        Logit
lished       0.70
ciating      0.63
lett         0.62
ļéĨĴ         0.61
Geek         0.57
metic        0.57
essors       0.57
UNC          0.53
hybrids      0.52
estyles      0.51
Activations Density 0.534%
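
The page does not define the density figure. One common reading is the share of sampled token positions on which the feature fires at all. A minimal sketch under that assumption follows; the function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and not taken from the source.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of token positions where the
    feature is active. The source page does not state its definition.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Made-up data: 1,000 token positions, 5 of which activate the feature.
acts = [0.0] * 995 + [1.2, 0.4, 3.1, 0.9, 2.2]
density = activation_density(acts)
print(f"{density:.3%}")  # formats the fraction as a percentage
```

Under this reading, a density of 0.534% would mean the feature is active on roughly 5 of every 1,000 sampled tokens.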