INDEX
Explanations
negative or harmful words and phrases
negative descriptors related to harm or damage
Negative Logits
Ħ¢            -0.80
£ı            -0.72
orthy         -0.69
ŃĶ            -0.67
uana          -0.66
ebus          -0.65
å§«           -0.63
rity          -0.63
ajo           -0.62
reinstated    -0.61
Positive Logits
lessly        0.75
ishly         0.71
inhib         0.70
ingly         0.69
uminati       0.69
wasteful      0.67
umin          0.66
monkey        0.65
relegation    0.65
negativity    0.65
Activations Density: 0.652%