Explanations
text related to criticizing or mocking others
Negative Logits
Token           Logit
Warehouse       -0.72
rebuilt         -0.71
Located         -0.67
romeda          -0.64
pioneering      -0.64
chnology        -0.61
bitious         -0.61
phalt           -0.61
ufact           -0.60
erenn           -0.60
Positive Logits
Token           Logit
sarcastic       1.02
slurs           1.00
ridicule        0.98
misinterpret    0.96
misunderstand   0.95
insults         0.92
insulting       0.92
replies         0.91
jokes           0.90
condesc         0.89
Activations Density: 0.862%