INDEX
Explanations
themes related to negative experiences and emotions
Negative Logits
sumpay            -0.55
bilingual         -0.53
complementary     -0.52
zest              -0.52
WireFormatLite    -0.52
complements       -0.51
قق                -0.51
accomplishments   -0.51
egli              -0.50
pioneers          -0.50
Positive Logits
dangerous         0.68
😡                0.64
CWE               0.64
offending         0.63
attack            0.61
postsleuth        0.59
🤬                0.58
🤦                0.58
😨                0.57
malicious         0.57
Activations Density 1.669%
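
The page does not say how the density figure is computed. A common reading, assumed here, is the percentage of sampled tokens on which the feature's activation is above zero. The sketch below illustrates that reading only; the function name `activation_density`, the `threshold` parameter, and the sample values are hypothetical and do not come from the page.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of activations strictly above `threshold`.

    Assumption: "density" means the share of tokens on which the feature
    fires at all. The source page does not confirm this definition.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical per-token activations for one feature (not real data).
sample = [0.0, 0.0, 1.2, 0.0, 0.3, 0.0, 0.0, 0.0]
density = activation_density(sample)
print(f"{density:.3f}%")
```

Under this reading, a density of 1.669% would mean the feature is active on roughly 1 in 60 sampled tokens.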