INDEX
Explanations
instances of mixed emotional responses or contradictions in context
Negative Logits
hari        -0.18
incident    -0.17
incident    -0.15
echa        -0.15
Nash        -0.15
HECK        -0.15
Incident    -0.14
UGHT        -0.14
imetype     -0.14
Rag         -0.14
Positive Logits
regret            0.26
sorry             0.26
sorry             0.25
unfortunately     0.25
disappointment    0.24
sad               0.24
unfortunate       0.24
Sorry             0.23
disappointed      0.23
disappointing     0.23
Activations Density: 0.186%