INDEX
Explanations
negative words and sentiments related to emotions or conflicts
negative connotations or expressions of disgust
Negative Logits
Token          Logit
trained        -0.70
individually   -0.69
imately        -0.68
terday         -0.67
silenced       -0.66
deceived       -0.66
theless        -0.65
probable       -0.65
curiously      -0.64
misled         -0.64
Positive Logits
Token          Logit
ocations       1.02
aution         1.01
tones          0.97
eness          0.90
ptions         0.89
isons          0.89
usions         0.88
isms           0.87
otypes         0.87
ifts           0.87
Activations Density: 0.282%