INDEX
Explanations
negatively connotated terms or contexts
Negative Logits
Token      Logit
Nicarag    -0.74
Manson     -0.67
ciating    -0.67
Revis      -0.66
Flores     -0.63
sodium     -0.61
dispatch   -0.60
selage     -0.60
extrem     -0.60
Bosnia     -0.60
Positive Logits
Token      Logit
share      1.08
rate       1.05
out        1.01
through    0.99
along      0.99
away       0.99
cation     0.99
atten      0.98
outs       0.98
cycle      0.96
Activations Density 0.057%