Explanations
phrases expressing strong negative emotions or criticisms towards others
expressions of regret or disgrace towards individuals or groups
Negative Logits
Token            Logit
livest           -0.79
llan             -0.75
stabilization    -0.75
Downloadha       -0.74
nels             -0.73
minster          -0.72
perature         -0.71
atonin           -0.70
combe            -0.69
ernand           -0.69
Positive Logits
Token            Logit
taxpayers        0.84
ãĤ®              0.82
ij士             0.81
cheated          0.80
victims          0.80
inflicted        0.79
da               0.77
humanity         0.73
anyone           0.73
shareholders     0.72
Activations Density 0.298%
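
The density figure above can be read as the share of sampled token positions on which the feature fires. The sketch below illustrates that reading; it assumes "fires" means an activation strictly above a threshold of zero, and both the function name `activation_density` and the sample data are hypothetical, not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: a position counts as active when its activation is strictly
    greater than the threshold (zero by default).
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 1000 token positions, 3 of them with nonzero activation.
sample = [0.0] * 997 + [1.2, 0.4, 2.5]
print(f"{activation_density(sample):.3%}")  # prints 0.300%
```

On this reading, a density of 0.298% would mean the feature is active on roughly 3 of every 1000 sampled tokens.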