INDEX
Explanations
positive moral attributes or qualities such as nobility and goodness
Negative Logits
aq            -0.75
esville       -0.74
ingo          -0.72
oan           -0.70
ing           -0.69
Controlled    -0.68
olina         -0.67
aby           -0.67
olver         -0.65
yss           -0.65
Positive Logits
deeds          1.12
intentions     0.97
minded         0.93
laureate       0.93
indignation    0.89
gentlemen      0.88
virtues        0.87
minded         0.87
gentleman      0.87
pursuits       0.85
Activations Density: 0.196%