INDEX
Explanations
words related to disapproval or criticism
words related to public relations or promotional content
Negative Logits
Token        Logit
Sacrament    -0.73
Winds        -0.69
BILITY       -0.67
ORED         -0.63
persistence  -0.63
Eagle        -0.63
Royale       -0.62
born         -0.62
Americas     -0.61
croft        -0.60
Positive Logits
Token        Logit
imate        1.12
uning        1.10
imes         1.08
arians       1.06
ices         0.97
atical       0.96
icy          0.94
ams          0.94
icks         0.92
asion        0.92
Activations Density: 0.011%