INDEX
Explanations
words related to negativity, evil, and criminal activities
terms related to negative or malicious attributes and actions
Negative Logits
ainers    -0.87
gain      -0.75
ersion    -0.70
Surv      -0.70
onga      -0.69
hene      -0.68
grain     -0.68
Recomm    -0.67
ña        -0.66
aple      -0.66
Positive Logits
plotting      1.00
mastermind    0.95
deed          0.94
nefarious     0.92
mischief      0.92
deeds         0.92
intent        0.86
undermin      0.85
sche          0.84
havoc         0.84
Activations Density 0.054%
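
The density figure above can be read as the share of token positions on which this feature fires at all. The sketch below assumes that reading (activation greater than zero counts as "active"); the function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and are not taken from the dashboard or its tooling.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means the share of positions where the feature is active.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 10,000 token positions, 5 of which activate the feature.
sample = [0.0] * 10_000
for position in (3, 250, 4_999, 7_000, 9_999):
    sample[position] = 1.5

density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.050%
```

Under this reading, a density of 0.054% would correspond to the feature being active on roughly 54 of every 100,000 token positions, which is consistent with a sparse feature tied to a narrow vocabulary such as the positive-logit tokens listed above.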