Explanations
preventing harm and exploitation
Negative Logits
nifty       1.02
tasty       0.97
funky       0.92
wacky       0.88
quirky      0.88
=!          0.87
handy       0.86
weird       0.85
annoying    0.83
giz         0.83
Positive Logits
trauma           1.30
tragically       1.25
retra            1.18
heartbreaking    1.09
compassion       1.06
traumas          1.05
Trauma           1.05
compassionate    1.04
harm             1.03
sadly            1.03
Activations Density: 1.814%