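Negative Logits
bragging      0.64
cunning       0.64
面白          0.62   (Japanese: "funny, interesting")
hilarious     0.61
exaggerate    0.61
面白い        0.60   (Japanese: "funny, interesting")
재미          0.59   (Korean: "fun")
quirks        0.59
glamorous     0.59
মজার          0.57   (Bengali: "funny")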
Positive Logits
respectful    0.68
Sensitivity   0.64
Feminist      0.60
Sensitivity   0.59
sensitively   0.59
educators     0.59
feminist      0.58
respectfully  0.58
LGBTQ         0.58
ধর্ষণ         0.57   (Bengali: "rape")
Activations Density: 0.005%