Explanations
explaining problematic jokes
NEGATIVE LOGITS
AMENTE    0.74
urie      0.73
az        0.72
ana       0.70
uel       0.68
tokens    0.68
Token     0.68
ann       0.67
ă         0.66
utral     0.66
POSITIVE LOGITS
cadmium        0.88
walkway        0.85
bulky          0.85
protective     0.84
mountainous    0.82
prestigious    0.82
bustling       0.82
infused        0.81
prospective    0.79
hilltop        0.79
Activations Density 0.003%