Negative Logits
crappy          2.17
shitty          2.08
bullshit        2.06
kinda           2.02
messed          1.93
mensen          1.92
boobs           1.90
haha            1.82
kids            1.80
yeah            1.79
Positive Logits
strikingly      1.77
此外            1.56
remarkably      1.55
markedly        1.54
Invoke          1.51
crucially       1.51
renowned        1.50
unequivocal     1.49
unquestionably  1.48
regarded        1.48
Activations Density 0.422%