Explanations
simple to understand and implement
Negative Logits
tedes      0.45
librarian  0.43
venue      0.42
barbers    0.41
scholar    0.41
recruit    0.41
chickpeas  0.40
khe        0.40
slay       0.40
stim       0.40
Positive Logits
свое       0.54
ewater     0.42
своем      0.42
Norweg     0.42
avanje     0.41
ний        0.41
받         0.41
오         0.40
elu        0.40
своей      0.39
Activations Density 0.004%