INDEX
Explanations
secretly wronged or useless
Negative Logits
鲔            0.53
髖            0.49
鈮            0.48
煐            0.47
фараз        0.46
esimerk      0.46
structuring  0.45
莴            0.45
鳟            0.45
矵            0.44
Positive Logits
blackmail    0.49
secretly     0.48
ji           0.48
wronged      0.46
useless      0.44
stupid       0.44
hehe         0.43
hehe         0.43
pretended    0.43
traitor      0.43
Activations Density 0.005%