Explanations
rejection and repelling advances
Negative Logits
ัก    0.60
রা    0.57
ोत्    0.55
ू    0.53
बलेट    0.52
袢    0.52
hound    0.52
아니    0.51
बनवा    0.51
明治    0.49
Positive Logits
Rejected    1.04
rejected    0.96
rejection    0.86
rejected    0.86
Rejected    0.82
reject    0.79
reject    0.79
rejects    0.78
Reject    0.78
rejet    0.72
Activations Density 0.027%