INDEX
Explanations
genuine or true + positive concept
Negative Logits
𝙥  1.36
𝙤  1.36
twos  1.33
lesion  1.30
𝒐  1.30
িণ  1.29
refrain  1.27
𝑛  1.26
vien  1.26
нци  1.25
Positive Logits
ところ  1.65
estate  1.61
paar  1.58
politik  1.58
অর্  1.53
পক্ষ  1.52
ignment  1.49
ligen  1.43
鍮  1.41
यल  1.41
Activations Density: 0.165%