INDEX
Explanations
discussion of policy and value
Negative Logits
ك  0.50
uerpo  0.47
گ  0.45
دي  0.44
الب  0.44
ب  0.43
pate  0.43
الك  0.43
([-  0.43
الس  0.43
Positive Logits
अधिकांश  0.44
🟦  0.41
이제  0.41
🛸  0.40
<unused2150>  0.40
🩶  0.40
극장  0.39
🤎  0.39
💟  0.39
🫶  0.39
Activations Density: 0.000%