Negative Logits
Contribution  0.47
UpSync  0.42
푎  0.42
Anybody  0.41
ᶤ  0.41
contribution  0.41
ੋਰ  0.40
нк  0.40
صفحات  0.40
Anybody  0.40
Positive Logits
OpenAI  0.74
回答  0.72
GPT  0.65
openai  0.61
답변  0.61
openai  0.60
応答  0.60
response  0.59
response  0.57
refusing  0.57
Activations Density 0.050%