INDEX
Explanations
Chinese health queries and JSON keywords
Sentences in which the assistant asserts that it is a safe/helpful AI and refuses, or explains why it cannot comply (safety/refusal boilerplate).
Negative Logits
pumpkin     0.54
PACKAGE     0.50
ERSHIP      0.50
ាត់     0.50
ే     0.48
crispy      0.48
avoidable   0.48
ach         0.46
inars       0.46
اك     0.46
Positive Logits
z           0.57
ک     0.57
gpt         0.56
Convers     0.55
ஜ     0.54
María       0.52
O           0.52
Robot       0.52
mig         0.52
openai      0.52
Activations Density 2.839%