INDEX
Explanations
helpful AI assistant refusals
Negative Logits
callSettings    0.42
MaterialApp     0.38
freshest        0.38
乘以            0.38
말미암아        0.38
的我            0.37
morally         0.36
births          0.36
antérieur       0.36
antigenic       0.36
Positive Logits
chatbot         0.63
chatbot         0.57
assistant       0.52
AI              0.49
chatbots        0.49
guide           0.48
безопас         0.48
AI              0.46
helpful         0.46
innocuous       0.46
Activations Density 0.021%