Explanations
medical or other professional advice
Sentences or phrases where the model refuses harmful requests and provides safety guidance, resource links, and crisis/support information.
Negative Logits
இருந்தாலும்  0.45
Trends  0.42
অ্যান্ড্র  0.41
Experience  0.41
Library  0.40
अक्सर  0.40
varietà  0.40
போலவே  0.39
便利  0.39
Crypt  0.38
Positive Logits
urgently  0.53
murderous  0.53
manifestly  0.50
Neces  0.50
IMMEDI  0.49
endangering  0.48
perpetrators  0.48
absolutamente  0.48
dringend  0.48
urgente  0.47
Activations Density 0.388%