Explanations
prohibiting harmful responses
Negative Logits
Token       Value
挀          0.42
Ears        0.40
Composer    0.40
顐          0.40
滖          0.39
ostino      0.39
composer    0.37
حصول        0.37
使用了      0.37
🍚          0.36
Positive Logits
Token       Value
dangerous   0.69
hate        0.65
Dangerous   0.61
hazardous   0.59
опас        0.57
dangereux   0.57
hates       0.55
dangere     0.54
gefähr      0.54
Hazardous   0.54
Activations Density: 0.138%