NEGATIVE LOGITS
Token        Logit
ethics       0.47
Ethics       0.46
morals       0.45
ethics       0.43
semi         0.41
ethical      0.41
Ethics       0.40
ethically    0.40
morality     0.40
simplified   0.40
POSITIVE LOGITS
Token        Logit
safe         0.94
安全         0.86
सुरक्षित      0.86
safe         0.82
безопас      0.80
Safe         0.79
Safe         0.78
안전         0.77
bezpie       0.75
安全         0.72
Activations Density: 0.060%