NEGATIVE LOGITS
Token         Logit
Safety        0.50
safety        0.46
安全          0.46
parfois       0.46
ethics        0.46
Boundary      0.46
ethics        0.45
verantwort    0.44
सुरक्षित       0.44
please        0.43
POSITIVE LOGITS
Token         Logit
Explicit      0.55
Detailed      0.48
goes          0.48
explicitly    0.47
contradicts   0.47
explicit      0.47
Goes          0.46
Platforms     0.46
LI            0.45
Directly      0.45
Activations Density: 0.008%