INDEX
Explanations
terms related to safety and security
Negative Logits
Token           Logit
TestBed         -0.51
ContentAsync    -0.48
kautta          -0.46
ANDUM           -0.45
volna           -0.45
mío             -0.44
szól            -0.41
imeni           -0.40
ckså            -0.40
gärna           -0.40
Positive Logits
Token           Logit
Safety          0.77
Safety          0.73
safety          0.73
SAFETY          0.70
SAFETY          0.70
safety          0.68
SAFE            0.65
安全            0.60
Safe            0.59
safe            0.59
Activations Density: 0.018%