Explanations
terms related to safety and security
terms related to safety
Negative Logits
fred     -0.82
frey     -0.81
essee    -0.78
eric     -0.75
attr     -0.75
hun      -0.72
pel      -0.70
ette     -0.68
eds      -0.68
ional    -0.65
Positive Logits
safer           1.03
safest          1.02
safe            0.92
saf             0.86
redes           0.79
endanger        0.78
havens          0.78
alternatives    0.77
ashtra          0.76
safety          0.74
Activations Density: 0.006%