INDEX
Explanations
adjectives expressing lack of harm or danger
terms related to the concepts of harmlessness and benignity
Negative Logits
KER        -0.70
GPU        -0.69
lining     -0.66
funding    -0.64
Cla        -0.64
Pain       -0.62
pain       -0.62
ingo       -0.61
lin        -0.61
Connell    -0.60
Positive Logits
harmless    1.04
innocuous   0.92
benign      0.85
»Ĵ          0.81
minded      0.81
alty        0.78
bystand     0.77
ality       0.76
mate        0.71
ishable     0.69
Activations Density 0.016%
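
The page does not say how the density figure is computed. A minimal sketch, assuming it is the fraction of sampled tokens on which the feature's activation is above zero (the function name, the threshold parameter, and the sample data are all hypothetical and chosen only for illustration):

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of sampled tokens on which the
    feature fires at all. The source page does not confirm this definition.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 10,000 tokens, of which only 2 activate the feature.
acts = [0.0] * 10000
acts[17] = 3.2
acts[4242] = 0.8

density = activation_density(acts)
print(f"{density:.3%}")  # prints 0.020%
```

Under this reading, a density of 0.016% would mean the feature fires on roughly 16 of every 100,000 sampled tokens.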