Explanations
mentions of safety-related issues and concerns in various contexts
Negative Logits
Token        Logit
Butcher      -0.61
Wand         -0.58
nude         -0.57
Rand         -0.57
cameo        -0.56
oret         -0.56
Naked        -0.56
shepherd     -0.55
Pound        -0.55
soundtrack   -0.55
Positive Logits
Token         Logit
levels        0.85
barriers      0.77
itism         0.76
morale        0.75
flows         0.75
pathways      0.73
ahime         0.73
expectations  0.72
nationwide    0.71
among         0.71
Activations Density 0.177%
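
The density figure above is not defined on this page. A minimal sketch of one common reading, assuming it means the share of sampled tokens on which the feature's activation is above zero, is shown here. The function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of entries whose activation exceeds `threshold`.

    Assumption: "density" means the fraction of tokens on which the
    feature fires at all; the dashboard itself does not state this.
    """
    activations = list(activations)
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Made-up data: 3 active tokens out of 1000.
sample = [0.0] * 997 + [1.2, 0.4, 3.1]
print(f"{activation_density(sample):.3f}%")  # prints 0.300%
```

Under this reading, a density of 0.177% would correspond to the feature firing on roughly 1 token in 565.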