INDEX
Explanations
prioritize safety, society, efficiency
Negative Logits
closer       0.38
andelion     0.37
nul          0.37
rings        0.37
Bast         0.37
through      0.37
embracing    0.37
new          0.35
Reception    0.35
Reception    0.35
Positive Logits
sacrificed       1.05
sacrifice        1.01
sacrificing      0.98
приорите         0.94
prioritizing     0.91
sacrific         0.90
prioritization   0.90
牺牲             0.90
sacr             0.88
prioritized      0.88
Activations Density 0.465%