INDEX
Explanations
words relating to principles, ethics, or moral considerations
Negative Logits
myster      -0.78
vulner      -0.76
sacrific    -0.74
limb        -0.74
mathemat    -0.73
writ        -0.73
trainers    -0.70
conduc      -0.69
builders    -0.69
destro      -0.68
Positive Logits
ï¸ı             1.31
vernment        1.04
SpaceEngineers  0.95
lean            0.95
log             0.92
ove             0.91
ï¸              0.91
ËĪ              0.90
deg             0.89
âĹ¼             0.89
Activations Density 0.036%