INDEX
Explanations
references to concepts related to morality and ethics
concepts and discussions surrounding morality and ethical principles
NEGATIVE LOGITS
Token         Logit
eding         -0.81
WER           -0.79
berry         -0.72
ept           -0.72
eds           -0.72
eworld        -0.71
upon          -0.71
location      -0.70
aways         -0.68
eded          -0.66
POSITIVE LOGITS
Token         Logit
contag        0.91
ocracy        0.83
guiActiveUn   0.78
morality      0.75
onomic        0.74
anship        0.73
Petr          0.70
srfAttach     0.69
ethics        0.69
onom          0.68
Activations Density: 0.009%