Explanations
discussions about social responsibility and moral dilemmas
Negative Logits
uide            -0.14
Lester          -0.13
678             -0.13
Ezra            -0.13
cou             -0.13
iasi            -0.13
èµ·             -0.13
824             -0.12
SCP             -0.12
/Instruction    -0.12
Positive Logits
åĨµ             0.17
Nor             0.15
æ³ģ             0.15
nor             0.15
illac           0.15
Plus            0.15
Anyway          0.15
Plain           0.14
plain           0.14
buzz            0.14
Activations Density 0.169%
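
A few entries in the logit lists (èµ·, åĨµ, æ³ģ) look like raw byte-level BPE token strings rather than readable text. The page does not say which tokenizer the model uses, so the following is only a sketch under the assumption that tokens are displayed with the GPT-2-style byte-to-unicode mapping; the helper names `gpt2_byte_decoder` and `decode_token` are hypothetical and not part of any dashboard API.

```python
def gpt2_byte_decoder():
    """Map each display character back to its byte value.

    Assumption: the tokenizer uses the GPT-2 byte-to-unicode table.
    """
    # Bytes that are displayed as themselves: printable ASCII and most of Latin-1.
    keep = list(range(0x21, 0x7F)) + list(range(0xA1, 0xAD)) + list(range(0xAE, 0x100))
    kept = set(keep)
    decoder = {chr(b): b for b in keep}
    # Every remaining byte is displayed as a code point starting at U+0100.
    n = 0
    for b in range(256):
        if b not in kept:
            decoder[chr(256 + n)] = b
            n += 1
    return decoder


BYTE_DECODER = gpt2_byte_decoder()


def decode_token(token: str) -> str:
    """Turn a displayed token string into the text it encodes.

    Raises KeyError if the token contains a character outside the
    256-entry table; incomplete UTF-8 sequences become U+FFFD.
    """
    raw = bytes(BYTE_DECODER[ch] for ch in token)
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Escapes are used so the example does not depend on how this file is encoded.
    for shown in ["\u00e8\u00b5\u00b7", "\u00e5\u0128\u00b5", "\u00e6\u00b3\u0123", "Nor"]:
        print(repr(shown), "->", repr(decode_token(shown)))
```

If the assumption holds, each of the three garbled-looking entries decodes to a single CJK character, while plain ASCII tokens such as Nor pass through unchanged.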