INDEX
Explanations
words associated with conscience and moral responsibility
Negative Logits
kon       -0.17
pillar    -0.16
Ìĥ        -0.16
utto      -0.15
rea       -0.15
à¹Īว     -0.14
ignet     -0.14
ilar      -0.14
aping     -0.14
iled      -0.14
Positive Logits
front        0.28
front        0.20
dem          0.19
-front       0.19
desc         0.19
sequ         0.18
Front        0.17
descending   0.17
science      0.17
-dem         0.17
Activations Density 0.025%