INDEX
Explanations
terms related to virtue and virtuous behaviors
Negative Logits
Token         Logit
itz           -0.16
lds           -0.15
prehensive    -0.15
lectic        -0.15
atters        -0.15
xét           -0.15
á»Ĩ           -0.15
é»            -0.15
ary           -0.14
Ã¥l           -0.14
Positive Logits
Token         Logit
ually         0.29
uous          0.22
virt          0.20
ues           0.19
ue            0.19
oso           0.17
tual          0.17
utos          0.17
usize         0.17
udes          0.17
Activations Density: 0.003%