INDEX
Explanations
discussions about complex relationships and societal issues
Negative Logits
wich      -0.15
oleč      -0.15
ucs       -0.14
ะ         -0.14
concrete  -0.14
_KIND     -0.14
inh       -0.14
lit       -0.13
polished  -0.13
Lit       -0.13
Positive Logits
merely      0.34
mere        0.28
simplement  0.28
mere        0.28
simply      0.25
harmless    0.21
просто      0.20
Simply      0.20
只是        0.19
Simply      0.19
Activations Density 0.227%