INDEX
Explanations
phrases expressing moral or ethical correctness
Negative Logits
TestMethod
-0.14
RefPtr
-0.14
цей
-0.14
nds
-0.14
型
-0.13
cio
-0.13
utters
-0.13
一步
-0.13
оит
-0.13
관리자
-0.13
Positive Logits
thing
0.98
things
0.85
Thing
0.80
thing
0.77
Thing
0.71
Things
0.69
Things
0.66
things
0.66
cosas
0.58
cosa
0.57
Activations Density 0.242%