INDEX
Explanations
words related to separation and distinctiveness
Negative Logits
wonders     -0.68
enegger     -0.68
ãĥ¥         -0.64
herty       -0.61
nz          -0.61
mA          -0.59
bye         -0.58
rouse       -0.57
notation    -0.57
etics       -0.56
Positive Logits
separating  0.80
sexes       0.77
between     0.77
hairs       0.76
apart       0.75
owship      0.74
Between     0.73
from        0.72
aration     0.72
icular      0.71
Activations Density 0.045%
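
The page does not say how the density figure is computed. A minimal sketch, assuming it is the fraction of sampled token positions on which the feature's activation is above zero; the function name, the threshold, and the toy data are all hypothetical and only illustrate the arithmetic:

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of activations strictly above `threshold`.

    Assumption: "density" means the share of token positions where the
    feature fires at all. The source page does not confirm this definition.
    """
    values = list(activations)
    if not values:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in values if a > threshold)
    return active / len(values)


# Made-up toy data: the feature fires on 9 of 20,000 token positions.
toy_activations = [1.5] * 9 + [0.0] * 19_991
density = activation_density(toy_activations)
print(f"{density:.3%}")
```

Under this reading, a density of 0.045% would mean the feature is active on roughly 4 or 5 tokens out of every 10,000 sampled.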