Explanations
phrases indicating agreement or alignment
statements and assertions about equivalence or similarity across different subjects or contexts
Negative Logits
Token         Logit
zos           -0.73
ded           -0.60
watch         -0.59
ichen         -0.55
urat          -0.54
interrupts    -0.54
ffff          -0.54
stru          -0.53
agram         -0.52
NetMessage    -0.52
Positive Logits
Token         Logit
ï¸ı           0.83
Nationwide    0.71
everywhere    0.66
unity         0.64
pn            0.63
ivism         0.62
oat           0.62
applies       0.61
sburgh        0.61
blance        0.61
Activations Density 0.188%