INDEX
Explanations
phrases related to contrasting or specifying different categories or options
references to relationships and social connections
Negative Logits
Token     Logit
cki       -0.67
rave      -0.67
WF        -0.57
haw       -0.56
KI        -0.55
itute     -0.55
WD        -0.55
ady       -0.55
Skip      -0.53
RM        -0.52
Positive Logits
Token          Logit
etc            1.28
etc            1.15
whatever       0.95
ĪĴ             0.82
blah           0.80
whatever       0.76
respectively   0.75
Allah          0.74
whichever      0.74
=-=-=-=-       0.71
Activations Density: 0.397%