INDEX
Explanations
phrases or conjunctions that suggest conditional or contrasting relationships
NEGATIVE LOGITS
Token     Logit
#ad       -0.17
ify       -0.15
ılı       -0.15
kick      -0.14
stub      -0.14
Ìģt       -0.14
utar      -0.14
šky       -0.14
nackte    -0.13
_COMPAT   -0.13
POSITIVE LOGITS
Token     Logit
wed       0.13
yna       0.13
yen       0.12
braska    0.12
im        0.12
superv    0.12
151       0.12
z         0.11
ī         0.11
159       0.11
Activations Density 0.130%
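
As a rough illustration of how a density figure like this could be read, the sketch below assumes that "activations density" means the fraction of sampled token positions on which the feature fires (activation above zero). That definition, the function name `activation_density`, and the sample data are all assumptions for illustration; they are not taken from the page itself.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: a position counts as "active" when its activation is strictly
    greater than `threshold` (0.0 by default).
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 10,000 token positions, 13 of which are active.
sample = [0.0] * 9987 + [1.5] * 13
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.130% would correspond to the feature firing on roughly 13 of every 10,000 tokens.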