INDEX
Explanations
comparative phrases about effectiveness or preference
Negative Logits
ysa      -0.16
ãĤ¥      -0.16
Leer     -0.15
rious    -0.15
ocoa     -0.15
pta      -0.14
ç·ı      -0.14
uitka    -0.14
itra     -0.14
rror     -0.14
Positive Logits
chamber   0.15
ington    0.15
æŃ        0.14
wg        0.14
خش        0.14
evin      0.14
hoff      0.14
ISO       0.13
hazard    0.13
Clifford  0.13
Activations Density 0.175%