Explanations
distinctions and characteristics in classifications
Negative Logits
salopes      -0.17
anou         -0.15
ár           -0.14
riad         -0.14
echang       -0.14
feder        -0.14
_Reference   -0.13
deb          -0.13
alia         -0.13
escort       -0.13
Positive Logits
rather       0.23
just         0.20
paradox      0.19
rather       0.19
implicit     0.17
plus         0.17
contra       0.17
plutôt       0.17
bien         0.16
intuit       0.16
Activations Density: 0.037%