INDEX
Explanations
terms related to theoretical concepts and academic theories
Negative Logits
Token     Logit
itude     -0.18
ello      -0.17
stones    -0.17
itan      -0.16
ellan     -0.16
ned       -0.16
ening     -0.16
own       -0.15
engers    -0.15
acre      -0.15
Positive Logits
Token     Logit
ically    0.18
rence     0.17
/pr       0.16
czy       0.16
779       0.16
پرداز     0.16
ical      0.15
838       0.15
/model    0.15
ICAL      0.15
Activations Density 0.026%