INDEX
Explanations
references to categorization, evaluation, or classification of concepts or individuals
Negative Logits
Wald         -0.16
.pivot       -0.15
ØŃد          -0.15
olg          -0.15
ög           -0.14
Ñģим         -0.14
plements     -0.14
ILD          -0.14
distributed  -0.13
dziew        -0.13
Positive Logits
pile         0.21
list         0.21
bucket       0.20
ategori      0.19
camp         0.19
bucket       0.18
pile         0.17
amarin       0.17
category     0.17
ibox         0.17
Activations Density: 0.139%