Explanations
references to visual media and links to supplementary materials
Negative Logits
ambi        -0.14
íį¼         -0.12
ozem        -0.12
rug         -0.12
\Json       -0.12
ë»          -0.12
ëŀĺìĬ¤      -0.12
stav        -0.11
ynam        -0.11
rve         -0.11
Positive Logits
below       0.59
above       0.47
below       0.45
BELOW       0.42
beneath     0.40
blow        0.38
ниже        0.36
underneath  0.36
Below       0.36
bel         0.35
Activations Density: 0.142%