INDEX
Explanations
humor that references specific cultural knowledge or events
Negative Logits
undef      -0.15
mdir       -0.14
ollo       -0.14
zent       -0.14
aris       -0.14
ARI        -0.14
代         -0.14
иÑģÑĤ      -0.14
radient    -0.13
代         -0.13
Positive Logits
detail      0.16
spotted     0.16
subtle      0.16
kke         0.15
iland       0.15
hidden      0.15
/reference  0.15
fle         0.15
èĽĽ         0.15
synchron    0.14
Activations Density: 0.034%