INDEX
Explanations
descriptions and explanations
content-bearing words (informational/descriptive tokens) that appear in expository or factual passages.
Negative Logits
沢        -0.06
ロー      -0.06
oods      -0.06
côt       -0.06
Alberto   -0.06
voice     -0.05
blobs     -0.05
edeki     -0.05
kvinde    -0.05
єш        -0.05
Positive Logits
--- ↵       0.07
無しさん    0.07
.UTF        0.07
ighbour     0.07
disclosing  0.07
enci        0.07
advising    0.06
related     0.06
Jong        0.06
็นต         0.06
Activations Density: 4.470%