INDEX
Explanations
multi-lingual abstract concepts
Negative Logits
Token        Logit
bao          0.39
promov       0.36
desta        0.35
verb         0.35
voz          0.35
non          0.35
motto        0.35
tomb         0.34
g            0.34
glorified    0.34
Positive Logits
Token        Logit
ar           0.54
sthe         0.53
el           0.52
an           0.50
ות           0.48
н            0.48
y            0.48
al           0.48
ofthe        0.48
k            0.47
Activations Density: 0.001%