INDEX
Explanations
action verbs followed by details or implications
Negative Logits
в    0.48
на    0.46
л    0.45
ك    0.43
لين    0.43
아    0.43
а    0.42
ता    0.42
ts    0.42
ת    0.42
Positive Logits
孪    0.49
eV    0.48
বীজ    0.47
sikker    0.47
蒎    0.46
säker    0.44
南北    0.44
highway    0.44
buttonAnimation    0.44
heller    0.43
Activations Density: 0.002%