INDEX
Explanations
conjunctions then new concepts
NEGATIVE LOGITS
nejen          0.25
there          0.24
says           0.24
What           0.23
{              0.22
they           0.22
ilyen          0.22
vorige         0.22
وت             0.21
hva            0.21
POSITIVE LOGITS
unwillingness  0.50
inability      0.44
unwavering     0.43
willingness    0.41
reliance       0.41
отсутствие     0.41
良好的          0.38
unrelenting    0.37
reluctance     0.36
возможность    0.36
Activations Density: 0.321%