INDEX
Explanations
words related to evidence or documentation
Negative Logits
Token     Logit
ild       -0.60
las       -0.58
éĸ        -0.55
chwitz    -0.55
mun       -0.54
othe      -0.53
ãĤ§       -0.53
ror       -0.53
ron       -0.52
irable    -0.52
Positive Logits
Token            Logit
theless          0.59
everywhere       0.58
consisted        0.54
Everywhere       0.50
onwards          0.49
lasted           0.47
consists         0.45
notwithstanding  0.44
extensively      0.44
misled           0.43
Activations Density: 1.294%