Explanations
repeated mentions of the word "any"
Negative Logits
elig        -0.69
isable      -0.69
dilig       -0.66
fung        -0.66
PDATE       -0.60
unemploy    -0.60
ITNESS      -0.60
reditary    -0.60
leptin      -0.59
surv        -0.58
Positive Logits
where       1.03
one         0.92
ika         0.92
body        0.87
emi         0.83
uan         0.83
elle        0.81
heter       0.81
thood       0.80
THING       0.80
Activations Density: 0.005%