INDEX
Explanations
human intelligence and language
Negative Logits
l     0.97
to    0.91
v     0.91
j     0.86
of    0.84
g     0.82
ll    0.79
is    0.77
p     0.77
ك     0.75
Positive Logits
5             0.98
;             0.94
)             0.80
at            0.77
human         0.74
for           0.71
pharmacist    0.69
.             0.68
]             0.68
ר             0.68
Activations Density 0.043%