INDEX
Explanations
phrases indicating uncertainty or lack of knowledge
Negative Logits
yc            -0.15
воÑĤ          -0.14
ertain        -0.14
erten         -0.14
indh          -0.14
byn           -0.13
chrom         -0.13
åłĤ           -0.13
responseBody  -0.13
ista          -0.13
Positive Logits
if            0.24
how           0.24
exactly       0.21
why           0.21
where         0.20
whether       0.19
anyone        0.18
what          0.18
if            0.18
anybody       0.18
Activations Density 0.051%