INDEX
Explanations
phrases indicating knowledge or familiarity with a topic
statements indicating shared knowledge or common understanding among the audience
Negative Logits
oreal      -0.80
cific      -0.73
ngth       -0.69
streng     -0.69
rontal     -0.68
vati       -0.67
orthy      -0.66
ongevity   -0.65
ihad       -0.64
bably      -0.64
Positive Logits
tale     0.76
about    0.68
how      0.67
that     0.67
by       0.63
why      0.60
anton    0.59
tales    0.58
Ced      0.57
what     0.56
Activations Density 0.126%