Explanations
phrases indicating causality or reasoning
the word "since" in varying contexts
Negative Logits
Token               Logit
hack                -0.77
encia               -0.75
displayText         -0.74
dozen               -0.71
natureconservancy   -0.71
pec                 -0.69
abled               -0.69
usk                 -0.68
gallery             -0.67
ocaust              -0.67
Positive Logits
Token               Logit
rely                1.41
they                1.01
there               1.00
it                  0.91
neither             0.91
nobody              0.91
we                  0.83
everyone            0.77
otherwise           0.74
fewer               0.72
Activations Density: 0.051%
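
For a sense of scale, the density figure can be read as the fraction of sampled tokens on which the feature fires at all. The sketch below is illustrative only: the function name `activation_density`, the zero threshold, and the sample data are assumptions, not the dashboard's actual computation.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens with a nonzero
    activation; the dashboard may define or sample it differently.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: the feature fires on 51 of 100,000 tokens.
acts = [0.0] * 99_949 + [1.0] * 51
density = activation_density(acts)
print(f"{density:.3%}")  # prints 0.051%
```

Under this reading, a density of 0.051% corresponds to roughly one active token in every 2,000.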