Explanations
proper nouns and specific phrases
Negative Logits
Token      Logit
orate      -0.71
raq        -0.69
alysed     -0.67
icia       -0.67
orio       -0.67
arse       -0.66
orable     -0.66
raught     -0.66
oreal      -0.64
orative    -0.63
Positive Logits
Token      Logit
THERE      1.03
there      0.96
neither    0.86
nobody     0.84
there      0.84
although   0.83
"[         0.78
none       0.75
THEY       0.73
ecause     0.72
Activations Density: 1.969%
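
An activation density figure like the one above is commonly read as the fraction of sampled token positions on which the feature fires. The sketch below illustrates that reading only. The function name `activation_density`, the zero threshold, and the sample data are all hypothetical, since the source does not say how the 1.969% figure was computed.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens with a nonzero activation.
    The source page does not define the metric.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical activations for 8 token positions; 3 of them are active.
sample = [0.0, 0.0, 1.2, 0.0, 0.4, 0.0, 0.0, 2.1]
density = activation_density(sample)
print(f"{density:.3%}")  # prints 37.500%
```

Under this reading, a density of 1.969% would mean the feature is active on roughly 2 of every 100 sampled tokens.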