Explanations
repeated patterns or consistencies
the word "consistently" and similar phrases indicating reliability or frequency
Negative Logits
ster            -0.82
sters           -0.74
Mour            -0.67
soc             -0.67
eral            -0.65
tein            -0.64
OTOS            -0.63
ja              -0.62
wife            -0.62
Jac             -0.62
Positive Logits
ォ               0.81
outper           0.81
worshipped       0.77
conclud          0.76
rated            0.75
underestimated   0.75
impressed        0.73
maintained       0.72
nces             0.71
evolve           0.71
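The two logit lists rank vocabulary tokens by how strongly this feature directly pushes their output logits down or up. A common way dashboards compute such lists, assumed here rather than confirmed by this page, is to project the feature's decoder vector onto the unembedding matrix and take the k smallest and largest entries. The sketch below shows that projection. The function name `top_logit_effects` and its arguments are hypothetical, and any normalization or LayerNorm folding the real dashboard may apply is omitted.

```python
import numpy as np

def top_logit_effects(w_dec_row, W_U, vocab, k=10):
    """Rank tokens by a feature's direct effect on the output logits.

    Assumption: the effect is the plain projection w_dec_row @ W_U,
    with no normalization (the dashboard's exact method is not stated).

    w_dec_row: (d_model,) decoder direction of one feature
    W_U:       (d_model, d_vocab) unembedding matrix
    vocab:     list of d_vocab token strings
    Returns (negative, positive), each a list of (token, value) pairs.
    """
    logits = w_dec_row @ W_U                  # (d_vocab,) per-token logit effect
    order = np.argsort(logits)                # indices sorted ascending by effect
    negative = [(vocab[i], float(logits[i])) for i in order[:k]]
    positive = [(vocab[i], float(logits[i])) for i in order[::-1][:k]]
    return negative, positive

# Toy usage with a 2-dimensional residual stream and a 3-token vocabulary.
w = np.array([1.0, 0.0])
W_U = np.array([[0.5, -0.2, 0.9],
                [0.0,  0.0, 0.0]])
neg, pos = top_logit_effects(w, W_U, ["a", "b", "c"], k=1)
print(neg, pos)
```

Sorting the full projection once and slicing both ends keeps the negative and positive lists consistent with each other; for a real vocabulary of tens of thousands of tokens, `np.argpartition` would avoid the full sort.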
Activation density: 0.024%
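Activation density is the share of sampled tokens on which the feature fires, so 0.024% corresponds to about 24 active tokens per 100,000. The minimal sketch below computes it, assuming a token counts as active when its activation is strictly greater than zero; the threshold used by the dashboard and the name `activation_density` are assumptions for illustration.

```python
import numpy as np

def activation_density(acts, threshold=0.0):
    """Percentage of tokens whose activation exceeds `threshold`.

    acts:      the feature's activations, one value per sampled token
    threshold: activation level above which a token counts as active
               (assumed to be 0.0; the dashboard's cutoff is not stated)
    """
    acts = np.asarray(acts, dtype=float)
    return 100.0 * np.count_nonzero(acts > threshold) / acts.size

# Toy usage: 24 active tokens out of 100,000 gives a density of about 0.024%.
acts = np.zeros(100_000)
acts[:24] = 1.0
print(activation_density(acts))
```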