INDEX
Explanations
words related to significant events or actions
words indicating significant actions or events
Negative Logits
Token     Logit
rea       -0.67
conn      -0.66
leases    -0.64
rone      -0.63
aly       -0.63
zh        -0.61
lords     -0.60
rel       -0.60
Newman    -0.60
sw        -0.59
Positive Logits
Token     Logit
ometimes  1.05
hift      0.95
paces     0.93
omething  0.87
heet      0.86
creen     0.84
ilver     0.83
pace      0.80
hirt      0.79
psey      0.71
Activations Density: 0.655%