Explanations
words related to consistent behavior or persistence
phrases indicating consistency or permanence
Negative Logits
Token      Logit
SG         -0.74
iants      -0.74
LAN        -0.73
IDA        -0.72
Sunder     -0.68
IDs        -0.68
Tags       -0.67
OW         -0.67
atorium    -0.66
hole       -0.66
Positive Logits
Token        Logit
entimes      0.88
appreciated  0.83
theless      0.81
behaved      0.76
conclud      0.76
always       0.74
forg         0.73
obey         0.73
sensed       0.72
evolving     0.72
Activations Density 0.025%