Explanations
words related to items that are processed or transformed into something else
references to future events and expected outcomes
Negative Logits
emphasizing     -0.68
emphasizes      -0.63
SOURCE          -0.63
SourceFile      -0.62
Intervention    -0.62
injecting       -0.61
imposing        -0.61
mindful         -0.60
Respons         -0.59
cffffcc         -0.59
Positive Logits
fetch           1.10
expire          1.08
circulate       1.07
langu           1.07
belonged        1.06
belong          1.00
disappear       0.95
vanish          0.95
undergo         0.94
arrive          0.93
Activations Density: 0.313%