Explanations
instances of a specific character or symbol within the text
Negative Logits
behaviors    -0.20
rumor        -0.17
neighbors    -0.17
CENTER       -0.17
harbor       -0.17
neighbor     -0.17
Flavor       -0.16
rumors       -0.16
honored      -0.16
honor        -0.16
Positive Logits
global           0.18
personalised     0.18
isers            0.17
international    0.17
OT               0.16
Western          0.16
travelling       0.16
international    0.16
fusion           0.15
Western          0.15
Activations Density: 0.005%