Explanations
descriptive, companion, loads, vast, Prints
Activates on tokens within the assistant's long, content-rich instructional or explanatory responses, that is, model-generated explanatory text.
Negative Logits
আলোচ    0.46
Palash    0.45
кои    0.44
ToProps    0.43
)».    0.43
assapi    0.43
बालिका    0.42
什麼    0.42
這邊    0.42
ബാല    0.42
Positive Logits
'    0.46
(no visible token)    0.40
<body>    0.39
g    0.39
குழு    0.39
ণ    0.38
dit    0.37
passos    0.37
Group    0.37
(no visible token)    0.36
Activations Density 0.701%