Explanations
primary intention or process
Assistant/model-response segments within a chat transcript (i.e., content from the model's turn rather than the user's).
Negative Logits
Neighbourhood 0.41
thomas 0.40
নিঃসন্দেহে 0.40
pozycji 0.39
THOMAS 0.39
obstáculos 0.39
Loksatta 0.39
alltid 0.38
玹 0.38
berlin 0.37
Positive Logits
Highlander 0.49
linien 0.47
معين 0.46
પતિ 0.46
itul 0.44
нюю 0.43
ii 0.42
idas 0.42
يك 0.41
उच्च 0.41
Activations Density 15.583%