INDEX
Explanations
Instruction, Human, Question
Negative Logits
FRI        0.47
самой      0.45
tych       0.44
essen      0.43
Koch       0.42
anything   0.41
nedeni     0.41
friends    0.40
after      0.40
taux       0.40
Positive Logits
Purpose    0.49
Purpose    0.48
प्रिल       0.46
준비       0.45
목적       0.44
脚本       0.44
问题       0.43
목         0.42
Question   0.41
작성       0.41
Activations Density 0.004%