INDEX
Explanations
explanation: contrast, comparison, metaphor, assumption
Negative Logits
各类  0.73
Apps  0.64
各种  0.64
современных  0.64
各種  0.61
bagai  0.60
CRUD  0.60
třeba  0.59
các  0.59
dalších  0.59
Positive Logits
namely  0.66
namelijk  0.64
involunt  0.62
undermined  0.61
undermines  0.61
elicited  0.60
falsehood  0.59
undermine  0.59
和一个  0.58
pretended  0.57
Activations Density 0.004%