INDEX
Explanations
comparing features in tables
Negative Logits
ouvoir  0.44
िनय  0.43
первых  0.41
oforte  0.41
ים  0.40
用の  0.40
füh  0.39
¬  0.39
→</  0.38
’:  0.38
Positive Logits
Feature  0.57
0.55
0.54
Features  0.54
0.52
특징  0.50
Characteristics  0.49
Features  0.49
FEATURE  0.49
0.49
Activations Density 0.010%