Explanations
proper nouns, particularly names of people
Negative Logits
Houſe      -0.87
Perſ       -0.86
Conſ       -0.84
Diſ        -0.82
Reſ        -0.79
Shaksp     -0.79
houſe      -0.78
pleaſure   -0.74
Theſe      -0.73
greateſt   -0.73
Positive Logits
")));        1.05
]));         0.97
'])){        0.97
')));        0.95
Искәрмәләр   0.91
'));         0.89
/$',         0.87
"];          0.84
}';          0.84
'))          0.83
Activations Density: 0.384%