Explanations
phrases indicating existence or presence of certain entities within a context
Negative Logits
للمعارف    -0.75
Neville    -0.68
lemb    -0.65
kereszt    -0.64
piş    -0.64
Vertrauen    -0.64
Cardona    -0.63
houſe    -0.63
geloof    -0.62
bVar    -0.61
Positive Logits
der    1.09
Der    0.97
Die    0.95
Οι    0.94
Der    0.93
Die    0.91
DER    0.89
dieser    0.89
ihrer    0.85
Οι    0.84
Activations Density: 0.027%