Explanations
questions starting with 'how'
Negative Logits
Token          Logit
advertisement  -0.68
76561          -0.67
iculture       -0.67
Supplement     -0.66
UME            -0.66
ograph         -0.63
icipated       -0.61
izu            -0.59
UM             -0.58
inian          -0.58
Positive Logits
Token          Logit
much           1.07
ls             0.97
beit           0.94
prevalent      0.93
messed         0.92
far            0.89
resilient      0.89
MUCH           0.86
fragile        0.85
itzer          0.82
Activations Density: 0.068%