Explanations
internet culture descriptors
Negative Logits
fucking    0.55
fuck       0.48
🪄         0.46
asshole    0.46
nigga      0.46
prettier   0.46
piss       0.44
伱         0.44
charmed    0.44
🤎         0.44
Positive Logits
potatoes   0.56
🥔         0.54
Potatoes   0.52
potato     0.50
potato     0.49
Potato     0.48
Potato     0.47
🦄         0.46
🥑         0.46
ams        0.45
Activations Density: 0.015%