Many programs can benefit from shared context. You can push all the shared context to your database, but it's often helpful to keep some shared data structures around for performance reasons. For example, you can cache pure Python functions using functools.lru_cache and share such caches between threads, but those caches can't be shared between processes. In-process data structures like dicts and lists are much, much faster than alternatives like memcached and redis because they avoid the overhead of IPC and deserialization, and they're also easier to use since they're built in.
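A minimal sketch of what I mean (the function body is just a stand-in for real work):

```python
import functools
import threading

@functools.lru_cache(maxsize=1024)
def expensive_lookup(key):
    # Stand-in for a pure, expensive computation.
    return sum(ord(c) for c in key) ** 2

def worker(keys):
    for k in keys:
        expensive_lookup(k)  # every thread hits the same in-process cache

threads = [threading.Thread(target=worker, args=(["a", "b", "c"],))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# One cache, shared by all threads in the process; a forked or
# spawned child process would get its own independent copy.
print(expensive_lookup.cache_info())
```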
dicts and lists are fast, but that doesn't mean a threaded approach will be, because once you protect those dicts and lists with locking, your program will now be waiting on locks. What are you trying to do with memcached that it isn't fast enough for?
> What are you trying to do with memcached that it isn't fast enough for?
Fine-grained caching of objects that correspond to DB rows. Most pages touch hundreds of DB rows, due to the various relationships between objects. With memcached, you have to cache at a higher granularity and contort your code quite a bit to reduce the number of gets per request.
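Concretely, the shape is something like this hypothetical `RowCache` (the name and the `loader` callback are made up for illustration; a real version also needs write invalidation and a size bound):

```python
import threading

class RowCache:
    """Hypothetical in-process cache of row objects keyed by
    (table, primary_key)."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()

    def get_or_load(self, table, pk, loader):
        key = (table, pk)
        with self._lock:
            row = self._rows.get(key)
        if row is None:
            row = loader(pk)  # one fine-grained DB fetch on a miss
            with self._lock:
                self._rows[key] = row
        return row
```

A hit is just a dict lookup. With memcached, every one of those hundreds of per-page gets is a network round trip plus deserialization, which is why you end up caching at a coarser granularity instead.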
> dicts and lists are fast, but ... your program will now be waiting on locks
In my experience, the overhead of locking is often negligible. In Java-land, you can have millions of lock operations per second. IPC involves serialization, deserialization, and context switching, in addition to actual work. Most IPC routines are built on locks, anyway.
The overhead of a lock operation itself is completely distinct from the impact of waiting on contended locks, and it's only the former that those numbers measure. Millions of lock operations per second is irrelevant if you actually have shared state to protect: if you are actually USING locks, then you have threads waiting on them.
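A crude sketch of the distinction (numbers are illustrative only; the contended case just yields while holding the lock, as a stand-in for real work on shared state):

```python
import threading
import time

lock = threading.Lock()
N = 1_000_000

# Uncontended: one thread acquiring and releasing N times.
start = time.perf_counter()
for _ in range(N):
    with lock:
        pass
print(f"uncontended: {N / (time.perf_counter() - start):,.0f} lock ops/sec")

# Contended: four threads fighting over the same lock, each
# yielding while it holds the lock to simulate doing work.
def worker():
    for _ in range(N // 4):
        with lock:
            time.sleep(0)

start = time.perf_counter()
threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"contended:   {N / (time.perf_counter() - start):,.0f} lock ops/sec")
```

The first number is the "millions of lock operations per second" claim; the second is what your program actually experiences once threads share state.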