We're super grateful that we had access to 10gen's knowledgeable engineering team on a warm sunny Autumn weekend, when they could have spent the weekend at the beach enjoying the sunshine. Nothing beats having a go-to 10gen engineer to bounce ideas off once you've googled the hell out of the Internet.
Over the course of the weekend, Herb and I made huge gains in our understanding of core Mongo concepts and got feedback on design patterns and schema design. This is a wrap-up of what we've learnt (remember, these lessons are all specific to our project, News Maven: http://newsmaven.co, and may not apply to your project per se).
Our app is, in essence, an RSS aggregator and reader. We have Mongoose models for Blog, Post, User and Folder, which in MongoDB map to the following collections: blogs, posts, users, folders.
We have a worker thread which, at a specified time interval, goes and grabs the RSS feed for each blog in the system and updates our posts collection with new articles (aka 'posts').
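As a rough sketch of what one pass of that worker has to work out, assuming each parsed feed item carries a unique link (the `link` field and the `findNewArticles` helper are made up for illustration; this is not our actual worker code):

```javascript
// Hypothetical helper: given the links already stored for a blog and the
// items parsed from its RSS feed, return only the items we have not seen.
function findNewArticles(knownLinks, feedItems) {
  const known = new Set(knownLinks);
  const fresh = [];
  for (const item of feedItems) {
    if (!known.has(item.link)) {
      known.add(item.link); // also de-duplicates repeats within one feed
      fresh.push(item);
    }
  }
  return fresh;
}

// Example pass over one blog's feed.
const stored = ['http://example.com/a'];
const parsed = [
  { link: 'http://example.com/a', title: 'Old post' },
  { link: 'http://example.com/b', title: 'New post' },
];
const newPosts = findNewArticles(stored, parsed);
console.log(newPosts.map((p) => p.title)); // [ 'New post' ]
```

Only the items returned here would need to be inserted into the posts collection.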
Whilst the actual idea is rather novel, there are some interesting issues that crop up, and MongoDB really lent itself to solving them. Consider how to store the read/unread status of a blog's posts on a per-user basis.
There are three potential designs we could consider using:
Design one: Store read state per user, per blog subscription
In this scenario we would have a model that looks like this:
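var mongoose = require('mongoose');
var Schema = mongoose.Schema;
var ObjectId = Schema.Types.ObjectId;

// One document per user, per blog subscription
var ReadSchema = new Schema({
  userId: { type: ObjectId, ref: 'User' },
  blogId: { type: ObjectId, ref: 'Blog' },
  read: [{ type: ObjectId, ref: 'Post' }] // ids of the posts this user has read
});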
The way this would work is that each time a user reads a blog post, an HTTP POST request is made to the server to indicate that a particular post has been read.
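A minimal sketch of what the handler behind that POST could do, assuming a Mongoose model compiled from ReadSchema is passed in as `Read` (the `markRead` name, and the choice of `$addToSet` with an upsert, are my assumptions rather than what News Maven necessarily ships):

```javascript
// Record that `postId` has been read by `userId` on `blogId`.
// `Read` is assumed to be a Mongoose model built from ReadSchema.
function markRead(Read, userId, blogId, postId) {
  return Read.updateOne(
    { userId: userId, blogId: blogId }, // one document per user, per blog
    { $addToSet: { read: postId } },    // add the id only if it is not there yet
    { upsert: true }                    // create the document on the first read
  );
}
```

Using `$addToSet` keeps the request idempotent, so a post that gets reported twice is still stored once.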
Pros:
- Really simple to insert (we just append the post's ObjectId to the end of the read array)
- If something gets marked as unread again, no problem: we just remove its ObjectId from the read array
Cons:
- The unread count has to be computed as the difference between what has been read and what is still to be read, and this has to be done for each user and for each blog they subscribe to. Not an ideal scenario.
The pseudocode would be something like this:
function getUnread() {
    get ObjectIds for the Blog's last 5000 articles;
    for each ObjectId not in the Read Array, add to Unread Array;
    return Unread Array and Unread Array count;
}
This has to be computed for each blog the user subscribes to, and it's two DB reads per blog (one to fetch the 5000 articles, then another to fetch the current Read Array). If a user has 30 blog subscriptions on average, that's 60 DB calls per user.
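To make the cost concrete, here is a plain JavaScript sketch of the set difference and of the call count, run on in-memory arrays rather than a real database (`computeUnread` and `dbCallsPerUser` are illustrative names, not functions from our codebase):

```javascript
// Set difference between a blog's recent post ids and the ids a user has read.
// Ids are compared as strings, so ObjectIds (via toString) and plain strings
// both work.
function computeUnread(recentPostIds, readIds) {
  const read = new Set(readIds.map(String));
  const unread = recentPostIds.filter((id) => !read.has(String(id)));
  return { unread: unread, count: unread.length };
}

// Two reads per blog: one for the recent posts, one for the Read document.
function dbCallsPerUser(subscriptions) {
  return 2 * subscriptions;
}

console.log(computeUnread(['p1', 'p2', 'p3'], ['p2']));
// { unread: [ 'p1', 'p3' ], count: 2 }
console.log(dbCallsPerUser(30)); // 60
```

The difference itself is cheap once the data is in memory; it is the two round trips per subscription that add up.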
Design two: Store unread state per user, per blog subscription
Model: Unread
{
  userId: reference to a User (ObjectId),
  blogId: reference to a Blog (ObjectId),
  unread: Array of unread Post ObjectIds
}
In the above scenario, when a feed gets updated, we go to each user and update their unread Array.
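A hedged sketch of that fan-out, assuming a Mongoose model for the Unread documents is passed in as `Unread` (the `fanOutNewPosts` name is made up for illustration, and `$addToSet` with `$each` is just one way to write the update):

```javascript
// Add newly fetched post ids to the unread array of every subscriber.
// `Unread` is assumed to be a Mongoose model for the Unread documents above.
function fanOutNewPosts(Unread, blogId, newPostIds) {
  return Unread.updateMany(
    { blogId: blogId },                              // every user's document for this blog
    { $addToSet: { unread: { $each: newPostIds } } } // add each id at most once
  );
}
```

Even when it is issued as a single command like this, the server still has to modify one document per subscriber.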
Pros:
- Flat model
Cons:
- HUGE performance penalty (for each new article on a blog, we have to go and update the Array of unread items of every user who subscribes to that blog).
- Disk cost is at its maximum: for each blog a user subscribes to, we store an Array of 1000+ ObjectIds, which will kill our storage costs.
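A back-of-the-envelope sketch of both costs. The only hard fact used is that an ObjectId is 12 bytes; the function names and the figure of 500 subscribers are hypothetical, and the storage number is a lower bound because it ignores BSON array overhead and the rest of each document:

```javascript
const OBJECT_ID_BYTES = 12; // size of a MongoDB ObjectId value

// Lower bound on unread-array storage for one user.
function minUnreadBytesPerUser(subscriptions, idsPerBlog) {
  return subscriptions * idsPerBlog * OBJECT_ID_BYTES;
}

// Array entries written when a blog publishes: one per subscriber per article.
function unreadArrayPushes(subscribers, newArticles) {
  return subscribers * newArticles;
}

console.log(minUnreadBytesPerUser(30, 1000)); // 360000
console.log(unreadArrayPushes(500, 10));      // 5000
```

So a user with 30 subscriptions and 1000 unread ids per blog costs at least 360 KB of ObjectIds alone.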
I've run out of time, but I will definitely cover the rest of this in a follow up blog post! Stay tuned.





