Data Expiration With MongoDB

in #mongodb, 3 years ago

Around 2010, four years before I recorded the video MongoDB: Capped Collections and Data Expiration, I predicted that the rise of social media would eventually push people toward platforms that favor small data over large data. I made this prediction because people were only beginning to see how social media was using their data. As more people became familiar with this, platforms that store little data would eventually become popular. Snapchat was released a year later, though it wasn't popular at the time. It has since become immensely popular with the younger generation, and one reason is that younger people like a "non-history" of activities.

That's one example of why we might want this technical solution. MongoDB provides us with great tools for privacy: capped collections and data expiration. Not only do these features help us minimize the size of our environment (saving us resources), they can also be used for tools where we want records kept only for a period of time and then removed automatically.
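
As a rough illustration of the two features, the sketch below builds the command documents that create a capped collection and a TTL (time-to-live) index. The `create` and `createIndexes` commands and their `capped`, `size`, `max`, and `expireAfterSeconds` fields are MongoDB's own; the collection names, field name, sizes, and helper functions are hypothetical and are not taken from the video.

```python
from datetime import timedelta
from typing import Optional


def capped_collection_command(name: str, size_bytes: int,
                              max_docs: Optional[int] = None) -> dict:
    """Build a `create` command document for a capped collection.

    `size` is the maximum size in bytes; `max` optionally limits the
    number of documents as well.
    """
    command = {"create": name, "capped": True, "size": size_bytes}
    if max_docs is not None:
        command["max"] = max_docs
    return command


def ttl_index_command(collection: str, date_field: str,
                      keep_for: timedelta) -> dict:
    """Build a `createIndexes` command document for a TTL index.

    Documents become eligible for removal once the value in `date_field`
    is older than `expireAfterSeconds`.
    """
    return {
        "createIndexes": collection,
        "indexes": [
            {
                "key": {date_field: 1},
                "name": f"{date_field}_ttl",  # hypothetical index name
                "expireAfterSeconds": int(keep_for.total_seconds()),
            }
        ],
    }


# Hypothetical examples: a 1 MiB message buffer and 30-day sessions.
capped = capped_collection_command("recent_messages", 1024 * 1024, max_docs=1000)
ttl = ttl_index_command("sessions", "createdAt", timedelta(days=30))
print(capped)
print(ttl)
```

Assuming a driver such as PyMongo and a running server, each dictionary could be passed to the driver's database command method. The two features differ in what they bound: a capped collection limits size and overwrites its oldest documents, while a TTL index removes documents by age.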

Some questions that are answered in the video:

  • What is one reason that we may use a capped collection or use the feature of data expiration mentioned in the video?
  • What is another use case that you can think of based on the example?
  • In the example video, what is one technique we use to verify that MongoDB "expired" the data?
  • What is another way we can verify that data are not stored?
  • What do I note about the performance of a capped collection and where might this benefit us?
  • What is the final point I make about optionality and how could we use this in our architecture?

As a related data note, storing history can often be more distracting than accurate for predictions (prediction being the usual rationalization for storing historical data). In addition, the costs may not offset the benefits. For instance, the cost of a compromise involving a person's stored data may be significantly higher than the cost of never storing that data. In my years of research and predictions involving people, I've rarely used more than 2,000-5,000 data points to make my predictions and they've been extremely accurate. None of these research points involved storing people's private information. The point I'm ultimately making here is that storing many data points doesn't necessarily mean better research or predictive accuracy; in fact, the opposite may be true. MongoDB capped collections are one way of enforcing this automatically.

We should consider one security point here with capped collections. In general, the security of a capped collection will be stronger, because a hacker at any given moment can only get access to the present data. If our capped collection, as an example, only stored a person's healthcare data for the current month, only that period of time would be leaked in a compromise. However, if a hacker remains in our system for a period of time and monitors our data for a later leak, they may be able to get access to more of our data. This is also a challenge for them, because the older data aren't present in our system, which forces them to find a way to extract and store the data outside our system. By contrast, in a database where we're storing the full history, the attacker can devise a plan to extract the data later. Still, we should be extremely careful about assuming that a capped collection will limit a hacker's ability to leak data. We can use some technical features to possibly disrupt attackers here, which makes this a huge challenge for them. MongoDB's feature here does offer some significant security advantages for some use cases over recording historic data.

Privacy and security are two use cases of capped collections and data expiration, but there are numerous other use cases. We may want to use capped collections for daily documents, where we only want the last day's or last week's documents. Another use case is data restriction in general: we may have a limited environment and want to enforce that limit by forcing some data to become irrelevant in time. To the earlier point about data and research, historic data is not always important for us to retain, so provided that we know our use case, allowing data to expire may be a useful feature to consider.
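
To make the "last week's documents" idea concrete, here is a small pure-Python model of the two retention rules. It does not talk to MongoDB: a list filter stands in for time-based expiry and a `deque` with `maxlen` stands in for a capped collection limited by document count. The report documents, field names, and the helper function are hypothetical, and a real capped collection is sized in bytes, which this model ignores.

```python
from collections import deque
from datetime import datetime, timedelta, timezone


def surviving_documents(docs, date_field, keep_for, now):
    """Return the documents a time-based expiry rule would still keep at `now`."""
    cutoff = now - keep_for
    return [doc for doc in docs if doc[date_field] > cutoff]


# Hypothetical daily reports: day 0 is today, day 9 is nine days old.
now = datetime(2024, 1, 31, tzinfo=timezone.utc)
reports = [{"day": d, "createdAt": now - timedelta(days=d)} for d in range(10)]

# Time-based rule: keep only reports newer than seven days.
kept_by_age = surviving_documents(reports, "createdAt", timedelta(days=7), now)

# Count-based rule: a buffer of three documents drops the oldest on insert.
capped = deque(maxlen=3)
for doc in reversed(reports):  # insert oldest first, newest last
    capped.append(doc)
kept_by_count = [doc["day"] for doc in capped]

print([doc["day"] for doc in kept_by_age])
print(kept_by_count)
```

In a real deployment MongoDB removes expired documents with a background task, so a document can remain visible for a short time after it becomes eligible for removal. Checks that verify expiry should allow for that delay.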

A quick note for anyone who tries to imply that privacy means people are doing something wrong (for example, a well-known former CEO of a major tech company): every person who argues against privacy has locks on their door, passwords on their accounts, and associates with groups of people privately. This is not because they're doing something wrong, but because there's security with privacy. If you had no password on your email, any number of hackers could compromise you. The same applies if you didn't have locks on your door. The feature of privacy, which we can charge customers extra for if it's a big value and it costs us to provide, significantly strengthens customers' security. This isn't always easy to develop, though the solution above costs less than the alternative. Giving people a choice has many advantages, and as we continue to see data leaks (and more are coming), people will see the value in companies storing very little of their data.
