MattockFS; Computer-Forensics File-System : Part Eight (last installment)

pibara (60)in #forensics • 7 years ago

This post is the last in an eight-part series regarding the MattockFS Computer-Forensics File-System. This series of posts is based on the MattockFS workshop that I gave at the Digital Forensics Research Workshop in Überlingen, Germany earlier this year.

If you missed any of the previous installments, they are available here:

Today I want to look at MattockFS from three perspectives that I haven't touched on in the previous installments. These perspectives are:

MattockFS as an open source project.
MattockFS as a building block for a secure, scalable, distributed computer forensic framework architecture.
MattockFS as a generic building block for secure and robust scalable distributed data processing architectures.

Currently, MattockFS is a one-man open source project with me as the only developer. MattockFS started as a proof of concept during my UCD FCCI M.Sc research project. You can find my minor thesis here. After my research project ended, finishing MattockFS became my personal open source pet-project. While I had the opportunity to allocate substantial time to the project during my research, at the moment I need to divide my limited spare time between MattockFS, other open-source projects including RumpelstiltskinFS, RAMCoin and the Croupierbot, being a dad, writing speculative fiction, trying to keep up with nutritional science and being an amateur power-lifter. It is fair to say that progress has slowed down a bit since I finished my studies, and I could really use a hand getting things out of beta.

The above is a list of things that still need work.

Most importantly, there are currently still three core features that are needed for the Python reference implementation of MattockFS. Once these three features are completed and thoroughly tested, the current implementation of MattockFS could finally move out of beta. The first of these features is one that I could really use some extra eyeballs on. I've been postponing activating the restore-from-journal code because the existing restore-from-journal code is broken and I've been unable to pinpoint the source of the problems. So if you are skilled at Python and have a few hours to spare, your eyeballs would be extremely welcome on this. The two other features that are currently still missing are secondary opportunistic hashing and the ability to temporarily quarantine data entities that are suspected of having made a module crash. Once all three of these features are complete, MattockFS should be considered functionally complete.

MattockFS currently only has language bindings for Python. I will start working on C++ language bindings as soon as MattockFS is functionally complete, but language bindings for other languages should be easy enough to implement. So if you have a favorite language for digital forensic tool development that isn't Python or C++, please consider writing MattockFS language bindings for that language.

Given my own limited time resources, I am unlikely to have substantial time to work on either a full-fledged module library or a kickstarting and load balancing mashup on top of MattockFS. I'll be discussing both these components in today's blog post and I hope this post might inspire you to pick up a project like that as a personal pet project. Note that while I won't have much time to work on your project, I will give priority to any MattockFS related queries to help you make your subsystem interact with MattockFS, both in terms of support and if needed bug fixes.

Finally, I hope to eventually find the time to do a full port of MattockFS to C++, with improved system performance as the main goal.

So enough about MattockFS as an open-source project. Let us look at how MattockFS fits into a potentially distributed computer forensic architecture for medium to large-scale investigations. The above diagram shows the five main components that a MattockFS based asynchronous computer forensic framework will consist of. MattockFS combined with a forensic module framework makes up what could be a single node computer forensic framework setup. The mentioned framework, depending on the language and concurrency model chosen, could potentially be nothing more than a higher-level API on top of the low-level API language bindings we discussed earlier.

While a single node architecture is fun for research purposes or small-scale investigations, it really doesn't scale to medium or large-scale computer forensics investigations involving hundreds of forensic disk images spanning many dozens of terabytes of unique data. In order to facilitate such investigations, we need to support multi-node setups while still respecting a locality-of-data based approach with respect to migration of tool-chains to other nodes. Two components are needed to let the setup work in a distributed setting, and a third component needs to be made aware of distributed processing.

MattockFS currently supports a trivial form of an NFS based storage mash-up. It is likely that future versions of MattockFS will need to be made to fit different types of storage mash-up solutions, for example using erasure coding based distributed data archives. MattockFS provides hooks that allow a peer-to-peer load-balancing mash-up network to migrate the tool-chains of CPU intensive jobs to less heavily loaded nodes. Finally, a networked kickstart able to interact with the load-balancing mash-up should be able to kickstart to the most appropriate node. That is, the node with the lowest amount of currently hot data in comparison to the amount of RAM available for page-cache purposes on the node.
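As a rough illustration of that node selection rule, here is a minimal Python sketch. The `NodeStatus` record and its field names are purely hypothetical; how a real load-balancing mash-up would collect and exchange these numbers is not defined here.

```python
from dataclasses import dataclass


@dataclass
class NodeStatus:
    """Hypothetical per-node snapshot, as a load-balancing mash-up might report it."""
    name: str
    hot_data_bytes: int    # data currently considered hot on the node
    page_cache_bytes: int  # RAM available for page-cache purposes


def hot_ratio(node):
    """Hot data relative to page-cache capacity; lower means more headroom."""
    if node.page_cache_bytes <= 0:
        return float("inf")  # a node without page-cache room is never preferred
    return node.hot_data_bytes / node.page_cache_bytes


def pick_kickstart_node(nodes):
    """Return the node with the lowest hot-data to page-cache ratio."""
    if not nodes:
        raise ValueError("no nodes available")
    return min(nodes, key=hot_ratio)


GIB = 1024 ** 3
nodes = [
    NodeStatus("node-a", hot_data_bytes=12 * GIB, page_cache_bytes=16 * GIB),
    NodeStatus("node-b", hot_data_bytes=20 * GIB, page_cache_bytes=64 * GIB),
    NodeStatus("node-c", hot_data_bytes=4 * GIB, page_cache_bytes=8 * GIB),
]
best = pick_kickstart_node(nodes)
print(best.name)
```

Note that node-b is picked even though it holds the most hot data in absolute terms; what counts is the hot data relative to the RAM available for the page-cache.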

By default, MattockFS will start a number of instances on startup, each with its own mount-point. This feature serves two purposes. The primary reason for starting multiple instances of MattockFS is to accommodate the storage mash-up. It is suggested that the mount points are set to map to local storage for the 0 mountpoint and to a different remote NFS mount for all others. Secondly, for the current Python implementation, this feature allows multi-core systems to allocate more CPUs so that the user-space file-system does not become a single-core bottleneck. Once we swap to C++ and the multi-core implementation of BLAKE2, this should become a non-issue though.

Now with respect to kickstarting data into a distributed setup, the above considerations should be considered pivotal.

Now for the core of the load balancing setup: the load-balancing peer-to-peer mash-up. MattockFS provides hooks for load balancing modules to steal jobs from the queues of other modules. It is important to note that migrating toolchains to other nodes will create some difficulties with fitting the chain of evidence back together from the different provenance logs on the different nodes. Basically, the data chunk, together with the old toolchain's router state, gets used to instantiate a whole new toolchain on the remote node, while the toolchain is closed on its original node.

Now for the forensic module framework or the forensic module framework library. This is a framework or a library for running actual computer forensic modules on top of the MattockFS language bindings. It is important to note that this framework or this library will likely need a different implementation for each computer programming language that we want to support.

Here we have a rough outline of how the forensic module framework will be built up internally. I'll discuss each of the core components in some detail.

First, what is a magic library and why would we need one? In the Open Computer Forensics Architecture (OCFA), there was an important module called the file module. A module that would attempt to determine the type of a file or data chunk using header information stored in a so-called magic database. In my analysis of OCFA message flow and performance, it turned out that a whole lot of data got routed to the file module, only to be discarded immediately if the file was of a type that would be of little interest to the investigation at hand. To avoid spurious messaging, integration of the magic file checking functionality into the core forensic module framework is an important measure.
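To make the idea concrete, here is a minimal Python sketch of an in-framework magic check that decides whether a data chunk is worth routing at all. The four-entry signature table and the `should_route` helper are illustrative assumptions; an actual framework would use a full magic database.

```python
# Illustrative signature table; a real framework would use a full magic database.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"%PDF-", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),
    (b"\xff\xd8\xff", "image/jpeg"),
]


def sniff_type(header: bytes) -> str:
    """Guess a MIME type from the first bytes of a file or data chunk."""
    for magic, mimetype in SIGNATURES:
        if header.startswith(magic):
            return mimetype
    return "application/octet-stream"


def should_route(header: bytes, uninteresting: set) -> bool:
    """Hypothetical policy check: skip messaging for types nobody cares about."""
    return sniff_type(header) not in uninteresting


# Example policy for an investigation where PNG images are of no interest.
uninteresting = {"image/png"}
print(sniff_type(b"%PDF-1.7\n"))
print(should_route(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8, uninteresting))
```

Because the check runs inside the module framework itself, a chunk of an uninteresting type never results in a message to a separate file-type module.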

The impact of data serialization technology on system performance should not be underestimated. In OCFA, the use of XML and XSLT turned out to have quite some performance costs. The serialization chosen should be fast, have a limited memory footprint and should not be programming language dependent. That is, the serialization format used by the forensic module framework should be the same for all supported languages.

Now we get to one of the most important and one of the most challenging aspects of the forensic module framework: the concept of distributed router logic. For anyone doing an M.Sc in computer forensics and looking for an interesting research topic for a dissertation, this is where you might want to start paying extra attention. In the Open Computer Forensics Architecture, the router was a central architectural process or set of processes. An important consequence of this design was that the number of messages needed to route a job from one actor to the next would be double the number that would be needed if the router logic had been distributed between all the modules. That is, every module forwarded the job it had just processed and any child toolchain jobs to the central router, and it was that central router that would parse the new meta-data and figure out where the job should be routed to next.
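The factor of two can be shown with a small piece of arithmetic. The sketch below uses a deliberately simplified model that only counts module-to-module hops of a single job and ignores child toolchain jobs; the function name is made up for this illustration.

```python
def messages_needed(hops: int, central_router: bool) -> int:
    """Count the messages needed to move one job through `hops` routing steps.

    With a central router, every step costs two messages (module to router,
    router to next module); with distributed routing it costs one.
    """
    if hops < 0:
        raise ValueError("hops must not be negative")
    per_hop = 2 if central_router else 1
    return hops * per_hop


hops = 5
central = messages_needed(hops, central_router=True)
distributed = messages_needed(hops, central_router=False)
print(central, distributed)
```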

The first OCFA router was a stateless router modelled after IPTABLES. That all changed when the FIVES project, a project that was built around OCFA, introduced an alternative router that used state piggybacked on the tool-chain meta-data in order to make the router logic stateful in a way that still allowed the parallelisation provided by the OCFA framework to do its job. Conceptually, this FIVES approach to routing state maps perfectly to the concept of a fully distributed router as part of the forensic module framework. To accommodate this, the MattockFS API provides room in its messaging for limited-size per-toolchain distributed router state.
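What a routing step with piggybacked state could look like is sketched below in Python. Everything in it is an assumption made for illustration: the module names, the rule table, the hop-count state and the size limit are invented, and this is neither the FIVES router nor the MattockFS API.

```python
MAX_STATE_BYTES = 16  # assumed limit; the real limit is defined by MattockFS
MAX_HOPS = 8          # hypothetical cap on toolchain length

# Hypothetical rule table: MIME type prefix -> name of the next module.
RULES = [
    ("application/zip", "unzip"),
    ("image/", "exif"),
    ("text/", "indexer"),
]


def route(meta: dict, state: str):
    """Pick the next module from fresh meta-data plus the piggybacked state.

    The state travels with the toolchain, so every module can run this
    function locally and no central router process is needed.
    """
    hops = int(state) if state else 0
    if hops >= MAX_HOPS:
        return None, state  # toolchain ends here
    mimetype = meta.get("mimetype", "application/octet-stream")
    for prefix, module in RULES:
        if mimetype.startswith(prefix):
            new_state = str(hops + 1)
            if len(new_state.encode("utf-8")) > MAX_STATE_BYTES:
                raise ValueError("router state exceeds the size limit")
            return module, new_state
    return None, state  # no rule matched; toolchain is done


print(route({"mimetype": "application/zip"}, ""))
print(route({"mimetype": "image/jpeg"}, "1"))
```

The function takes only the new meta-data and the state that came along with the job, which is what allows each module to run it without consulting anything central.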

It is important to realize that distributed routing is not optional when using MattockFS. Distributed routing is an intrinsic part of the MattockFS approach to its local message-bus design and load balancing hooks. The forensic module framework must implement distributed routing logic, and how to set this up efficiently should currently be considered an important subject for further study. A great subject, I believe, for someone looking for an interesting M.Sc research topic.

While the language bindings provide a low-level API, this API should not be considered suitable for regular framework modules. The forensic module framework should provide a simple module API to workers. OCFA started out with a trivial API that, while suitable for the simplest of modules, was insufficient for more advanced modules. OCFA later added an alternative tree-graph oriented API that, while still simple, allowed for far more powerful data and meta-data extraction. Old API modules in OCFA were never ported, but in retrospect OCFA would have been more consistent if only the tree-graph based API had been used. I thus would like to suggest that the API for forensic module workers should be based on a tree-graph oriented API, not unlike the tree-graph OCFA API.
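To give a feel for what tree-graph oriented means from a worker's point of view, here is a small Python sketch. The `EvidenceNode` class and its methods are hypothetical and are not modelled on the actual OCFA API; the point is only that a worker hands back a tree of derived data with meta-data on every node.

```python
from dataclasses import dataclass, field


@dataclass
class EvidenceNode:
    """Hypothetical tree-graph node: a piece of data, its meta-data, its children."""
    name: str
    meta: dict = field(default_factory=dict)
    data: bytes = b""
    children: list = field(default_factory=list)

    def add_child(self, name, data=b"", **meta):
        """Attach derived data (for example an attachment) and return its node."""
        child = EvidenceNode(name=name, meta=dict(meta), data=data)
        self.children.append(child)
        return child

    def walk(self, depth=0):
        """Yield (depth, node) pairs in depth-first order."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


# A worker processing a mail message could hand back a tree like this one.
root = EvidenceNode("message.eml", meta={"mimetype": "message/rfc822"})
archive = root.add_child("report.zip", mimetype="application/zip")
archive.add_child("report.pdf", mimetype="application/pdf")

names = [(depth, node.name) for depth, node in root.walk()]
print(names)
```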

One final, but very much essential piece of functionality that the forensic module framework will need to implement is throttling. While MattockFS provides hooks for querying if throttling might be needed at a particular point in time, MattockFS doesn't do any throttling itself. The forensic module framework should combine the info provided by MattockFS with the information provided by the Linux kernel in order to implement a sane throttling policy for submitting new hot data to the MattockFS archive and message bus.
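A very simple throttling policy along these lines is sketched below in Python. The kernel side uses the `MemAvailable` line from the `/proc/meminfo` format; the MattockFS side is reduced to a plain `hot_bytes` number, since the actual query hooks are not shown in this post. The 50% threshold is an arbitrary assumption, not a recommendation.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo style text into a dict of integer values (mostly kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def should_throttle(hot_bytes: int, meminfo_text: str, max_fraction: float = 0.5) -> bool:
    """Hold back new submissions when hot data outgrows the available memory.

    hot_bytes stands in for whatever MattockFS reports about currently hot
    data; max_fraction is an assumed tuning knob.
    """
    available_kb = parse_meminfo(meminfo_text).get("MemAvailable")
    if available_kb is None:
        return True  # no kernel info: be conservative
    return hot_bytes > max_fraction * available_kb * 1024


SAMPLE = """MemTotal:       16777216 kB
MemFree:         1048576 kB
MemAvailable:    8388608 kB
"""
GIB = 1024 ** 3
print(should_throttle(3 * GIB, SAMPLE))
print(should_throttle(5 * GIB, SAMPLE))
```

On a live Linux system, the text would come from reading `/proc/meminfo` instead of the sample string used here.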

Anyone wanting to implement a forensic module framework on top of MattockFS should consider first forking and renaming The Mediocre Forensic Module Framework (MFMF). This Python code-base should provide something of a basis for the further development of an actual framework. MFMF does NOT implement any throttling or stateful routing, but apart from that has most of what an actual forensic module framework should have.

Please note that in this series of posts, MattockFS has been discussed only in the context of computer forensics. MattockFS, however, is completely agnostic with respect to the data it is used to archive and refer to in its messaging setup. MattockFS, especially when combined with Mandatory Access Controls, provides a relatively generic high-integrity message bus solution for data-intensive message-passing concurrency with limited mutability. The potential for usability outside of a computer forensics setting should be further explored, and I am very much open to feature requests that might seriously contribute to making MattockFS usable in non-forensic setups.

I hope you have enjoyed this series of posts, and if you have, please consider if (and how) you could contribute to the open source development of MattockFS, additional language bindings for MattockFS, a forensic module framework as outlined in this last post, a better storage mash-up, or a load balancing and networked kickstarting mash-up.

I feel strongly that MattockFS could contribute a lot to a new generation of data-intensive asynchronous frameworks with serious system and data integrity requirements, most notably in the field of computer forensics, but potentially also in other fields where these properties are important.

Please comment below if you have any feedback you would like to share with me, or if you have any questions.

#tutorial #technology #security #dfrws