Tool for creating / distributing / maintaining symlink farms.
FarmFS is still very early-stage software.
Please do not keep anything in it that you are not willing to lose.
pip install git+https://github.com/andrewguy9/farmfs.git@master
git clone https://github.com/andrewguy9/farmfs.git
cd farmfs
python setup.py install
FarmFS
Usage:
farmfs mkfs
farmfs (status|freeze|thaw) [<path>...]
farmfs snap (make|list|read|delete|restore) <snap>
farmfs fsck
farmfs count
farmfs similarity
farmfs gc
farmfs checksum <path>...
farmfs remote add <remote> <root>
farmfs remote remove <remote>
farmfs remote list
farmfs pull <remote> [<snap>]
Options:
Farmfs is a git-style interface to non-text, usually immutable, sometimes large files. It puts your files into an immutable blob store, then builds symlinks from the file names into the store.
- You can snapshot your directory structure in BIG_O(num_files) time.
- You can diff two different farmfs stores in BIG_O(num_files) rather than BIG_O(sum(file_sizes)).
- You can identify corruption of your files because all entries in the blob store are checksummed.
- If the same file contents appear in multiple places, they only have to be put in the blob store once (deduplication).
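The corruption check in the list above follows from content addressing: a blob is named after the checksum of its content, so re-hashing a blob and comparing the result with its name reveals damage. A minimal sketch of the idea, assuming MD5 as the checksum and a flat store directory; `checksum` and `find_corrupt_blobs` are hypothetical helpers written for illustration, not farmfs functions:

```python
import hashlib
import os
import tempfile


def checksum(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_corrupt_blobs(store_dir):
    """Return names of blobs whose content no longer hashes to their name."""
    return sorted(
        name for name in os.listdir(store_dir)
        if checksum(os.path.join(store_dir, name)) != name
    )


# Demo: store two blobs under their checksums, then damage one of them.
store = tempfile.mkdtemp()
names = []
for content in (b"value1\n", b"value2\n"):
    name = hashlib.md5(content).hexdigest()
    with open(os.path.join(store, name), "wb") as handle:
        handle.write(content)
    names.append(name)

clean_report = find_corrupt_blobs(store)      # nothing damaged yet
with open(os.path.join(store, names[0]), "wb") as handle:
    handle.write(b"bit rot\n")                # simulate corruption
damaged_report = find_corrupt_blobs(store)    # lists the damaged blob
```

The check has to read every blob, so unlike a snapshot or diff its cost grows with the total size of the stored data.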
Create a Farmfs store
mkdir myfarm
cd myfarm
farmfs mkfs
Make some files
mkdir -p 1/2/3/4/5
mkdir -p a/b/c/d/e
echo "value1" > 1/2/3/4/5/v1
echo "value1" > a/b/c/d/e/v1
Status can show us unmanaged files.
farmfs status
/Users/andrewguy9/Downloads/readme/1/2/3/4/5/v1
/Users/andrewguy9/Downloads/readme/a/b/c/d/e/v1
Add the untracked files to the blob store. Notice it only needs to store "value1" once.
farmfs freeze
Processing /Users/andrewguy9/Downloads/readme/1/2/3/4/5/v1 with csum /Users/andrewguy9/Downloads/readme/.farmfs/userdata
Putting link at /Users/andrewguy9/Downloads/readme/.farmfs/userdata/238/851/a91/77b60af767ca431ed521e55
Processing /Users/andrewguy9/Downloads/readme/a/b/c/d/e/v1 with csum /Users/andrewguy9/Downloads/readme/.farmfs/userdata
Found a copy of file already in userdata, skipping copy
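The deduplication seen in the output above can be sketched as follows. This is a simplified illustration rather than farmfs's own code: it assumes MD5 checksums, uses a flat blob directory instead of the sharded `userdata` layout, and `freeze` here is a hypothetical helper that reports whether a new blob had to be stored.

```python
import hashlib
import os
import shutil
import tempfile


def freeze(path, store_dir):
    """Move a file into the blob store and leave a symlink in its place.

    Returns True if a new blob was stored, False if the same content was
    already in the store (the duplicate case).
    """
    with open(path, "rb") as handle:
        csum = hashlib.md5(handle.read()).hexdigest()
    blob = os.path.join(store_dir, csum)
    stored = not os.path.exists(blob)
    if stored:
        shutil.move(path, blob)   # first copy: the file becomes the blob
    else:
        os.remove(path)           # duplicate: the blob already exists
    os.symlink(blob, path)        # the name now points into the store
    return stored


# Demo: two files with identical content end up sharing one blob.
root = tempfile.mkdtemp()
store = os.path.join(root, "store")
os.makedirs(store)
first = os.path.join(root, "1", "v1")
second = os.path.join(root, "a", "v1")
for path in (first, second):
    os.makedirs(os.path.dirname(path))
    with open(path, "w") as handle:
        handle.write("value1\n")

stored_first = freeze(first, store)    # True: blob copied into the store
stored_second = freeze(second, store)  # False: copy skipped
```

Both names remain readable through their symlinks, while the store holds a single blob.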
Edit a file. First we need to thaw it, then we can change it.
farmfs thaw 1/2/3/4/5/v1
farmfs status
/Users/andrewguy9/Downloads/readme/1/2/3/4/5/v1
echo "value2" > 1/2/3/4/5/v1
farmfs freeze 1/2/3/4/5/v1
Processing /Users/andrewguy9/Downloads/readme/1/2/3/4/5/v1 with csum /Users/andrewguy9/Downloads/readme/.farmfs/userdata
Putting link at /Users/andrewguy9/Downloads/readme/.farmfs/userdata/4ca/8c5/ae5/e759e237bfb80c51940de7a
farmfs status
We don't want to lose our progress, so let's make a snapshot.
farmfs snap make mysnap
Now create more stuff
echo "oops" > mistake.txt
farmfs freeze mistake.txt
Processing /Users/andrewguy9/Downloads/readme/mistake.txt with csum /Users/andrewguy9/Downloads/readme/.farmfs/userdata
Putting link at /Users/andrewguy9/Downloads/readme/.farmfs/userdata/38a/f5c/549/26b620264ab1501150cf189
Well, that was a mistake, so let's roll back to the old snap.
farmfs snap restore mysnap
Removing /mistake.txt
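Restoring a snapshot only has to compare names and checksums, never file contents, which is where the BIG_O(num_files) cost mentioned earlier comes from. A rough sketch, assuming a snapshot can be modelled as a dict mapping each path to its checksum (the real on-disk format is not described here); `snapshot_delta` is a hypothetical helper and the checksums are shortened placeholders:

```python
def snapshot_delta(current, target):
    """Return (to_remove, to_link) needed to turn `current` into `target`.

    Both arguments map a path to the checksum it links to. Each path is
    visited once, so the cost grows with the number of files, not their size.
    """
    to_remove = sorted(path for path in current if path not in target)
    to_link = {
        path: csum
        for path, csum in target.items()
        if current.get(path) != csum
    }
    return to_remove, to_link


# The snapshot taken before the mistake, and the tree after it.
mysnap = {"/1/2/3/4/5/v1": "4ca8c5", "/a/b/c/d/e/v1": "238851"}
tree = dict(mysnap)
tree["/mistake.txt"] = "38af5c"

to_remove, to_link = snapshot_delta(tree, mysnap)
print(to_remove)  # ['/mistake.txt']
print(to_link)    # {}
```

The same comparison, run in the other direction, would report the paths a snapshot is missing.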
Now that we have our files built, let's build another depot.
cd ..
mkdir copy
cd copy
farmfs mkfs
We want to add our prior depot as a remote.
farmfs remote add origin ../myfarm
Now let's copy our work from before.
farmfs pull origin
mkdir /1
mkdir /1/2
mkdir /1/2/3
mkdir /1/2/3/4
mkdir /1/2/3/4/5
mklink /1/2/3/4/5/v1 -> /4ca/8c5/ae5/e759e237bfb80c51940de7a
Blob missing from local, copying
*** /Users/andrewguy9/Downloads/copy/.farmfs/userdata/4ca/8c5/ae5/e759e237bfb80c51940de7a /Users/andrewguy9/Downloads/myfarm/.farmfs/userdata/4ca/8c5/ae5/e759e237bfb80c51940de7a
mkdir /a
mkdir /a/b
mkdir /a/b/c
mkdir /a/b/c/d
mkdir /a/b/c/d/e
mklink /a/b/c/d/e/v1 -> /238/851/a91/77b60af767ca431ed521e55
Blob missing from local, copying
*** /Users/andrewguy9/Downloads/copy/.farmfs/userdata/238/851/a91/77b60af767ca431ed521e55 /Users/andrewguy9/Downloads/myfarm/.farmfs/userdata/238/851/a91/77b60af767ca431ed521e55
Let's see what's in our new depot:
find *
1
1/2
1/2/3
1/2/3/4
1/2/3/4/5
1/2/3/4/5/v1
a
a/b
a/b/c
a/b/c/d
a/b/c/d/e
a/b/c/d/e/v1
Regression tests can be run with pytest.
Tests are kept in the tests directory, which pytest detects automatically.
Performance test cases are stored under the perf directory. These are useful for making development decisions but are not generally useful as ongoing tests.
These tests can be run using pytest or tox.
pytest:
To run a particular trial:
pytest -s perf/your_test.py [-k case_pattern]
Note that -s is required to get a printout of the results.
Example: pytest -s perf/transducer.py -k transducers
tox:
To run a pattern in a particular environment:
tox -e [envs] -- [-k case_pattern]
Available envs are
{py37,py39,pypy,pypy3}-perf
Example: tox -e py37-perf,py39-perf -- -k transducers
farmfs comes with a useful debugging tool, farmdbg.
farmdbg
Usage:
farmdbg reverse <csum>
farmdbg key read <key>
farmdbg key write <key> <value>
farmdbg key delete <key>
farmdbg key list [<key>]
farmdbg walk (keys|userdata|root|snap <snapshot>)
farmdbg checksum <path>...
farmdbg fix link <file> <target>
farmdbg rewrite-links <target>
farmdbg can be used to dump parts of the keystore or blobstore, as well as walk and repair links.
Compose has less function call overhead than pipeline because it flattens the call chain, so there are fewer wrapper functions.
cincs = compose(*incs)
timeit(lambda: cincs(0))
0.45056812500001797
pincs = pipeline(*incs)
timeit(lambda: pincs(0))
0.8594365409999227
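To illustrate why a flattened call chain can be cheaper, here is a sketch of two ways to chain functions. `compose_flat` and `pipeline_nested` are hypothetical stand-ins written for this example, not the implementations shipped with farmfs, and the five-element `incs` list is likewise an assumption. Both apply their functions left to right.

```python
from timeit import timeit


def inc(x):
    return x + 1


def compose_flat(*funcs):
    """Chain functions with a single wrapper that loops over all of them."""
    def composed(value):
        for func in funcs:
            value = func(value)
        return value
    return composed


def pipeline_nested(*funcs):
    """Chain functions recursively: one extra wrapper per link in the chain."""
    if not funcs:
        return lambda value: value
    head, rest = funcs[0], funcs[1:]
    if not rest:
        return head
    next_hop = pipeline_nested(*rest)
    return lambda value: next_hop(head(value))


incs = [inc] * 5
flat = compose_flat(*incs)
nested = pipeline_nested(*incs)

print(flat(0), nested(0))  # both chains compute the same value
print(timeit(lambda: flat(0), number=100000))
print(timeit(lambda: nested(0), number=100000))
```

With N functions, the flat version adds one wrapper call per invocation while the nested version adds N - 1. Timings depend on the machine and interpreter, so none are quoted for this sketch.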
When dealing with chained iterators, pipeline and compose have the same performance. Pulling from an iterator is faster than mixing in composed function calls, even with fmap overhead.
csum = compose(fmap(inc), fmap(inc), fmap(inc), sum)
timeit(lambda: csum(range(1000)), number=10000)
1.2722054580000304
csum2 = compose(fmap(compose(inc, inc, inc)), sum)
timeit(lambda: csum2(range(1000)), number=10000)
2.0529240829999935
psum = pipeline(fmap(inc), fmap(inc), fmap(inc), sum)
timeit(lambda: psum(range(1000)), number=10000)
1.273805500000094
psum2 = pipeline(fmap(pipeline(inc, inc, inc)), sum)
timeit(lambda: psum2(range(1000)), number=10000)
2.7146950840000272
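The two shapes being compared above are chained iterators (each stage pulls from the previous one) and a single iterator whose mapping function is itself a chain of calls. A small sketch of the difference, assuming `fmap` is a curried form of the built-in `map`; this `fmap` is a hypothetical stand-in, not the farmfs implementation:

```python
def inc(x):
    return x + 1


def fmap(func):
    """Return a function that lazily maps `func` over an iterable."""
    def mapped(iterable):
        return map(func, iterable)
    return mapped


# Chained iterators: three lazy map stages, each pulling from the previous.
chained = sum(fmap(inc)(fmap(inc)(fmap(inc)(range(1000)))))

# Fused function: one map stage whose function makes three nested calls.
fused = sum(fmap(lambda x: inc(inc(inc(x))))(range(1000)))

print(chained, fused)  # 502500 502500
```

Both forms produce the same result; they differ only in where the per-item call overhead is paid.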
farmfs is a pure Python program, and has support for pypy3.
However, pypy3 actually performs worse than CPython because farmfs uses iterators rather than loops, which negates the benefits of most of the JIT's optimizations. To improve performance, consider improvements to caching, IO parallelization, and reducing small string allocations.
python3.9.2
time farmfs snap make --force test_snap
real 0m2.387s
user 0m2.010s
sys 0m0.319s
time farmfs snap make --force test_snap
real 0m2.305s
user 0m1.991s
sys 0m0.312s
time farmfs snap make --force test_snap
real 0m2.258s
user 0m1.939s
sys 0m0.317s
pypy3
time farmfs snap make --force test_snap
real 0m6.363s
user 0m5.850s
sys 0m0.512s
time farmfs snap make --force test_snap
real 0m6.177s
user 0m5.730s
sys 0m0.449s
time farmfs snap make --force test_snap
real 0m6.201s
user 0m5.731s
sys 0m0.455s