15

Suppose I have a directory that contains 100K+ or even 500K+ files. I want to read the directory with fs.readdir, but it's async, not a stream. Someone told me that the async call holds the entire file list in memory before it finishes reading.

So what is the solution? I want to read the directory with a stream approach. Can I?

  • Before you believe people when they make those claims: have you tried? Also: a dir with 100k or 500k files is insane; you should not have data organised that way. You can't even rm that many files. Commented Sep 10, 2014 at 5:18
  • @Mike'Pomax'Kamermans, see the first answer: "I've just tested with 700K files in the dir. It takes only 21MB of memory to load this list of file names." What if I have 1M or 10 million files in the directory? Commented Sep 10, 2014 at 7:17
  • Your filesystem is not a database. A million files in a dir is insane; instead of finding a code solution, you need to first organise your data better, as a good practice. Commented Sep 10, 2014 at 16:47
  • @Mike'Pomax'Kamermans Stack Overflow (and the future community) will be a richer place if we assume that the directory structure is outside of his control. Commented May 19, 2016 at 7:57
  • @DomVinyard Not really - for generally applicable answers you assume a normal setup unless the person asking the question stipulates otherwise, either in their post, or in response to comments/answers, which means you assume people control all aspects of the technology they're talking about, unless otherwise indicated. Commented May 19, 2016 at 13:02

6 Answers

8

On modern computers, traversing a directory with 500K files is nothing. When you call fs.readdir asynchronously in Node.js, it just reads the list of file names in the specified directory; it doesn't read the files' contents. I've just tested with 700K files in the dir. It takes only 21MB of memory to load this list of file names.
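
If you want to reproduce that kind of measurement on your own machine, a rough sketch is to compare process.memoryUsage() before and after the call; the directory path below is just a placeholder:

var fs = require('fs');

// Rough sketch: how much heap does a plain fs.readdir listing take?
// '/path/with/many/files' is a placeholder; point it at your big directory.
var before = process.memoryUsage().heapUsed;

fs.readdir('/path/with/many/files', function (err, files) {
    if (err) throw err;
    var after = process.memoryUsage().heapUsed;
    console.log(files.length + ' names, ~' +
        ((after - before) / 1024 / 1024).toFixed(1) + ' MB of heap');
});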

Once you've loaded this list of file names, you just traverse them one by one, or in parallel with some concurrency limit, and you can easily consume them all. Example using the async library (a dependency-free sketch follows after it):

var async = require('async'),
    fs = require('fs'),
    path = require('path'),
    parentDir = '/home/user';

async.waterfall([
    function (cb) {
        fs.readdir(parentDir, cb);
    },
    function (files, cb) {
        // `files` is just an array of file names, not full paths.

        // Consume 10 files in parallel.
        async.eachLimit(files, 10, function (filename, done) {
            var filePath = path.join(parentDir, filename);

            // Do whatever you want with this file.
            // Then don't forget to call `done()`.
            done();
        }, cb);
    }
], function (err) {
    err && console.trace(err);

    console.log('Done');
});
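
If you would rather not depend on the async library, the same pattern (read the name list once, then consume with a concurrency limit) can be sketched with plain Promises. The concurrency value and the per-file stat() call below are just placeholders, not part of the original answer:

// A dependency-free sketch of the same idea (requires a Node version
// with fs.promises). CONCURRENCY and the stat() call are placeholders.
const fs = require('fs');
const path = require('path');

const parentDir = '/home/user';
const CONCURRENCY = 10;

async function processFile(filePath) {
    // Placeholder per-file work; replace with whatever you need.
    return fs.promises.stat(filePath);
}

async function run() {
    const files = await fs.promises.readdir(parentDir);

    // Simple worker pool: each worker keeps pulling the next index
    // until the list is exhausted, so at most CONCURRENCY files are
    // being processed at any moment.
    let next = 0;
    async function worker() {
        while (next < files.length) {
            const filePath = path.join(parentDir, files[next++]);
            await processFile(filePath);
        }
    }

    await Promise.all(Array.from({ length: CONCURRENCY }, () => worker()));
    console.log('Done');
}

run().catch(console.error);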

6 Comments

Yes, actually, I don't want to read the contents of the files, just list the file names in the dir. > "It takes only 21MB of memory to load this list of file names." That is the problem; I want to use a stream approach.
I would suggest changing the directory structure if you want to store millions of files. Alternatively, use a database of some kind. If you still want to stick to your original idea, I suggest you have a look at github.com/oleics/node-filewalker. It may provide what you're looking for. Under the hood it does the same thing, i.e. it reads the file listing of the entire directory into memory anyway. Unless you manually access the hard drive and try to read the directory listing block by block, there is no other way in Node.js to do this that I am aware of.
@Eye How does node-filewalker address the problem though?
Basically, node-filewalker wraps the fs module in an EventEmitter. It traverses the file system and asynchronously emits events for each directory and file. For the above case, node-filewalker can succeed equally well.
node-filewalker just hides the problem. It still loads the whole directory into an array, which is bad behavior. But the coming version of libuv has a stream version of readdir. That is a huge step toward the correct behavior.
8

Now there is a way to do it with async iteration! You can do:

import fs from 'node:fs'

const dir = fs.opendirSync('/tmp')

// The Dir object is an async iterable; entries are read lazily as you iterate.
for await (const file of dir) {
  console.log(file.name)
}

To turn it into a stream (a fuller, self-contained sketch follows below):


import util from 'node:util'
import { pipeline, Readable } from 'node:stream'

const _pipeline = util.promisify(pipeline)
await _pipeline([
  Readable.from(dir),
  ... // consume!
])
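
A fuller sketch, assuming Node 15+ for stream/promises (on older versions the util.promisify approach above works the same way); the Writable here is only a placeholder consumer that logs each entry name:

import fs from 'node:fs'
import { pipeline } from 'node:stream/promises'
import { Readable, Writable } from 'node:stream'

// Each chunk coming out of Readable.from(dir) is an fs.Dirent.
const dir = await fs.promises.opendir('/tmp')

await pipeline(
  Readable.from(dir),
  new Writable({
    objectMode: true,
    write(dirent, _encoding, callback) {
      console.log(dirent.name)
      callback()
    },
  })
)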

1 Comment

Unless I'm confused, doesn't opendirSync return all files in the directory synchronously like the name implies?
1

The more modern answer for this is to use opendir (added in v12.12.0) to iterate over each found file as it is found:

import { opendirSync } from "fs";

const dir = opendirSync("./files");
for await (const entry of dir) {
  console.log("Found file:", entry.name);
}

fsPromises.opendir / opendirSync return an instance of Dir, which is an async iterable that yields a Dirent (directory entry) for every file in the directory.

This is more efficient because it returns each file as it is found, rather than having to wait till all files are collected.
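
For reference, the fully asynchronous variant via fsPromises.opendir looks almost identical (same example directory as above):

import { opendir } from "node:fs/promises";

// opendir() resolves to the same lazily-iterated Dir object,
// but opens the directory handle asynchronously as well.
const dir = await opendir("./files");
for await (const entry of dir) {
  console.log("Found file:", entry.name);
}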


0

Here are two viable solutions:

  1. Async generators. You can use the fs.opendir function to create a Dir object, which has a Symbol.asyncIterator property.
import { opendir } from 'fs/promises';

// An async generator that accepts a directory name
const openDirGen = async function* (directory: string) {
    // Create a Dir object for that directory
    const dir = await opendir(directory);

    // Iterate through the items in the directory asynchronously
    for await (const file of dir) {
        // (yield whatever you want here)
        yield file.name;
    }
};

The usage of this is as follows:

for await (const name of openDirGen('./src')) {
    console.log(name);
}
  2. A Readable stream can be created using the async generator we created above.
// ...
import { Readable } from 'stream';

// ...

// A function accepting the directory name
const openDirStream = (directory: string) => {
    return new Readable({
        // Set encoding to utf-8 to get the names of the items in
        // the directory as utf-8 strings.
        encoding: 'utf-8',
        // Create a custom read method which is async, but works
        // because it doesn't need to be awaited, as Readable is
        // event-based anyways.
        async read() {
            // Asynchronously iterate through the items names in
            // the directory using the openDirGen generator.
            for await (const name of openDirGen(directory)) {
                // Push each name into the stream, emitting the
                // 'data' event each time.
                this.push(name);
            }
            // Once iteration is complete, manually destroy the stream.
            this.destroy();
        },
    });
};

You can use this the same way you'd use any other Readable stream:

const myDir = openDirStream('./src');

myDir.on('data', (name) => {
    // Logs the file name of each file in my './src' directory
    console.log(name);
    // You can do anything you want here, including actually reading
    // the file.
});

Both of these solutions will allow you to asynchronously iterate through the item names within a directory rather than pull them all into memory at once like fs.readdir does.

1 Comment

Just now realizing that I stupidly wrapped an async generator in an async generator. You don't actually need the openDirGen generator function.
0

The answer by @mstephen19 gave the right direction, but it drives an async generator from inside Readable.read(), which Readable does not support. If you try to turn opendirGen() into a recursive function, to recurse into directories, it does not work anymore.

Using Readable.from() is the solution here. The following is his solution adapted as such, with opendirGen() still not recursive (a recursive sketch follows after the example):

import { opendir }  from 'node:fs/promises';
import { Readable } from 'node:stream';

async function* opendirGen(dir) {
    for await ( const file of await opendir(dir) ) {
        yield file.name;
    }
}

Readable
    .from(opendirGen('/tmp'), {encoding: 'utf8'})
    .on('data', name => console.log(name));
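
As a sketch of the recursive case mentioned above (not part of the original answer), the generator can simply call itself for subdirectories and still be wrapped with Readable.from(); error handling is omitted:

import { opendir }  from 'node:fs/promises';
import { join }     from 'node:path';
import { Readable } from 'node:stream';

// Recursive variant: descend into subdirectories and yield full paths.
async function* walk(dir) {
    for await ( const entry of await opendir(dir) ) {
        const full = join(dir, entry.name);
        if (entry.isDirectory()) {
            yield* walk(full);
        } else {
            yield full;
        }
    }
}

Readable
    .from(walk('/tmp'))
    .on('data', name => console.log(name));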


-1

As of version 10, there is still no good solution for this. Node is just not that mature yet.

Modern filesystems can easily handle millions of files in a directory, and of course you can make good cases for it in large-scale operations, as you suggest.

The underlying C library iterates over the directory listing one entry at a time, as it should. But all Node implementations I have seen that claim to iterate use fs.readdir, which reads everything into memory as fast as it can.

As I understand it, you have to wait for a new version of libuv to be adopted into Node, and then for the maintainers to address this old issue. See the discussion at https://github.com/nodejs/node/issues/583

Some improvements will happen with version 12.

