implementation-over-my-head remote-content sniffer, thing

Discussion in 'OT Technology' started by piratepenguin, Feb 9, 2007.

  1. piratepenguin

    piratepenguin New Member

    Joined:
    Jun 18, 2006
    Messages:
    1,067
    Likes Received:
    0
    Location:
    Ireland
     As I've touched on before, I'm interested in setting up a BitTorrent index. The first big job is the indexer program, which will take a .torrent file as input and, among other things, should extract any interesting info about each p2p-distributed file as quickly as possible. Rules describing how to retrieve each piece of info will have been written and installed beforehand, so that the index frontend (the website) can present the info to the user (and no, it won't all simply be dumped on the user to not understand, for the record).

     Beyond basic and not-so-basic filetype and metadata info (which will include, but certainly won't be limited to, required codecs; that's something that would've helped me out in the past, and I see a big need for it), we'll also be able to export things like random sample screenshots of a video.

     If I didn't care about this info becoming available to the index frontend "as quickly as possible", I could simply have the indexer download the torrent's contents and run a series of scripts (or, probably better, one optimised program) on each file as it completes. The problem is that downloading, and especially *completing* the download of, the p2p-distributed files will rarely, if ever, be anything like instantaneous.

     Most of the useful info (the metadata) will usually be available in the first few kilobytes of a file. That fact both motivates the elaborate solution I'm thinking up and undermines it: taking it into account, a MUCH simpler solution is possible that would be very effective, though probably not /as/ effective.

     Besides, I don't see why we should always need to download a file's _entire_ contents just to retrieve or test for this info. Just because we're indexing a torrent doesn't mean we need to store its entire contents (I mean the p2p-distributed data here, not the .torrent file, btw). I do have plans to store the contents of indexed torrents, if only to seed them, but all of that will happen intelligently and selectively, because we don't have infinite bandwidth and CPU power.

     Downloading the entire (unnecessary) contents of every torrent just to scan it means less scanning gets done and results come out slower. Much slower: it would be very handy to provide a good bulk of this info within minutes of a torrent's submission rather than hours. (And yes, I've thought about having the submitter fill this stuff out, smartarse!)

     In BitTorrent, the p2p-distributed data is split into a number of "pieces" of equal size (typically somewhere between 256 KB and 1 MB). The client contacts the tracker (the URL for which is specified in the .torrent file) to get a list of peers, and then learns from each peer, over the peer wire protocol, which pieces that peer has. The client picks a piece a peer has and downloads it. So BitTorrent clients control which piece is downloaded when, which is exactly perfect for what we want.
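     To make the piece talk concrete, here's a minimal sketch of pulling the piece size and piece count out of a .torrent file's metainfo. The decoder is a bare-bones bencode reader (no error handling), and the sample metainfo below is hand-built for illustration, not a real torrent.

```python
# Minimal bencode decoder -- just enough to read the fields discussed
# above (piece length, piece count) out of a .torrent metainfo.

def bdecode(data, i=0):
    """Decode one bencoded value starting at offset i; return (value, next_offset)."""
    c = data[i:i+1]
    if c == b'i':                         # integer: i<digits>e
        end = data.index(b'e', i)
        return int(data[i+1:end]), end + 1
    if c == b'l':                         # list: l<items>e
        i += 1
        items = []
        while data[i:i+1] != b'e':
            v, i = bdecode(data, i)
            items.append(v)
        return items, i + 1
    if c == b'd':                         # dict: d<key><value>...e
        i += 1
        d = {}
        while data[i:i+1] != b'e':
            k, i = bdecode(data, i)
            v, i = bdecode(data, i)
            d[k] = v
        return d, i + 1
    colon = data.index(b':', i)           # string: <length>:<bytes>
    length = int(data[i:colon])
    start = colon + 1
    return data[start:start+length], start + length

# Hand-built sample metainfo: 512 KB pieces, two 20-byte SHA-1 piece
# hashes (placeholder zero bytes here, a real torrent has real hashes).
sample = (b'd4:infod6:lengthi1000000e4:name8:demo.avi'
          b'12:piece lengthi524288e6:pieces40:' + b'\x00' * 40 + b'ee')

meta, _ = bdecode(sample)
info = meta[b'info']
piece_len = info[b'piece length']
num_pieces = len(info[b'pieces']) // 20   # one 20-byte SHA-1 hash per piece
print(piece_len, num_pieces)              # 524288 2
```

     The piece count tells the indexer exactly how the file is carved up, which is what the selective-download idea below depends on.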

     In this indexer, we'll want the info-retrieving rules to dictate what data is downloaded (or rather requested, since peers can refuse to upload, etc.), and when.

     I know this can work, but I'm stumped as to how to implement it! The interface between the rules and the indexer: I don't know what the hell it will be, or needs to be like. But whatever functions/methods I employ (seek() and read() and..?), I think the rules could be dynamically loaded shared libraries, and if I figure that much out, it'll be a miracle!
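     One possible shape for that interface, sketched very loosely: the indexer hands each rule a file-like object whose read()/seek() calls are what drive piece requests, so a rule never thinks about pieces at all. Every name here (PartialFile, avi_rule, fetch_piece) is my own invention for illustration, not an existing API; a real version would block in read() until the swarm delivers the piece.

```python
# Hypothetical rule/indexer interface: reads on a file-like view
# translate into piece requests, so rules stay piece-agnostic.

class PartialFile:
    """File-like view over a p2p-distributed file. fetch_piece is a
    callback (piece index -> bytes) standing in for the swarm."""
    def __init__(self, size, fetch_piece, piece_length):
        self.size = size
        self.piece_length = piece_length
        self._fetch = fetch_piece
        self._pos = 0
        self._pieces = {}                 # cache of pieces fetched so far

    def seek(self, offset):
        self._pos = offset

    def read(self, n):
        out = b''
        while n > 0 and self._pos < self.size:
            idx, off = divmod(self._pos, self.piece_length)
            if idx not in self._pieces:
                self._pieces[idx] = self._fetch(idx)   # request only this piece
            chunk = self._pieces[idx][off:off + n]
            out += chunk
            self._pos += len(chunk)
            n -= len(chunk)
        return out

def avi_rule(f):
    """Example rule: identify an AVI file from its RIFF header."""
    f.seek(0)
    head = f.read(12)
    if head[:4] == b'RIFF' and head[8:12] == b'AVI ':
        return {'type': 'video/x-msvideo'}
    return None

# Fake "swarm": one in-memory file served 16 bytes at a time, to show
# that running the rule only causes piece 0 to be fetched.
data = b'RIFF' + b'\x00\x00\x00\x00' + b'AVI ' + b'\x00' * 100
fetched = []
def fetch(idx):
    fetched.append(idx)
    return data[idx * 16:(idx + 1) * 16]

f = PartialFile(size=len(data), fetch_piece=fetch, piece_length=16)
print(avi_rule(f), fetched)   # {'type': 'video/x-msvideo'} [0]
```

     The nice property of this design is that rules written as plain seek/read code (whether Python plugins or dlopen'd shared libraries) automatically tell the indexer which pieces matter, just by what they touch.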

    This would be my first serious "real" programming project, so I could do with implementation inspiration!

    All comments appreciated :)
     
  2. piratepenguin

     Ok. Just gathering the metadata I want will be a massive job in itself. But it strikes me as obviously useful.

     Google helped me find vyasa, since it's also interested in metadata generation (on a comparatively very modest scale). I emailed the author:
    His reply:
     So.. it doesn't seem there are very sophisticated ways out there to dump metadata about a given file (something which strikes me as obviously useful..?). Hell, even just getting the filetype is dodgy (though MagicMimeTypeIdentifier does pretty damn well; I have reason to believe better can be done, though).
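     For anyone curious what magic-number filetype detection boils down to (this is the general technique tools like MagicMimeTypeIdentifier implement, far more thoroughly, not that tool's actual code): match the first bytes of a file against a table of known signatures. The table below is a tiny sample I put together, not exhaustive.

```python
# Rough illustration of magic-number file-type detection: compare the
# leading bytes of a file against known signatures at known offsets.

MAGIC = [
    (0, b'\x89PNG\r\n\x1a\n', 'image/png'),
    (0, b'\xff\xd8\xff',       'image/jpeg'),
    (0, b'%PDF-',              'application/pdf'),
    (0, b'ID3',                'audio/mpeg'),   # MP3 with an ID3 tag
    (4, b'ftyp',               'video/mp4'),    # signature at offset 4
]

def sniff(head):
    """Guess a MIME type from the first bytes of a file."""
    for offset, sig, mime in MAGIC:
        if head[offset:offset + len(sig)] == sig:
            return mime
    return 'application/octet-stream'           # unknown

print(sniff(b'\x89PNG\r\n\x1a\n' + b'\x00' * 8))   # image/png
print(sniff(b'\x00\x00\x00\x18ftypmp42'))          # video/mp4
```

     The point for the indexer: a sniff like this needs only the first few bytes of a file, i.e. only the first piece of the torrent, which is exactly why the selective-download scheme pays off.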
     
