OSDN: Open Source Development Network | Sat, Oct 06th, 06:29 EDT
Copyright notice: All reader-contributed material on freshmeat.net is the property and responsibility of its author; for reprint rights, please contact the author directly.

Like everyone else, when I first encountered tree directory systems, I thought they were a marvelous way to organize information. I've been around computers since 1983, and have staunchly struggled to keep files and directories neatly organized. My physical filing cabinet has always been a mess, but I clung to the hope that my hard disk would be perfect.

For many years, I could draw my full directory tree from memory. Things have changed; I'm doing more things than I can track. Today, my $HOME holds 2.4k directories, 43k files, and 1.3G bytes (this is almost all plain ASCII files -- no MS Office, no multimedia -- so 1.3G is a lot). My present filesystem has been with me uninterruptedly since 1993, and there are old things in there that I can scarcely remember. Now, I often wander around $HOME like a stranger, using file completion and "locate" to feel my way around. I recently needed some HTML files that I was sure I had once written, but I didn't know where they were. I found myself reduced to saying:

$ find ~ -name '*.html' -print | xargs egrep -il string

which is a new low in terms of having no idea where things might be.

This article is a plea for help. We're all used to devoting effort to problems of information retrieval on the net. I think it's worth worrying about inner space. What lies beneath, under $HOME? How can relevant information and files be pulled up when needed? How can we navigate our own HOMEs with less bewilderment and confusion? Can software help us do this better? I know nothing about the literature on information retrieval, but this scratches my itch.

Multiplicity of trees

We have accumulated three different tree systems for organizing different pieces of information:
This is a mess. There should be only one filesystem, one set of folders.

Email is a major culprit. Everyone I know uses a sparse set of email folders and an elaborate filesystem, so we innately cut corners in organizing email. We really need to make up our minds about how we treat email. Is email a channel, containing material which is in transit from the outside world to the "real" filesystem? In that case, the really important pieces of mail get stored in their proper directory somewhere, and all other pieces of email die. I have tried to live by this principle, with limited success. Or is email permanent (as it is for most people), in which case material on any subject is fragmented between the directory system and email folders? If so, can email folders automatically adopt the organization of the directory system? Can email files be placed alongside the rest of the filesystem?

Web browser bookmarks are a third tree-structured organization which should not exist. It's easy to imagine keeping a metadata.html file in every directory and storing the bookmarks there. The browser would inherit the tree directory structure of $HOME, and when sitting inside any one directory, the pertinent metadata would be handy.

Dhananjay Bal Sathe pointed out to me another source of escalation in the complexity of filesystems. This only affects users of software from Microsoft, so I'd never encountered it. It is MS's notion of "compound files": objects which look like normal files to the OS but are actually full directory systems (I guess they're like tarfiles). Since the content is hidden inside the compound file, you cannot use ordinary OS tools to navigate inside this little filesystem, only the application that made the compound file.
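The tarfile analogy can be made concrete: an archive really is a little filesystem packed into one file, and -- unlike an MS compound file -- the ordinary OS toolchain can at least list what is inside it. A small sketch (using a throwaway demo directory, not any real document format):

```shell
# A tarfile as a "compound file" that standard tools can still inspect.
d=$(mktemp -d)
mkdir -p "$d/doc"
echo 'chapter one' > "$d/doc/ch1.txt"
echo 'chapter two' > "$d/doc/ch2.txt"

# The whole directory becomes a single file, like a compound file:
tar -C "$d" -cf "$d/doc.tar" doc

# But unlike a compound file, any tool can look inside it:
tar -tf "$d/doc.tar"
```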
He feels that if compound files had been treated as ordinary directories of the filesystem, it would have been a "simple, beautiful, elegant" and largely acceptable solution, instead of the mess which compound files have created.

Non-text files

If you use file utilities to navigate and search inside the filesystem, you will encounter some email. I use the "maildir" format, which is nice in that each piece of email lies in a separate file. However, MIME formats are a problem. When useful text is kept in MIME form, it's harder for tools to search for and access it. MIME is probably a good idea when it comes to moving documents from one computer to another, but it seems to me that once email reaches its destination, it is better to store files in their native format. In my dream world, each directory has all the material on a subject (files, email, or metadata). Geetanjali Sampemane pointed out that this is related to questions about content-based filesystems, and suggested I look at a paper by Burra Gopal and Udi Manber on the subject (ask Google for it).

PDF and postscript documents

Postscript and PDF have worked wonders for document transmission over the Internet, but this has helped escalate the complexity of inner space:
While I'm on this subject, I should describe a file naming convention I've evolved which seems to work well. I like it if a file is named Authoryyyy_string.pdf; this encodes the last name of the author, the year, and a few bytes of a description of what the file is about. For example, I use the filename

I also take care to use this Authoryyyy_string as the key in my .bib file, so it's easy to move between the bibliography file and the documents. I often use regular expression searches on my bibliography file, and once I know I want a document, I just say

Some suggestions

I'm not an expert on information retrieval, so these are just some ideas on what might be possible, from a user perspective.
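The Authoryyyy_string convention described earlier can be sketched end to end: because the file name and the BibTeX key share the same stem, a hit in the bibliography leads straight to the document with a single glob. (The bib entry and file names below are invented examples for illustration, not the author's actual files.)

```shell
# Demo tree with files named per the Authoryyyy_string convention.
docs=$(mktemp -d)
touch "$docs/Gopal1999_contentfs.pdf" "$docs/Manber1994_glimpse.pdf"

# The .bib file reuses the same stem as its citation key.
cat > "$docs/refs.bib" <<'EOF'
@article{Gopal1999_contentfs,
  author = {Burra Gopal and Udi Manber},
  year   = {1999}
}
EOF

# Pull the key out of the bibliography, then jump to the document:
key=$(grep -o '@[a-z]*{[^,]*' "$docs/refs.bib" | sed 's/.*{//')
ls "$docs/$key".*
```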
In summary, people working in information retrieval are focused on searching the Web, but I think we have a real problem lurking in our backyard. Many of us are finding it harder and harder to navigate inside our HOMEs and find the stuff we need. I think it's worth putting some effort into making things better. There is a lot that ye designers of software can do to help, ranging from putting file completion into Mozilla to new ideas in indexing tools.

Author's bio: Ajay Shah is an associate professor at IGIDR, Bombay. His research is in financial economics, including the applications of information technology for designing financial products, markets, and trading strategies. You can find out more at http://www.igidr.ac.in/~ajayshah/.
Topic :: System :: Filesystems
Mozilla - A Web browser for X11 derived from Netscape Communicator.
[»] The right software ...

... is the answer. Better yet, the right 'filesystem': an RDBMS filesystem where files can be categorized and tagged quickly (point-and-click -- not hand-typed). Then, while browsing your filesystem, any one file could potentially be found under more than one 'directory'.
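Even without an RDBMS, the "one file under more than one directory" behavior the comment above asks for can be faked today with nothing but coreutils: keep the real files in one place and build tag directories out of symlinks. A crude sketch of the idea, not a real tagging system:

```shell
# Real files live in one canonical location; "tags" are symlink directories.
home=$(mktemp -d)
mkdir -p "$home/papers" "$home/tags/finance" "$home/tags/unix"
echo draft > "$home/papers/markets.txt"

# The same document is filed under two categories at once:
ln -s "$home/papers/markets.txt" "$home/tags/finance/"
ln -s "$home/papers/markets.txt" "$home/tags/unix/"

# Browsing either tag directory reaches the one real file:
cat "$home/tags/finance/markets.txt"
cat "$home/tags/unix/markets.txt"
```

The obvious weakness, which an RDBMS approach would fix, is that renaming or moving the real file silently breaks every symlink pointing at it.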
[»] The Semantic Web

There is a nice article, The Semantic Web, by Tim Berners-Lee et al. on the web aspect of information storage/retrieval (subtitle: "A new form of Web content that is meaningful to computers will unleash a revolution of new possibilities").

[»] It all comes down to organization skills

I've not been dealing with computers as long as most people, but in my experience, you can keep any operating system or any filesystem in general neat, tidy, easy to search, and not full of unknowns by simply keeping it organized yourself.
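The keep-it-tidy discipline advocated above can be partly automated: archive whatever hasn't been read in months, then prune. A hedged sketch assuming GNU find/tar, run here against a throwaway demo tree rather than a real $HOME (the 90-day cutoff is an arbitrary choice):

```shell
# Demo $HOME with one stale file and one fresh one.
work=$(mktemp -d)
mkdir -p "$work/home"
echo old > "$work/home/stale.txt"
echo new > "$work/home/fresh.txt"
touch -a -d '2000-01-01' "$work/home/stale.txt"   # pretend it was last read in 2000

# Collect files not accessed in ~90 days, then archive them.
find "$work/home" -type f -atime +90 -print > "$work/tobackup"
tar -czf "$work/backup.tar.gz" -T "$work/tobackup"
# xargs rm -f < "$work/tobackup"   # the destructive step, commented out here

# Verify what went into the archive before deleting anything:
tar -tzf "$work/backup.tar.gz"
```

Note that this only keeps $HOME small; as a later reply points out, it does nothing to help you find what you archived.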
[»] Re: It all comes down to organization skills

Forgive me for saying so, but the above sounds simplistic. You may be looking at the simple approach, where there generally isn't the need to implement anything like this, because usage doesn't involve large volumes of information from many sources. Deleting seems like a good workaround, but many people have home directories chock-full of "good" stuff that they really can use (if only they could correctly link it all together and index it in a timely fashion) and wouldn't think of deleting, because there WAS a reason they got it in the first place, and there is still a reason to keep it. Pruning is OK, but don't chop down the tree. (I agree with keeping your binary trees clean -- why not, right? But data is a little different. Isn't that part of the PC, the whole reason we have a PC and not just some terminal?)

[»] Re: It all comes down to organization skills
When you have large volumes of information from many sources, that is called a data repository, also known as a library. When you have a library, you have an interface to get the data you want. I believe that a filesystem should do nothing more than what its original intent was: to store data. If you want to retrieve data by 'searching' the stored contents of all your files, you should be using some sort of interface to retrieve that data. It's not the fault of the OS or filesystem that you can't find your stuff; it's your fault.
Organization. If you're unorganized, all that 'good stuff' is theoretically useless if you can't find it when you need it. Also, if you keep enough junk around that you think is 'good stuff' and you never really use it, chances are that when you go to use it, it's outdated by something far superior, or something else completely different (*laugh* ... ipfw, no wait, ipfwadm? nonono, ipchains! no wait... iptables -- that's it!). Like I said, it all comes down to organizational skills: if you aren't adept at keeping structure in your data, you can't expect to find what you want when you want it.

[»] Re: It all comes down to organization skills
I don't, yet. That's exactly what I need, but I don't have an interface to "get the data" -- just a FS into which I chuck stuff until I do. Where else would I put it, lacking a repository? I could put it into a temporary repository that doesn't yet have all the features I need, like, say, building a searchable web page set. But why do that if I can do it once, properly? No half-measures! Yes, it is a problem of knowledge management. The knowledge management systems that exist don't do it for me, and many are commercial, so they are closed. No thanks; too many bad experiences. None of them integrate tightly enough. I cited Livelink already; that is a good example of something that makes the web a repository, but it is commercial and doesn't have everything I would need to set up the repository properly.
I'm sorry, I know I screwed up; I'll do better next time. I knew that the filesystem wasn't designed for what I am trying to do. That's why I've just been storing data on it, like it was intended to be used. But wait a minute, that was my point! Maybe the filesystem needs to be extended -- not necessarily all in kernel space, but in user space as well!
Actually, it is still useful; it's just less likely that it will be used effectively.
I'm allowed to do whatever I please with my own hardware, regardless of qualification or skill (within the bounds of law). I think I should be able to. And I am trying to better keep structure *in my data* -- not in my head! That's why the computer should store structure or metadata, not just data that requires us to worry about the enforcement of the structure as an afterthought. That is error-prone, as we are not machines. Isn't the computer supposed to help us store, retrieve, and compute information? Why not design it better to do so? I'm not the computer; the computer is the computer. I want it to present the information in an optimal and timely fashion, and to offload some of the burden of remembering the structure of my data. --- Allan Fields

[»] Right On!

I think you raise a very important issue. This all makes me think of the "Future Vision" section of the Namesys page, where Hans Reiser talks about the need for a mathematical closure between applications and bringing more advanced features into the FS, much like the idea of 'the database is the filesystem'.

[»] The tree structure is one problem

I also ran into the data organization problem in 1993 (when I last lost files). I found, among other things, the file metaphor and strict tree structure to be a major mismatch with human cognition. Askemos was created to tackle the problem. It really helps. BTW: Askemos is GPLed software (soon to be recategorized at freshmeat) which faces a legal threat at the moment. Please help to keep it free -- download! Thanks

[»] Re: The tree structure is one problem

Jerry, I've taken a brief look, and find the structure a little daunting (also, unfortunately, I don't speak much German :( ) -- some of the concepts seem neat. I am interested to find out more, so I'll take another look some time soon. Good to see people working on solutions... One thing we should perhaps be careful of is to give these solutions a tight level of integration with existing facilities, so that they are intuitive to users and don't appear to be a layer on a layer on a layer of storage (the multiplicity of trees, as mentioned above). Yours appears to also be an anonymous sharing protocol?

[»] Re: The tree structure is one problem
[»] when in doubt use brute force
[»] Re: when in doubt use brute force

Why use a search engine in your $HOME? Simply delete files you don't need and back up everything you may need in the future to an external storage device, like a tape archiver or a CD-R. Your $HOME will stay small and tidy. Just make sure to go through this procedure once a week or once a month. My $HOME is organized that way, even though I store HTML, downloads, pictures, and other non-plain-text information in there. I use some well-known subdirectories, and it works perfectly. Simply tidy up! Why use complex software for things that can be achieved with a little self-discipline or even cronjobs? --

[»] Re: when in doubt use brute force
I've seen this recommended by several folks, so don't think I'm singling you out... Simply archiving and deleting things does not solve the problem; it makes it *worse*. How do you find out what the hell you've archived? "Gee, I know John sent me an article he wrote on graphing small population relationships, but which of these 50 CDs or 150 backup tapes did I put that on?" Greg

[»] What about using the remembrance agent

I have more than decent success using Bradley Rhodes' Remembrance Agent.
It does a very good job of providing me with JITIR (just-in-time information retrieval).
[»] Re: What about using the remembrance agent

Trees are inherently limited to single entry. Organizing documents in a single tree will inevitably hit that wall. The only way to break through is to use thesaurus-based keywords. The snag is that thesaurus building is a task of pharaonic proportions. The quick and dirty approach that I used successfully, when in dire need of hacking my way through 40GB of ps, pdf, txt, doc, ppt, html, and xls documents, is to use a full-text indexer with external parsers. ht://Dig has done a great job (although phrase searching is sorely lacking for now). Thesaurus-based keyword indexing is best, because documents can be hit from any semantic angle. I would love to have the time and resources to do it for my company. But in the real world, meaningful file names, a basic and sane tree, and full-text indexing on top of that will do cheaply. As far as mail is concerned, the single-entry tree problem is somewhat alleviated by virtual folder approaches such as Evolution's.

[»] Re: What about using the remembrance agent

Sorry, I hit "reply" and forgot to modify the title. My post's title should read: "Experience dealing with large numbers of heterogeneous documents". Relational data rules!

[»] There are tools...

I had the same problem. Until I discovered that
--

[»] Re: There are tools...

Yes, the tools are there. But more often than not, you need to write them yourself. Only in the last few months have I gotten my scripts down so that not even a tmp file escapes my wrath (yay for Perl). The overhead for this probably isn't worth it, and there are still a few bugs. The package (as yet unreleased) needs to work at a relatively low level to query the fs to see which files have been opened (it presently only works on x86 machines), and another cron job takes an image of the complete filesystem once a day, compares it against the previous day's, sees which files have been opened in the interim, and stores this and other data in a MySQL table. Then... every month, like clockwork, I switch my pootie on and it takes about an hour to archive all of the unused files for the period. After that, it's just a matter of scanning through the .zip's and removing what I don't really need. Seems a bit gratuitous, really. But it works. --
[»] Re: There are tools...
Also, it appears PostgreSQL and MySQL are a little behind Oracle in some of the object-over-relation framework features. Even nicer are the OO or object-relational ODBMSes like Cache, DB40, or (as open source examples) GOODS and Gigabase. On the DB access layer, another project that caught my eye was ColdStore (a persistence framework using a simple DB). And then there is J2EE, which is something to look at for Java apps. There are lots of things to look at, and there are many projects addressing specific sections of the problem... --- Allan Fields

[»] Re: There are tools...
Yes, web-based tools are probably the most complete ones. And, yes, sometimes you just get very close. But most of the time, since the web tools are (some of them, at least) rather standard, you can adapt yourself to the tools.
I will post several links but, as usual, you should check for yourself. Open source projects especially sometimes move very fast. And let's hope that others will reply, adding some more.

About the repository: I use the common CVS system that can be found at www.cvshome.org. The CVS from there can be used from the command line; it has no graphical or web interface. But once you've set up a repository (or more), you can use several of the tools that are available:

CVSWeb (http://stud.fh-heilbronn.de/~zeller/cgi/cvsweb.cgi/): a single Perl script that does not need a database. You need to have the repository set up, and that's it.

ViewCVS (http://freshmeat.net/projects/viewcvs/): the one I am using now. It is based on CVSWeb but is written in Python. You can download tarballs from your repositories, and it has syntax highlighting for a lot of file types (based on enscript). It can be used with a MySQL database, but this is not compulsory; the database does not keep the repository, just information about it.

Chora (http://horde.org/chora/): I haven't personally tried this one, but I've seen some online repositories, and it matches the previous one pretty closely.

Freepository (www.freepository.com): the one I'll use in the future :) if I have time to move all my stuff from MySQL to PostgreSQL. It is a fully web-based tool: checkin, checkout, whatever. PostgreSQL backend.

For CVS documentation or tutorials, go to the ViewCVS site; there are several links.

In principle, my web site has to have: a news system (something that grabs the news from Slashdot, freshmeat, etc.), a calendar, a bookmark manager, a place for notes, a place to put small articles that I find on the web, a photo gallery, an e-mail system, a file manager, a CVS interface, an interface to the computer administration tools, and web ssh login.

There are several tools that can do this. I will mention two (although I am sure that more -- maybe better -- ones can be found):

phpGroupWare: a multiuser groupware tool that has everything in the above list except the last two (as far as I know). The CVS interface is Chora, mentioned above. Very actively developed (if you go on SourceForge, you will almost always see it in one of the first three places). www.phpgroupware.org

PHP-Nuke plus several modules (News, Gallery, Calendar, etc.), all of which can be found on the PHP-Nuke site: www.phpnuke.org. PHP-Nuke is a system for building news sites, but you can use it for all of the above list except the last five. I am using PHP-Nuke now.

For other tasks: E-mail: there are a lot of webmail programs. If you have mail delivered to your machine, then you can use Neomail (http://neomail.sourceforge.net/) or Openwebmail, and much more. Web-based file managers: an interesting one is PhPFileFarm. Interface for administration: the best one seems to be Webmin (http://www.webmin.com/webmin/; I am running Linux, so you should check their page for other systems). Webmin also has a file manager and an ssh login shell, and much more. Or you can use a combination of tools such as MyPhPIM (http://sourceforge.net/projects/myphpim/), which has mail, calendar, todo, address book, and other tools described above.
I am not very familiar with object-oriented databases. Postgres has table inheritance, though. But for such a project (which is personal, and therefore single-user), a lightweight database seems the best choice. This is why I am not yet convinced to move my system from MySQL.
This is true. It would be more than nice to start a project for this: a personal web content manager. Oh, I mentioned ht://Dig. It is a search/indexing engine that can be found at www.htdig.org. --- Allan Fields --

[»] File naming rules

More thoughts on file naming rules:

[»] Storing files

When a file is being downloaded, the user is required to supply a filename and a path. I would really like it if authors of software (like Mozilla) gave us a command line with file completion to do this. I find the GUI interaction that they force on me extremely inefficient, and it costs so much time that when I'm in a hurry, I tend to misclassify an incoming file. This is perhaps the biggest problem: it's so easy to just dump a file in the default directory that people don't take a couple of seconds to put it somewhere sensible. A solution? Get rid of the save dialog box and replace it with a draggable icon. To save, the icon is dragged to a filer window, a directory on the panel, etc. Common save destinations (e.g., the project you're currently working on) can then be kept handy along the bottom of the screen (or wherever). See here for an implementation of this system. As any computer scientist knows, spending a little extra time storing your data can help a lot when it comes to retrieving it! BTW, I agree that an indexing agent should update as the filesystem is changed. The current massive-scan-once-a-day is slow and irritating.

[»] Re: Storing files

All good ideas; I think these types of UI innovations are what we all need!

[»] Re: Storing files
Actually, come to think of it, there's no reason to get rid of it; just implement the other approach as well and allow them to be configured on or off.

[»] Re: Storing files

Didn't Acorn's RISC OS do this wrt saving stuff?
[»] Re: Storing files
Yep; my implementation looks very similar to it.

[»] Mail / FS

This is exactly what MH mail / nmh is designed for.
[»] Data and metadata

The solution to these problems has been discussed at Tom's Hardware. The proper solution is to have a filesystem that stores metadata, such as ReiserFS, and a unified interface to it, such as OMS (an XML dialect and categorization/metadata standard for storing metadata).
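Until a metadata-aware filesystem and a unified interface exist, a sidecar file per document can approximate the idea with plain tools. A hand-rolled sketch (the .meta suffix and the keyword syntax are my invention for illustration, not part of any standard):

```shell
# One document, one sidecar file carrying its metadata.
dir=$(mktemp -d)
echo 'report text' > "$dir/q3-report.txt"
cat > "$dir/q3-report.txt.meta" <<'EOF'
keywords: finance, quarterly, 2001
source: mail from accounting
EOF

# Retrieval is then just a grep over the sidecars, mapped
# back to the documents they describe:
grep -l 'keywords:.*finance' "$dir"/*.meta | sed 's/\.meta$//'
```

Like the symlink-tagging trick earlier in the thread, this breaks silently if a document is moved without its sidecar, which is exactly the consistency problem a filesystem-level solution would take care of.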
[»] beginnings

I've been wanting a non-hierarchical organizational system for quite some time. My main reason for wanting this is to organize browser bookmarks that can belong to more than one category. So, I've written the beginnings of such a system, which can be found at zenlunatics.com. It's currently somewhere around the alpha stage, and I haven't worked on it in a while. I haven't written a bookmark manager yet, but I did write an image viewer, an mp3 player, a simple note keeper, and a utility for creating catalogs from a filesystem. For the bookmark manager, I'm thinking of modifying gnobog, galeon, or maybe mozilla (suggestions welcome). After that, I'd like to tackle the filesystem, possibly with a document launcher, although I recently read about multi-session support, which may solve that problem in a different way. Anyway, I'd really appreciate any comments on zl_catalog, including suggestions for a better name :-)
thanks,

[»] Re: beginnings

Hi. Looks good; I think you and all the other authors who have been working on these types of projects are heading in the right direction. We need to make sure we can bridge between all the apps, solutions, filesystems, transport mechanisms, etc. The library is definitely a great idea. Also, an exhaustive effort is probably required to rival the integration of some commercial environments, where integration is a goal and part of the project. I have visions of what the filesystem should be like and how it should interface to the UI/shell. They are in many ways in agreement with Reiser and the original Macintosh vision, and with some aspects of Windows (although I am no Microsoft fan) -- and many different schools of thought! I definitely agree with the author of the originating post; he has some great points!! Thanks to all who are working on a solution to this existent and persistent problem of computer science (which may have been solved already in some past era, if only we could revive the great software of the past!!! -- and which might already be solved in some expensive commercial package that I can't afford and wouldn't want to use because of the software model).

[»] Dumb data and the file system

I think that this article is brilliant, and I also enjoyed the comments.
[»] This problem is solved very neatly already. Has nobody noticed...?

BeOS solves this problem very nicely, using something similar to the suggestion in the first post.
[»] Re: This problem is solved very neatly already. Has nobody noticed...?

To expand on the above: BeOS did this well because the mechanism was so integrated into the OS, and all applications made use of it. Scot Hacker had a great article on BeOS filetypes at www.byte.com a while back that explained how apps such as the web browser would automatically fill in certain attributes for the user (such as the source URL, the date it was downloaded, the mimetype, etc.), and how mp3 rippers would populate the song title, author, track length, etc. My understanding is that the issue of filesystem metadata has been discussed on the Linux kernel mailing list, and Linus and co. are trying to work out a proper implementation.

[»] Filesystems, objects, databases and a command line interface...

Thanks for this inspiring article and comments. It got me started thinking about how such an information retrieval system could be organised without dropping too many of the benefits many people like about their Unix-like systems. There is not necessarily much new in the text below.
[»] Indexing PDFs

I don't think having a bib file for every PDF is a good solution. PDF provides the possibility of metadata inclusion (pdflatex lets you do this, and I'm sure Adobe's own tools do as well). Unfortunately, almost nobody uses this correctly. You can store title, author, date, keywords, and probably more in a PDF. Command-line tools like pdfinfo (or Ctrl-D in Acrobat Reader) show this information. Additional files always get lost or do not get updated, so store the information in the files themselves!

[»] Problems with superstrings

The idea is good; there is just a minor problem with character encoding. I don't say this cannot be solved, but I noticed that with LaTeX, special characters often get described in a visual way, e.g. a small 'a' with two dots on top of it. While this results in nice output, the information that this really is an umlaut 'a' (as in German Bär or Jägermeister) gets lost (it does not get stored in the dvi or pdf file that results from the (pdf)latex run). So searching for anything with ä in it doesn't work anymore.
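The metadata the "Indexing PDFs" comment describes lives in the PDF's Info dictionary, which in an uncompressed file is plain text that even grep can reach. A crude sketch using a hand-written toy fragment (the filename follows the article's Authoryyyy_string convention; the title and keywords are invented; real PDFs often compress this object, so use pdfinfo for the general case):

```shell
# Build a toy, uncompressed PDF fragment containing an Info dictionary.
# (Not a valid complete PDF; just enough structure to demonstrate the idea.)
d=$(mktemp -d)
cat > "$d/Shah2001_retrieval.pdf" <<'EOF'
%PDF-1.2
1 0 obj
<< /Title (Organizing inner space)
   /Author (Ajay Shah)
   /Keywords (filesystems, information retrieval) >>
endobj
EOF

# Crude metadata extraction without any PDF tools:
grep -o '/Author ([^)]*)' "$d/Shah2001_retrieval.pdf"
```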
[»] MacOS X "bundles" In MacOS X we have "compound files" which can be navigated like
the directory structures they really are. In the MacOS Finder they look and
work like a single file, but after opening a shell you can easily see
what's inside.
[»] See also..

Story on OSNews: http://www.osnews.com/story.php?news_id=69

[»] Isn't Oracle iFS the answer (albeit an expensive one)

I do not have any experience with Oracle iFS (Internet File System), but from what I have heard and read about it, it seems to be the answer.
[»] command line vs gui

In regards to being able to call up a command-line interface when saving: why not integrate command-line functionality into the GUI? For example, why not have path name completion in the filename text box, so a user can use either the GUI or the command line? I feel that GUIs need to have the power of command lines, but I have yet to witness this. Another idea I have is a find-file dialog (like the one in Windows Explorer) that allows regular expressions. Why give up the command line for a GUI when we can have the command line built into the GUI?