
 Information retrieval from $HOME
 by Ajay Shah, in Editorials - Thursday, August 30th 2001 00:00 EDT

Like everyone else, when I first encountered tree directory systems, I thought they were a marvelous way to organize information. I've been around computers since 1983, and have staunchly struggled to keep files and directories neatly organized. My physical filing cabinet has always been a mess, but I clung to the hope that my hard disk would be perfect.


For many years, I could draw my full tree directory from memory. Things have changed; I'm doing more things than I can track. Today, my $HOME is 2.4k directories, 43k files, and 1.3G bytes (this is almost all plain ASCII files -- no MS Office, no multimedia -- so 1.3G is a lot). My present filesystem has been with me continuously since 1993, and there are old things in there that I can scarcely remember. Now, I often wander around $HOME like a stranger, using file completion and "locate" to feel my way around. I recently needed some HTML files that I was sure I had once written, but I didn't know where they were. I found myself reduced to saying:

	$ find ~ -name '*.html' -print | xargs egrep -il string

which is a new low in terms of having no idea where things might be.

This article is a plea for help. We're all used to devoting effort to problems of information retrieval on the net. I think it's worth worrying about inner space. What lies beneath, under $HOME? How can relevant information and files be pulled up when needed? How can we navigate our own HOMEs with less bewilderment and confusion? Can software help us do this better? I know nothing about the literature on information retrieval, but this scratches my itch.

Multiplicity of trees

We have accumulated three different tree systems for organizing different pieces of information:

  1. The filesystem
  2. Email folders
  3. Web browser bookmarks

This is a mess. There should be only one filesystem, one set of folders.

Email is a major culprit. Everyone I know uses a sparse set of email folders alongside an elaborate filesystem, so we all end up cutting corners in organizing email.

We really need to make up our minds about how we treat email. Is email a channel, containing material which is in transit from the outside world to the "real" filesystem? In this case, the really important pieces of mail will get stored in their proper directory somewhere, and all other pieces of email will die. I have tried to achieve this principle in my life, with limited success.

Or is email permanent (as it is for most people), in which case material on any subject is fragmented between the directory system and email folders? If so, can email folders automatically adopt the organization of the directory system? Can email files be placed alongside the rest of the filesystem?

Web browser bookmarks are a third tree-structured organization which should not exist. It's easy to imagine a convention of keeping a metadata.html file in every directory and storing the bookmarks there. The browser would inherit the tree directory structure of $HOME, and when sitting inside any one directory, the pertinent metadata would be handy.
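
A crude approximation is already possible with a one-liner that gathers such per-directory bookmark files into a single page the browser can load. This is only a sketch, and metadata.html is a convention I'm imagining, not an existing standard:

	$ find ~ -name metadata.html | sed 's|.*|<a href="file://&">&</a><br>|' > ~/all-bookmarks.html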

Dhananjay Bal Sathe pointed out to me another source of escalating filesystem complexity. It only affects users of Microsoft software, so I'd never encountered it: MS's notion of "compound files", objects which look like normal files to the OS but are actually full directory systems (I guess they're like tarfiles). Since the content is hidden inside the compound file, you cannot use ordinary OS tools to navigate inside this little filesystem, only the application that made it. He feels that if compound files had been treated as ordinary directories of the filesystem, it would have been a "simple, beautiful, elegant" and largely acceptable solution instead of the mess which compound files have created.

Non-text files

If you use file utilities to navigate and search inside the filesystem, you will encounter some email. I use the "maildir" format, which is nice in that each piece of email lies in a separate file. However, MIME formats are a problem. When useful text is kept in MIME form, it's harder for tools to search for and access it.

MIME is probably a good idea when it comes to moving documents from one computer to another, but it seems to me that once email reaches its destination, it is better to store files in their native format.
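
The decoding can be done on arrival. A sketch using munpack from the mpack package (assuming it is installed; the maildir filename below is made up):

	$ munpack -C ~/work/project < ~/Maildir/cur/994443212.12345.host

The MIME parts land in ~/work/project as ordinary files, which grep and friends can then see.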

In my dream world, each directory has all the material on a subject (files, email, or metadata), and grep would work correctly, without being blocked by MIME-encoded files.

Geetanjali Sampemane pointed out that this is related to the questions about content-based filesystems, and suggested I look at a paper by Burra Gopal and Udi Manber on the subject (ask Google for it).

PDF and postscript documents

Postscript and PDF have worked wonders for document transmission over the Internet, but this has helped escalate the complexity of inner space:

  • As with MIME, .ps and .pdf files are not amenable to regular expression searches the way text files are.
  • An interesting and subtle consequence of the proliferation of .ps and .pdf files in my filesystem is that a larger fraction of the files there are alien. In the olden days, every file that was in my filesystem was mine. It used my file naming conventions, etc., so when I wandered around my filesystem, I knew my way. Today, there are so many alien files hanging around that it reduces my confidence that I know what is going on.
  • Every now and then, I notice a .pdf file "which is going to be invaluable someday", and snarf it. If I'm lucky, it has a sensible filename, and if I'm lucky, I'll place it in the correct place in my filesystem. In this case, there's a bit of a hope that it'll get used nicely in the future. Unfortunately, a lot of people use incomprehensible names for .pdf files, such as ms6401.pdf, seiler.pdf, D53CCFF4C9021C19988841169FB6FD6EC1D56F711.pdf, and sr133.pdf. I find that interactive programs like Web browsers, email programs, etc. are clumsy at navigating tree directories, so my habit is to save into /tmp, then move the file using the commandline. Sometimes, I'm in too much of a hurry, and this gets messed up. Now and then, I place an incoming file into $HOME/JUNKPDF, hoping that I'll get around to organizing it later.

While I'm on this subject, I should describe a file naming convention I've evolved which seems to work well. I like it if a file is named Authoryyyy_string.pdf; this encodes the lastname of the author, the year, and a few bytes of a description of what this file is about. For example, I use the filename SrinivasanShah2001_fastervar.pdf for a paper written by Srinivasan and Shah in 2001 about doing VaR faster.

I also take care to use this Authoryyyy_string as the key in my .bib file, so it's easy to move between the bibliography file and the documents. I often use regular expression searches on my bibliography file, and once I know I want a document, I just say locate Authoryyyy to track it down.
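
For instance, assuming the bibliography lives in ~/refs.bib:

	$ egrep -i 'faster.*var' ~/refs.bib     # find the entry and its key
	$ locate SrinivasanShah2001             # then track down the document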

Some suggestions

I'm not an expert on information retrieval, so these are just some ideas on what might be possible, from a user perspective.

  • Email and Web bookmarks. As mentioned above, we really need a solution to the problem of email folders versus Web bookmark folders versus the filesystem. I'd like to have a MUA and a Web browser which treat my normal filesystem as the classification scheme to use, and save information in the corresponding directories. Every time I make changes to the directory structure, the MUA and browser should automatically pick up the new structure.
  • Fulltext search. I think we should have fulltext search engines which are hooked into the filesystem. Every time a file under $HOME changes, the search engine should update its indexes. Like Google, this search engine should walk into .html, .pdf, and .ps files and index all the text found therein. This will give us the ability to search inside inner space.
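
    Batch indexers such as glimpse (co-written by the same Udi Manber mentioned above) already come close, though they must be re-run rather than being hooked into the filesystem. A rough sketch:

	$ glimpseindex $HOME           # build the index; re-run (e.g. from cron) to refresh
	$ glimpse -i 'search string'   # query it
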
  • URLs-as-symlinks. If we had a fulltext search engine which worked on $HOME, it'd be nice if we could have a concept of a symlink which links to a URL. This reduces overhead in the filesystem, and ensures that one is always accessing the most recent version of the file (in return, one suffers from the problem of stale links, but hopefully producers of information will be careful to leave redirects). By placing symlinks into my directory, I'd feed PDF or PS files into the universe that my personal search engine indexes. These files would be just as usable as normal downloaded files as far as Unix operations such as reading, printing, emailing, etc. are concerned. Web browsers should give me a choice between downloading the file and placing a symlink with a filename of my choice in a directory of my choice.
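
    A fraction of this can be faked today with a dangling symlink whose target string is the URL, plus a fetch through readlink (the URL below is made up, and readlink may not be available on every system):

	$ ln -s 'http://www.example.org/papers/Shah2001_retrieval.pdf' Shah2001_retrieval.pdf
	$ wget -O /tmp/t.pdf "$(readlink Shah2001_retrieval.pdf)"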

    Dhananjay Bal Sathe reminded me that there is a good case for doing this on a more ambitious scale, to comprehensively support URLs as files so one would be able to say
    $ cp URL file
    or
    $ lynx http://fqdn/path/a.html
    :-) and it should work just fine. This goes beyond just symlinks.

  • Digital libraries. I have seen software systems like Greenstone which do a good job of being digital library managers, and they may be part of the solution.
    I have sometimes toyed with the idea of using a digital library manager for all alien files. I could have a lifestyle in which every time I got a .pdf or .ps file from the net, I would simply toss it at the digital library software. (It would be nice if Mozilla and wget supported such a lifestyle with fewer keystrokes.) The digital library manager of my dreams would extract all the text from these files and fulltext index them (something that most library managers do not do), and it would not force me to type too much information about the file (which most of them do).
    The logical next step of this idea is a digital library manager which just scours my $HOME ferreting out all files and fulltext indexing them, and that seems like a better course. In this case, it's just my fulltext search engine which indexes everything in $HOME.
  • Bibliographical information for the library manager. One path for progress could be for people who publish .pdf and .ps documents on the Web to adopt some standard through which XML files containing bibliographical information about them are also available. Every URL http://path/file.pdf should be accompanied by a http://path/file.bib.xml, which contains the information.
    I know one initiative -- RePEc -- in which people supplying .pdf or .ps files also supply bibliographical information about them, but I think it's not quite there yet; it requires too much overhead. The proposal above is simpler. Every time a client fetches http://path/file.pdf, it can test for the existence of http://path/file.bib.xml, and if that's found, the user is spared the pain of typing bibliographical information into his digital library manager.
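
    A client-side sketch of that test (the paths are, of course, placeholders):

	$ wget http://path/file.pdf
	$ wget http://path/file.bib.xml || echo 'no bibliographical metadata offered'
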
  • A user interface for supplying a path. When a file is being downloaded, the user is required to supply a filename and a path. I would really like it if authors of software (like Mozilla) gave us a commandline with file completion to do this. I find the GUI interaction that they force me to have extremely inefficient, and it costs so much time that when I'm in a hurry, I tend to misclassify an incoming file. File completion is the fastest way to locate a directory inside a filesystem, and I think I should at least have the choice of configuring Mozilla to use it instead of the silly GUI interface. When we re-engineer Unix to make it easy-to-learn, we should not give up easy-to-use.
  • Quality scoring in inner space. A search string will get hundreds of hits on a fulltext search engine, so how can software give us a better sense of which are the important documents and which aren't? In the problem of searching inside inner space, Google's technology (of counting hyperlinks to you) will not work. A few things that might help in inventing heuristics:
    1. The most recently read or written files should be treated as more important.
    2. Files that are accessed more often should be treated as more important. (This will require instrumenting the filesystem component inside the kernel.)
    3. Makefiles articulate relationships between files. An information retrieval tool that crawls around $HOME should use this information when it exists. Targets in makefiles are less important, and files mentioned in make clean or make squeaky are less important still (a crude sketch of harvesting this follows the list).
      As an example, such intelligence would really help an information retrieval tool which hit my $HOME. In every document directory, I have a Makefile, and the tool could use it to learn that a few .tex files matter, and the .pdf or .ps files do not (since they are all produced by the Makefile, and mentioned in make clean and make squeaky).
    4. "My files are more important than files by others" is a useful principle, but it's difficult to accurately know the authorship of a file. The URLs-as-symlinks idea (mentioned earlier) can help. If I have snarfed a .pdf file down into a directory, the search engine has no way of knowing that it's an alien file. If I have left a symlink to the .pdf file, the search engine knows this should be indexed, but at a lower priority.
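
      A crude way to harvest the third heuristic without running anything: make -n prints the commands instead of executing them, so the words of the resulting rm lines are roughly the derived files:

	$ make -n clean 2>/dev/null | tr -s ' ' '\n' | egrep -v '^(rm|-.*)?$'
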
  • Less is more -- how to store less. One way to reduce the complexity of the filesystem is to help people feel comfortable about not downloading from the net. When I see a page on the net that looks interesting, I tend to download it and keep a local copy, partly because I'm thinking that I might not be able to find it later.
    Instead, I'd like to hit a button on the browser which talks to Google and says "I think this page could be useful to me." From this point on, when I do searches with Google, this page should earn a higher relevancy score. If a large number of people used Google in this fashion, it would be a new and powerful way for Google to obtain information about the quality of pages on the Web.
  • Superstrings. I think we need a tool called superstrings which thinks intelligently about the files it is facing. If the file it faces is a normal textfile, superstrings is just strings(1), but if it faces .pdf, .ps, MIME, etc. it should extract the useful text with greater intelligence than ordinary strings(1). This can be combined with grep, etc., to improve tools for information access in the filesystem.
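
    A minimal sketch of such a wrapper, dispatching on file type (it assumes pdftotext from xpdf and ps2ascii from ghostscript are installed):

	#!/bin/sh
	# superstrings: extract searchable text, minding the file format
	case "$1" in
	    *.pdf) pdftotext "$1" - ;;    # PDF to text on stdout
	    *.ps)  ps2ascii "$1" ;;       # PostScript to text
	    *)     strings "$1" ;;        # fall back to plain strings(1)
	esac

    Then superstrings file | grep -i string behaves sensibly regardless of format.
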
  • Help me delete files. Deleting files is one important way of reducing complexity. I'd like to get data about what parts of my filesystem I am never reading/touching. I could launch into spring cleaning every now and then and blow away files and directories that are really obsolete, supported by evidence about what I tend to use and what I tend to ignore. Note that I'm only envisioning a decision support tool, not an automated tool which deletes infrequently-used files. (Once again, this will require instrumenting the filesystem component inside the kernel.)
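
    Unix already records access times, which gives a first cut at such evidence; atime only captures the most recent access, which is exactly why anything finer needs the kernel instrumentation:

	$ find ~ -type f -atime +365 -print > spring-cleaning-candidates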

In summary, people working in information retrieval are focused on searching the Web, but I think we have a real problem lurking in our backyard. Many of us are finding it harder and harder to navigate inside our HOMEs and find the stuff we need. I think it's worth putting some effort into making things better. There is a lot that ye designers of software can do to help, ranging from putting file completion into Mozilla to new ideas in indexing tools.


Author's bio: Ajay Shah is an associate professor at IGIDR, Bombay. His research is in financial economics, including the applications of information technology for designing financial products, markets, and trading strategies. You can find out more at http://www.igidr.ac.in/~ajayshah/.



 Referenced categories

Topic :: System :: Filesystems
Topic :: Text Processing :: Indexing
Topic :: Internet :: WWW/HTTP :: Indexing/Search
Topic :: Communications :: Email :: Email Clients (MUA)
Topic :: Internet :: WWW/HTTP :: Browsers

 Referenced projects

Mozilla - A Web browser for X11 derived from Netscape Communicator.
Greenstone - A system for digital library creation, management, and distribution.

 Comments

[»] The right software ...
by dean - Aug 30th 2001 00:44:03

... is the answer. Better yet, the right 'filesystem' ... an RDBMS filesystem where files can be categorized and tagged quickly (point-and-click -- not hand-typed). Then, while browsing your filesystem any one file could potentially be found under more than one 'directory'.

The methods mentioned in the article are sound. But trying to keep ALL your files on disk seems wasteful. Old stuff not seen in years should probably be backed-up, catalogued and removed.

Hierarchical storage requires rigid organization and diligent maintenance. It has probably outlived its usefulness.


[»] The Semantic Web
by Joerg Fehlmann - Aug 30th 2001 01:36:41

There is a nice article, The Semantic Web, by Tim Berners-Lee et al. on the web aspect of information storage/retrieval.

(subtitle "A new form of Web content that is meaningful to computers will unleash a revolution of new possibilities")


[»] It all comes down to organization skills
by Eli Sand - Aug 30th 2001 02:46:30

I've not been dealing with computers as long as most people, but in my experience, you can keep any operating system or any file system in general neat, tidy, easy to search, and not full of unknowns by simply keeping it organized yourself.

Most Linux distributions suffer from tons of useless garbage lying around in common directories such as /usr/bin (bet you don't even know what half those programs are for!) and the like. I find that most people who use computers tend to be messy slobs when it comes to organizing their data.

My $HOME is virtually blank: I have my public_html and a couple of things I save for later reference, which I quickly move or delete when I'm done with them. I've tidied up every directory in Linux, including /usr/bin (and I know what every program does). I know where every file is, and only resort to using locate or find when I need to get a list of files that may match a specific pattern (eg: manually removing a program after install and not watching 'make install'). My Windows partitions are kept spanking clean, and I know what stuff should and shouldn't be there. The only exception is my System directory in Windows; I just try to keep from installing useless programs and to know what .dlls I need.

So when you step back and take a look at it all, you don't need to re-invent the wheel on where things should be stored, how to store them, and all that jazz.

Don't collect so much useless crap you think might be useful - keep it if it IS useful and bloody well use it, and when you're done, delete what you don't need!

If you don't like a file name, read up on the ren/mv commands! Amazingly I have yet to come across a file system that DOESN'T let you rename files!

As for other data files that contain non-plaintext data, use a useful filename (after all, the name is text, so renaming it will not break stuff) and store it somewhere that makes sense. JUNKPDF isn't such an example - try something like Filesystems-renaming.pdf

The only flaw to my rant is that certain software bundles do have dorks who love to make a mess of your nicely structured file system. I just don't use their software, or, if it's for Linux and I've got their source, I try to edit what I can to make it better (then send them a .patch - usually pisses them off :)


[»] Re: It all comes down to organization skills
by Allan Fields - Aug 30th 2001 04:35:47

Forgive me for saying so, but the above sounds simplistic. You may be looking at the simple case, where there generally isn't the need to implement anything like this, because usage doesn't involve large volumes of information from many sources. Deleting seems like a good workaround, but many people have home directories chock-full of "good" stuff that they really can use (if only they could correctly link it all together/index it in a timely fashion) and wouldn't think of deleting, because there WAS a reason they got it in the first place, and there is still a reason to keep it. Pruning is OK, but don't chop down the tree. (I agree with keeping your binary trees clean, why not, right? But data is a little different. Isn't that the whole reason we have a PC and not just some terminal?)


[»] Re: It all comes down to organization skills
by Eli Sand - Aug 31st 2001 01:10:45


> ... because usage doesn't involve
> large volumes of information from
> many sources.



When you have large volumes of information from many sources, that is called a data repository, also known as a library. When you have a library, you have an interface to get the data you want. I believe that a filesystem should do nothing more than what its original intent was - to store data. If you want to retrieve data by 'searching' the stored contents of all your files, you should be using some sort of interface to retrieve that data. It's not the fault of the OS or filesystem that you can't find your stuff - it's your fault.



> ...Deleting seems like a good workaround,
> but many people have home directories
> chock-full of "good" stuff that they
> really can use ...

Organization. If you're unorganized, all that 'good stuff' is theoretically useless if you can't find it when you need it. Also, if you keep enough junk around that you think is 'good stuff' and you never really use it, chances are when you go to use it, it's outdated by something far superior, or something else completely different (*laugh* ... ipfw, no wait, ipfwadm? nonono, ipchains! no wait... iptables - that's it!).

Like I said, it all comes down to organizational skills - if you aren't adept at keeping structure in your data, you shouldn't be allowed to find what you want when you want it.


[»] Re: It all comes down to organization skills
by Allan Fields - Aug 31st 2001 01:36:27


> When you have large volumes of
> information from many sources, that is
> called a data repository, also known as
> a library. When you have a library, you
> have an interface to get the data you



I don't yet.. that's exactly what I need, but I don't have an interface to "get the data" -- just a FS which I chuck stuff into until I do. Where else would I put it, lacking a repository? I could put it into a temporary repository that doesn't have all the features I need yet like, say, building a searchable web page set. But why do that if I can do it once, properly? No half-measures!

Yes, it is a problem of knowledge management. The knowledge management systems that exist don't do it for me, and many are commercial, so they are closed. No thanks. Too many bad experiences.

None of them integrate tightly enough. I cited Livelink already; that is a good example of something that makes the web a repository, but it is commercial and doesn't have everything I would need to set up the repository properly.



> want. I believe that a filesystem
> should do nothing more than what its
> original intent was - to store data. If
> you want to retrieve data by 'searching'
> the stored contents of all your files,
> you should be using some sort of
> interface to retrieve that data. It's
> not the fault of the OS or filesystem
> that you can't find your stuff - it's
> your fault.



I'm sorry, I know I screwed up, I'll do better next time. I knew that the filesystem wasn't designed for what I am trying to do..
That's why I've just been storing data on it, like it was intended to be used.

But wait a minute, that was my point! Maybe the filesystem needs to be extended -- not necessarily all in the kernel space, but the user space as well!



> Organization. If you're unorganized,
> all that 'good stuff' is theoretically
> useless if you can't find it when you
> need it. Also, if you keep enough junk
> around that you think is 'good stuff'
> and you never really use it, chances are

Actually, it is still useful; it's just less likely that it will be used effectively.



> Like I said, it all comes down to
> organizational skills - if you aren't
> adept at keeping structure in your data,
> you shouldn't be allowed to find what
> you want when you want it.

I'm allowed to do whatever I please with my own hardware, regardless of qualification or skill (within the bounds of law). I think I should be able to. And I am trying to better keep structure *in my data* -- not in my head! That's why the computer should store structure or metadata, not just data that requires us to worry about the enforcement of the structure as an afterthought. That is error prone, as we are not machines.

Isn't the computer supposed to help us store, retrieve, and compute information? Why not design it better to do so? I'm not the computer; the computer is the computer. I want it to present information in an optimal and timely fashion, and to offload some of the burden of remembering the structure of my data.

---
Allan Fields


[»] Right On!
by Allan Fields - Aug 30th 2001 04:09:48

I think you raise a very important issue. This all makes me think of the "Future Vision" section of the Namesys page, where Hans Reiser talks about the need for a mathematical closure between applications and bringing more advanced features into the FS much like the idea of 'the database is the filesystem'.

Sometimes I am rather stumped as to how I can organize all my files well, simply because of the sheer volume. Sometimes I wonder where it all comes from. =) One thing is for sure: the current system isn't making it natural to file it all away as it comes in.

Because I hate recasting my thoughts into the separate islands that are file formats, and specifically new file formats that I am unfamiliar with, just to decode them at a later date, I don't feel like I can naturally organize my files, ideas, correspondence, etc. in an intuitive or overall advanced fashion. There is always the fear that once it is all organized, it will be organized in the wrong file format/directory structure for when I need to use it all again, or that when the next file format comes along, I'll have to do it over again. Then there is the issue of remembering what the files that I place in directories are for, where/when I saved that specific file I have in mind, which files it related to -- and how it ties into the concept... Add to that the number of computers I use for personal use, work, etc. It all seems unmanageable if I don't just cast off the old and archive it all up for that 'some day, when I am gonna organize this all'.

Additionally, when I try organizing PDFs (that contain specs or product information, for instance) and all the other files that I yank off various pages from the Internet, I am at an even greater loss as to how to integrate them into a logical structure and relate them to the existing files. It's much like a problem where you can define so many bins that you don't know or remember which to drop something in, thus defeating their purpose. What if it belongs in 2 or more drop-boxes? And it sure would be nice to link those PDFs to the source websites and to the related searches on Google.

How do I save my concepts of linkage between all these files and URLs and emails? I can't remember it all, especially after a few years. It's a mess!!

I fear the trend towards a small country's population of different solutions above the scope of the OS, with thousands of different approaches implementing the same structure over and over, without any coordination. XML can help, but I am not confident that any markup/hypertextual system alone is enough for anything past a level of interoperability. Not that interoperability would be bad or anything, but...

It might be nice if we could get something to organize it all that is a unified standard and, heck, even cross platform. Can anyone suggest what I should look at?

GroupWare (PHPGroupware looks promising) and tools like Livelink (not Open Source) are a start, but they also fall short in that they all just build on top of the file system and operating system, removing the convenience of the UNIX 'everything is a file' idea, accessible at a flat scope. Maybe our filesystems just aren't advanced enough to handle the load we are trying to put on them.

In the case of commercial tools that might suit my needs, they are all too pricey. I also refuse to store my data in any Windows file formats.. too many bad experiences; I don't buy the Microsoft integration concept.. Call it a lack of trust that Microsoft can ever be compatible without sucking my time and dollars into a downward spiral of non-addressable bugs and unholy propriety that requires me to switch from Windows at a great cost of time in the end anyway.

I might have some ideas for the KDE team so that they can avoid the same problems. Can we please get past this dark age of the stand-alone application to something that finally draws some closure? ;)


---
Allan Fields


[»] The tree structure is one problem
by Jerry - Aug 30th 2001 05:08:25

I also ran into the data organization problem in 1993 (the last time I lost files).

I found, among other things, the file metaphor and strict tree structure a major mismatch with human cognition. Askemos was created to tackle the problem. It really helps.

BTW: Askemos is GPLed software (soon to be recategorized at freshmeat), which faces a legal threat at the moment. Please help to keep it free -- download! Thanks


[»] Re: The tree structure is one problem
by Allan Fields - Aug 31st 2001 02:07:50

Jerry, I've taken a brief look, and find the structure a little daunting (also, unfortunately, I don't speak much German :( ) -- some of the concepts seem neat.. I am interested to find out more, so I'll take another look some time soon. Good to see people working on solutions... One thing we perhaps should be careful of is to allow these solutions a tight level of integration with existing facilities, so that they are intuitive to users and don't appear to be a layer on a layer on a layer of storage (the multiplicity of trees, as mentioned above). Yours appears to also be an anonymous sharing protocol?


[»] Re: The tree structure is one problem
by Jerry - Aug 31st 2001 05:14:00


> I've taken a brief look, and find the
> structure a little daunting (also
> unfortunately I don't speak much German
> :( ) -- some of conepts seem neat.. I am

Thanks. Yes, I know there are several years of work to be documented. I appreciate all comments on how to improve the documentation structure. Promise: the German is going to be translated.


> One thing we perhaps should be careful
> of is to allow these solutions to have a
> tight level of integration to existing
> facilities so that they are intuative to
> users and don't appear to be a layer on

That's a main point of Askemos. It was actually started when I realized that I can understand files, but my dad, a philosopher, could not.

It's certainly not his fault.


> a layer on a layer of storage (The
> multiplicity of trees - as mentioned
> above). Yours appears to also be an

That's about technology. Askemos stores its data in one repository (two files, provided by RScheme's pstore module). Within that repository, you find internally hash tables and document trees.

The technology is called pointer swizzling at page fault time, which says it all.


> anonymous sharing protocol?

Not exactly yet. There is one needed.

Askemos is by definition based on standards wherever feasible. For the sharing, I currently go through SOAP. This is not the final solution.




[»] when in doubt use brute force
by rumblefish - Aug 30th 2001 05:18:08


I think all these fancy techniques are not really needed. Look at history: there were a lot of early search engines and systems designed by architecture astronauts, such as WAIS, which never got anywhere. In contrast, look at the absolutely brilliant google, which cares nothing for categories or semantics. I use google for everything, in preference even to categorised vendors' support pages for my support issues.


When in doubt, use brute force.
http://www.tuxedo.org/~esr/jargon/html/entry/brute-force.html


[»] Re: when in doubt use brute force
by Matthias Arndt - Aug 30th 2001 05:37:16

Why use a search engine in your $HOME? Simply delete files you don't need and back up everything you may need in the future to an external storage device like a tape archive or a CD-R. Your $HOME will stay small and tidy. Just make sure to go through this procedure once a week or once a month. My $HOME is organized that way, even though I store HTML, downloads, pictures, and other non-plain-text information in there. I use some well known subdirectories and it works perfectly. Simply tidy up! Why use complex software for things that can be achieved with a little self-discipline or even cronjobs?

--
... from Matthias Arndt ICQ: 40358321 Homepage on request


[»] Re: when in doubt use brute force
by Greg Holt - Sep 7th 2001 07:56:31


> Simply delete files not needed and
> backup everything you may need in the
> future to an external storage device
> like a tape archiver or a cd-r.



I've seen this recommended by several folks, so don't think I'm singling you out...

Simply archiving and deleting things does not solve the problem; it makes it *worse*. How do you find out what the hell you've archived? "Gee, I know John sent me an article he wrote on graphing small population relationships, but which of these 50 CDs or 150 backup tapes did I put that on?"

Greg


[»] What about using the remembrance agent
by virtualizer - Aug 30th 2001 07:07:02

I have more than decent success using Bradley Rhodes' Remembrance Agent. It does a very good job of providing me with JITIR (just-in-time information retrieval).


[»] Re: What about using the remembrance agent
by Jean-Marc Liotier - Aug 30th 2001 07:37:39

Trees are inherently limited to single entry. Organizing documents in a single tree will inevitably hit that wall. The only way to break through is to use thesaurus-based keywords. The snag is that thesaurus building is a task of pharaonic proportions.

The quick and dirty approach that I used successfully when in dire need of hacking my way through 40GB of ps, pdf, txt, doc, ppt, html, and xls documents is to use a full text indexer with external parsers. ht://Dig has done a great job (although phrase searching is sorely lacking for now).

Thesaurus-based keyword indexing is best because documents can be hit from any semantic angle. I would love to have the time and resources to do it for my company. But in the real world, meaningful file names, a basic and sane tree, and full text indexing on top of that will do cheaply.

As far as mail is concerned, the single entry tree problem is somewhat alleviated by virtual folder approaches such as Evolution's.


[»] Re: What about using the remembrance agent
by Jean-Marc Liotier - Aug 30th 2001 07:41:46

Sorry, I hit "reply" and forgot to modify the title. My post's title should read: "Experience dealing with large numbers of heterogeneous documents". Relational data rules!


[»] There are tools...
by Sorin Milutinovici - Aug 30th 2001 08:10:50

I had the same problem, until I discovered that there are a lot of tools that can help. Sure, one has to find all those tools and select the best of them. The starting point for me was the desire to have one (or two) places into which my important stuff goes, ideally with a common interface for all of it. And the only environment that is ready to deal with all sorts of objects is the web. Therefore, my way of solving the problem is:

- Use a Perl- and PHP-enabled web server on your own computer
- Use a personal information system (there are several out there; I use MyPhPPim) with a web interface, connected to a MySQL database. Into that database go all your E-mail, notes, todos, etc.
- Use a bookmark manager connected to the same MySQL server, with a web interface
- Use a web file manager system (such as phpFileFarm) to work with the pdf, html, and ps files
- Use a web photo album to keep your photos (of course with a database back end)
- Use a cvs system for ASCII work in progress, and install a web cvs system (I use viewcvs)
- Finally, use HtDig or another search engine to index the whole lot. Configure htdig to search in separate directories or in all of them.

Several more ideas:

- Use the same database engine (mysql or postgres or another) to minimize the load
- Back up all the databases and the cvs system daily, to a separate partition (or computer)
- Back up the pdf, ps, and html directories weekly

And to add a little touch, make a script that checks the following into the cvs system daily:

- ls -lR of important directories
- system settings

Your computer will have to work for an hour during the night, but... you have a clever system.

--
Sorin M


[»] Re: There are tools...
by Caglios - Aug 31st 2001 03:33:26

Yes, the tools are there, but more often than not you need to write them yourself. Only in the last few months have I got my scripts down so that not even a tmp file escapes my wrath (yay for Perl). The overheads for this probably aren't worth it, and there are still a few bugs. The package (as yet unreleased) needs to work at a relatively low level to query the fs to see which files have been opened (it presently only works on x86 machines), and another cron job takes an image of the complete filesystem once a day, compares it against the previous day's, sees which files have been opened, and stores this and other data in a MySQL table. Then... every month, like clockwork, I switch my pootie on and it takes about an hour to archive all of the unused files for the period. After that, it's just a matter of scanning through the .zips and removing what I don't really need. Seems a bit gratuitous, really. But it works.

--
TIAS Systems inc. 2001 http://www.geocities.com/dacagman


[»] Re: There are tools...
by Allan Fields - Sep 2nd 2001 05:26:14


> Yes, the tools are there. But more
> often than not you need to write them
> yourself. Only in the last few months
> have I got my scripts down so that not
> even a tmp file escapes my wrath (Yay
> for PERL).

That is a good point; sometimes the best way to do it is your own script. I am also fond of Perl for some tasks.


> Seems a bit gratuitous, really. But
> it works.

Hmm.. seems like a good way to archive, but remember archiving offline isn't always the right/full solution.. Depends on people's usage patterns I guess. :)


[»] Re: There are tools...
by Allan Fields - Sep 2nd 2001 06:58:20


> I had the same problem. Until I discovered
> that there are a lot of tools that can help.

I've been looking for tools, but even if I found a tool for each application (and an open one, say), it still doesn't solve the closure issue fully. You can get pretty close, though, by using all web based tools.


> has to find all those tools and select
> the best of
> them. The starting point for me was
> the desire to
> have one (or two) places in which my
> important
> stuff goes. Ideally a common interface
> for all this.

Yeah, it would be nice to have one interface for all of the tasks you mention. Also, can you post a small list of links to the packages that you have found? That might be helpful for everyone here trying to set up a repository. I have searched Freshmeat and SourceForge but haven't yet got a good idea of what exists and the extent of the work on these solutions. I know there are already a lot of commercial solutions for these types of things... I imagine most are for large business/project management/office problems though. I wonder if any exist for research work.


> And the only environment that is ready
> to deal with
> all sort of objects is the web.
> Therefore, my way of
> solving the problem is:
>
> - Use a perl, php enabled web server
> for your own
> computer
> - Use a Personal Information system
> (there are
> several out there, I use MyPhPPim)
> with a web
> interface, connected with a mysql
> database. In
> that database goes all your E-mail,
> notes, todo-s,
> etc.
> - Use a bookmark manager connected
> with the
> same Mysql server and with a web
> interface
> - Use a web file manager system (such
> as
> phpFileFarm) to work the pdf, html, ps
> files
> - Use a web photo album to keep your
> photos (of
> course with database back end)
> - Use a cvs system for ASCII work in
> progress and
> install a webcvs system (I use
> viewcvs).
> - Finally, use HtDig or another search
> engine to
> index the whole stuff. Configure
> htdig to search in
> separate directories or in all.
>
> use the same database engine (mysql or
> postgress
> or another) to minimize the load

I agree with trying to get everything into one DBMS at least, even if there isn't seamless integration. Even more ideal is to have a strong level of linkage between all the member DBs of the DBMS.
Also, it appears PostgreSQL and MySQL are a little behind Oracle in some of the object-over-relation framework features. Even nicer are the OO or object-relational ODBMSes like Cache, DB40, or (open source examples) GOODS and Gigabase.
On the DB access layer, another project that caught my eye was ColdStore (a persistence framework using a simple DB). And then there is J2EE for Java, which is something to look at for Java apps.

There are lots of things to look at, and there are many projects addressing specific sections of the problem...
---
Allan Fields


[»] Re: There are tools...
by Sorin Milutinovici - Sep 3rd 2001 07:33:20


> I've been looking for tools, but even
> if I found a tool for each application
> (and it was open say), it still doesn't
> solve the closure issue fully. You can
> get pretty close though by using all web
> based tools.


Yes, web based tools are probably the most complete ones. And, yes, sometimes you are just getting very close. But most of the time, since the web tools are (some of them at least) rather standard, you can adapt yourself to the tools.


> Also, can you post a small
> list of links to the packages that you
> have found? That might be helpful for
> everyone here trying to setup a
> repository.


I will post several links but, as usual, you should check for yourself. Open source projects especially are sometimes moving very fast. And let's hope that others will reply, adding some more.

About the repository: I use the common cvs system that can be found at www.cvshome.org. The cvs from there can be used from the command line; it has no graphical interface or web interface. But once you've set up a repository (or more), you can use several tools that are available:

CVSWeb
http://stud.fh-heilbronn.de/~zeller/cgi/cvsweb.cgi/ This is a single perl script that does not need a database. You need to have the repository set up, and that's it.

ViewCVS
http://freshmeat.net/projects/viewcvs/
The one I am using now. It is based on CVSWeb but is in Python. You can download tarballs from your repositories, and it has syntax highlighting for a lot of file types (based on enscript). It can be used with a MySQL database, but this is not compulsory. The database does not keep the repository, just information.

Chora
http://horde.org/chora/ I haven't personally tried this one, but I've seen some online repositories, and it matches the previous one pretty closely.

Freepository
www.freepository.com
The one I'll use in the future :) if I have time to move all my stuff from MySQL to PostgreSQL. It is a fully web-based tool: checkin, checkout, whatever. PostgreSQL backend.

For CVS documentation or tutorials: go to the ViewCVS site; there are several links.

In principle, my web site has to have:

A news system - something that grabs the news from slashdot, freshmeat, etc.
A Calendar
A bookmark manager
A place for notes
A place to put small articles that I find on the web
A photo gallery
An E-mail system
A file manager
CVS Interface
An interface to the computer administration tools
web ssh login

There are several tools that can do this. I will mention two (although I am sure that more - maybe better - can be found).

PhPGroupware - a multiuser groupware tool that has everything in the above list except the last two (as far as I know). The cvs interface is Chora, mentioned above. Very actively developed (if you go on Sourceforge, you will almost always see it in one of the first three places). www.phpgroupware.org

PhPNuke + several modules (News, Gallery, Calendar, etc). All can be found on the phpNuke site: www.phpnuke.org PhPNuke is a system for building news sites, but you can use it for all of the above list except the last 5.
I am now using PhPNuke. For other tasks:

E-mail system: There are a lot of webmail programs. If you have the mail delivered to your machine, then you can use Neomail (http://neomail.sourceforge.net/) or Openwebmail and much more.

Web based file managers: an interesting one is PhPFileFarm

Interface for administration: the best one seems to be WEBMIN (I am running Linux, you should check their page for other systems).
http://www.webmin.com/webmin/
Webmin has also a file manager and a ssh login shell, and much more.

Or you can use a combination of:

MyPhPIM http://sourceforge.net/projects/myphpim/ which has mail, calendar, todo, and addressbook
and other tools described above.


> Also, it appears PostgreSQL and MySQL
> are a little behind Oracle in some of
> the Object over Relation framework
> features. Even nicer is the OO or
> Object-Relation ODBMSes like Cache, DB40
> or (open source example) GOODS and
> Gigabase.
> On the DB access layer another project
> that caught my eye was ColdStore
> (persistence framework using simple DB).
> And then there is J2EE for Java which
> is something to look at for Java apps.


I am not very familiar with object oriented databases. Postgres has table inheritance, though. But for such a project (which is personal, therefore single user) a lightweight database seems the best choice. This is why I am not yet convinced to move my system from MySQL.


>
> There are lots of things to look at,
> and there are many projects adressing
> specific sections of the problem...


This is true. It would be more than nice to start a project for this - a personal web content manager. Oh, I mentioned HtDig. It is a search/indexing engine that can be found at:
www.htdig.org

>
> ---
> Allan Fields

--
Sorin M


[»] File naming rules
by Gavin Brown - Aug 30th 2001 10:36:47

More thoughts on file naming rules:

http://www.everything2.com/index.pl?node_id=530288


[»] Storing files
by Thomas Leonard - Aug 30th 2001 11:33:55

> When a file is being downloaded, the user is required to supply a filename and a path. I would really like it if authors of software (like Mozilla) gave us a commandline with file completion to do this. I find the GUI interaction that they force me to have extremely inefficient, and it costs so much time that when I'm in a hurry, I tend to misclassify an incoming file.

This is perhaps the biggest problem -- it's so easy to just dump a file in the default directory that people don't take a couple of seconds to put it somewhere sensible.

A solution? Get rid of the save dialog box and replace it with a draggable icon. To save, the icon is dragged to a filer window, directory on the panel, etc. Common save destinations (eg, the project you're currently working on) can then be kept handy along the bottom of the screen (or wherever). See here for an implementation of this system.

As any computer scientist knows, spending a little extra time storing your data can help a lot when it comes to retrieving it! BTW, I agree that an indexing agent should update as the filesystem is changed. The current massive-scan-once-a-day is slow and irritating.


[»] Re: Storing files
by Allan Fields - Aug 31st 2001 02:12:34

All good ideas; I think these types of UI innovations are what we all need!


[»] Re: Storing files
by Allan Fields - Aug 31st 2001 02:52:42


> A solution? Get rid of the save dialog


Actually, come to think of it, there's no reason to get rid of it; just implement another approach and allow them to be configured on or off.


[»] Re: Storing files
by Adam Glasgall - Sep 8th 2001 21:48:00

Didn't Acorn's RiscOS do this wrt saving stuff?


> A solution? Get rid of the save dialog
> box and replace it with a draggable
> icon. To save, the icon is dragged to a
> filer window, directory on the panel,
> etc. Common save destinations (eg, the
> project you're currently working on) can
> then be kept handy along the bottom of
> the screen (or whereever). See here for
> an implementation of this system.




[»] Re: Storing files
by Thomas Leonard - Sep 10th 2001 08:15:17


> Didn't Acorn's RiscOS do this wrt saving
> stuff?




Yep; my implementation looks very similar to it.


[»] Mail / FS
by belg4mit - Aug 30th 2001 13:39:11

This is exactly what MH/nmh is designed for:

"nmh consists of a collection of fairly simple single-purpose programs to send, receive, save, retrieve, and manipulate e-mail messages. Since
nmh is a suite rather than a single monolithic program, you may freely intersperse nmh commands with other commands at your shell prompt, or write custom scripts which use these commands in flexible ways."

http://www.mhost.com/nmh/

And if you must have a GUI there is xmh and exmh, or mh-rmail for emacs etc...
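
For instance, filing and finding mail is done with ordinary commands (the folder name below is made up):

	$ inc                       # incorporate new mail
	$ refile 3 +papers/var      # file message 3 into a folder (a plain directory)
	$ pick -search 'VaR'        # search message bodies in the current folder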


[»] Data and metadata
by Manuel Amador (Rudd-O) - Aug 30th 2001 18:40:28

The solution to these problems has been discussed on Tom's Hardware. The proper solution is to have a filesystem that stores metadata, such as ReiserFS, and a unified interface to it, such as OMS (an XML dialect and categorization/metadata standard for storing metadata).

Naturally, it would require operating system kernel support, application VFS support, and application front-end support, so it might as well be a herculean task. Whatever approach is used to solve the problem, it has to keep in mind that dumping the metadata while transferring files across the internet is unacceptable. MacOS had that solved with bundles. Why they dumped support for it in Mac OS X, I don't know.


[»] beginnings
by Zen Lunatics - Aug 30th 2001 21:15:04

I've been wanting a non-hierarchical organizational system for quite some time. My main reason for wanting this is to organize browser bookmarks that can belong to more than one category. So, I've written the beginnings of such a system which can be found at zenlunatics.com

It's currently somewhere around the alpha stage, and I haven't worked on it in a while. I haven't written a bookmark manager yet, but I did write an image viewer, an mp3 player, a simple note keeper, and a utility for creating catalogs from a file system. For the bookmark manager I'm thinking of modifying gnobog, galeon, or maybe mozilla (suggestions welcome). After that I'd like to tackle the file system, possibly with a document launcher, although I recently read about multi-session support which may solve that problem in a different way.

Anyway I'd really appreciate any comments on zl_catalog including suggestions for a better name :-)

thanks,
sean


[»] Re: beginnings
by Allan Fields - Aug 31st 2001 02:36:00

Hi. Looks good; I think you and all the other authors who have been working on these types of projects are heading in the right direction. We need to make sure we can bridge between all the apps, solutions, FS, transport mechanisms, etc. The library is definitely a great idea. Also, an exhaustive effort is probably required to rival the integration of some commercial environments where integration is a goal and part of the project. I have visions of what the filesystem should be like and how it should interface to the UI/shell. They are in many ways in agreement with Reiser and the original Macintosh vision, and with some aspects of Windows (although I am no Microsoft fan) -- and many different schools of thought! I definitely agree with the author of the originating post; he has got some great points!! Thanks to all who are working on a solution to this existent and persisting problem of computer science (which may have been solved already in some past era, if only we could revive the great software of the past!!! -- and which might already be solved in some expensive commercial package that I can't afford and wouldn't want to use because of the software model.)


[»] Dumb data and the file system
by Michael - Aug 31st 2001 07:29:34

I think this article is brilliant, and I also enjoyed the comments.

My perspective on the issues raised is this.

I think there is a fundamental legacy that is difficult to overcome - of course it could be overcome, but it is made very difficult by the commonly accepted underlying abstractions.

The UNIX operating system made a very effective design decision, "everything is a file", making the abstraction of "file" more useful than it had been in previous operating systems. This provided benefits similar to the benefits of closure in algebraic structures, the most obvious being the ability to use small programs in concert using pipes. There was a high level of consistency in the implementation, allowing greater productivity for users and allowing general utility programs to be very useful.

However, a "file" is just an abstraction. There are also downsides to treating everything as a file, the most obvious being that you need some intelligence about the data in the file to gain higher levels of usefulness.

The continued use of "files" as the dominant user interface abstraction for data storage - while at the same time loading more meaning into that data - has led to monolithic applications, monolithic file sizes, and increasingly complex file types.

So in the examples raised, we have "bookmarks", "mail messages", "pdf files", etc., which are different concepts managed by different applications. By creating different applications to deal with them, we have lost something, though: we have lost their similarities. It IS useful to think of them both at the level of "file" - a chunk of data - and at a more meaningful level - "bookmark".

Relational databases also had a brilliant idea - everything is a table - and also raised the usefulness bar in some contexts, for a whole bunch of reasons including greater description of what the data was.

However, I disagree with the idea of making a relational database interface to the underlying data storage the only way to access information. The data in relational databases is weakly typed. The abstraction is brilliant, but it continues to encourage the separation of the data from the meaning, meaning that you still lose generality; i.e., the data is still too dumb.

So, I think, there is a fundamental problem with the underlying abstraction "file" (specifically as the user interface), and I don't think progress is made by using the abstraction "table" or, for that matter, "XML file".

We do, however, have an abstraction that could serve as the basis for systems that could at least dodge some of the objections raised. Here goes...

"Everything is an object"

A system that was baised on the user interface to data being a persistent object store, I think would provide a foundation that more easily lead to the desired features.

With objects you have an explicit type heirarchy. giving you the ability to manipulate objects either at a high or low semantic level. giving you both general and specific tools.

An underlying object repository also gives you a way to deal with the legacy problem of files: you can easily wrap a file in an object if you happen not to have access to the underlying semantics.
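
To make that concrete, here is a rough sketch in Python of the kind of thing I mean (the class names and the division of responsibilities are my own invention, not any existing system's API):

    class StoredObject:
        """Base type: everything in the repository is at least a chunk of data."""
        def __init__(self, data):
            self.data = data
        def size(self):            # a low-level, fully generic operation
            return len(self.data)

    class OpaqueFile(StoredObject):
        """A legacy file wrapped as-is, when we have no access to its semantics."""
        pass

    class Bookmark(StoredObject):
        """A higher-level type: generic tools still see a StoredObject,
        while bookmark-aware tools can also use the URL and title."""
        def __init__(self, url, title):
            StoredObject.__init__(self, url.encode())
            self.url, self.title = url, title

    # A generic tool works on any object; a specific tool uses the subtype.
    objects = [OpaqueFile(b"raw bytes from an old file"),
               Bookmark("http://www.atheos.cx", "AtheOS")]
    print(sum(o.size() for o in objects))                         # generic
    print([o.title for o in objects if isinstance(o, Bookmark)])  # specific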

Going further, different views onto your object repository and different ways of locating the object(s) you want are required - but I think objects are a much better starting abstraction than files.

Just for clarity, I am talking about the abstraction presented to the user interface; I am not advocating anything in particular as the physical storage abstraction.

The (very interesting) ReiserFS documentation makes an argument for moving semantics into the filesystem implementation, which is another way of saying that there is no clean, general decomposition between the user interface abstractions and the data storage abstraction. But then, where ReiserFS is heading could be used as a persistent object store.

So, as an OO bigot: objects (and then a lot of hard work) ARE the panacea ;-)

- Michael.


[»] This problem is solved very neatly already. Has nobody noticed...?
by Alex Farrell - Aug 31st 2001 09:48:22

BeOS solves this problem very nicely, using something similar to the suggestion in the first post.

The filesystem (BFS) allows attributes (arbitrary data streams) to be attached to filesystem elements, and these can be indexed by the OS. Queries can be performed on the filesystem based on attributes, and the results are presented as a "directory".

This makes organizing files very easy.

For example, the ID3 tags of MP3 files can be stored as attributes attached to each file. All MP3 files are then stored in a single directory, and a query (pseudo-directory) is created which shows all files belonging to, for instance, the "Rock" genre. Another query shows all files by the Rolling Stones; another could show all tracks from the '80s, etc. A particular track might show up in one, two, or all of the queries, and this lets you get to what you want very quickly.
Feel like listening to the Stones? Just drag all files in the Stones directory (query) into your MP3 player. Feel like a rock evening? Also easy. Want to listen to all the old Stones stuff? Easy again; just create a query for all Stones music before the '70s.
Your queries can be stored for later use.

These queries are provided at a filesystem level, which means that all applications can use them transparently. They are also instant, since they are indexed.
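
BFS does this inside the filesystem, but the model itself is easy to illustrate in user space. A rough Python approximation (the attribute names and paths are invented for the example; this is not BeOS code):

    # Each file carries a dictionary of attributes, much as BFS attaches
    # attribute streams to files; a saved query is just a predicate over them.
    index = {
        "/music/satisfaction.mp3":   {"artist": "Rolling Stones", "genre": "Rock", "year": 1965},
        "/music/paint_it_black.mp3": {"artist": "Rolling Stones", "genre": "Rock", "year": 1966},
        "/music/blue_monday.mp3":    {"artist": "New Order", "genre": "Synthpop", "year": 1983},
    }

    def query(predicate):
        """Return the 'directory listing' produced by a query."""
        return sorted(p for p, attrs in index.items() if predicate(attrs))

    # Pseudo-directories, each defined by a predicate rather than a location:
    print(query(lambda a: a["genre"] == "Rock"))
    print(query(lambda a: a["artist"] == "Rolling Stones" and a["year"] < 1970))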

Problem solved.

Of course now that BeOS has been slain, it's probably not a good OS in which to invest your time. AtheOS ( http://www.atheos.cx ) promises similar facilities in the future, but it's not there yet.

Anyone feel like working on a cool new filesystem? Maybe you should contact the AtheOS author (I don't know him, so maybe he isn't interested in help, but maybe he is).


[»] Re: This problem is solved very neatly already. Has nobody noticed...?
by Julian Regel - Sep 16th 2001 15:27:19

To expand on the above: BeOS did this well because the facility was so integrated into the OS that all applications made use of it. Scot Hacker had a great article on BeOS filetypes at www.byte.com a while back that explained how apps such as the web browser would automatically fill in certain attributes for the user (such as the source URL, the date it was downloaded, the MIME type, etc.), and how MP3 rippers would populate the song title, artist, track length, etc. My understanding is that the issue of filesystem metadata has been discussed on the Linux kernel mailing list, and that Linus and co. are trying to work out a proper implementation.


[»] Filesystems, objects, databases and a command line interface...
by Jacob Sparre Andersen - Sep 1st 2001 04:52:35

Thanks for this inspiring article and comments. It got me thinking about how such an information retrieval system could be organised without dropping too many of the benefits people like about their Unix-like systems. There is not necessarily much new in the text below.

Information about files should be stored in a structured form (i.e., a "database"). This is the information I imagine is relevant to index (a sketch of such a database follows the list):

* author
* language
* title (filename?)
* keywords
* description
* file type
* encoding
* creation and modification times
* projects
* categories
* dependencies (A is constructed from B and C)
* relations (if you are interested in A, then you are also likely to want to read B)
* full text
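
A minimal sketch of such a database using SQLite from Python (the table and column names here are illustrative assumptions, not an existing schema):

    import sqlite3

    con = sqlite3.connect("file_index.db")
    con.executescript("""
    CREATE TABLE IF NOT EXISTS files (
        path     TEXT PRIMARY KEY,
        author   TEXT, language TEXT, title TEXT, description TEXT,
        filetype TEXT, encoding TEXT,
        created  TEXT, modified TEXT,
        fulltext TEXT
    );
    -- The one-to-many fields in the list get their own tables.
    CREATE TABLE IF NOT EXISTS keywords  (path TEXT, keyword TEXT);
    CREATE TABLE IF NOT EXISTS projects  (path TEXT, project TEXT);
    CREATE TABLE IF NOT EXISTS relations (path TEXT, related TEXT);
    """)
    con.execute("INSERT OR REPLACE INTO files (path, title, filetype) VALUES (?, ?, ?)",
                ("/home/jacob/Tux.png", "Tux", "image/png"))
    con.execute("INSERT INTO keywords VALUES (?, ?)", ("/home/jacob/Tux.png", "Linux"))

    # Query: every image tagged with the keyword 'Linux'.
    rows = con.execute("""SELECT f.path FROM files f
                          JOIN keywords k ON k.path = f.path
                          WHERE f.filetype LIKE 'image/%' AND k.keyword = 'Linux'""")
    print([r[0] for r in rows])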

"URL symlinks" should be available as an alternative to "bookmark files". And they should be treated as if they were file they refer to.

The indexing system should recognise file types and extract as much of the above-mentioned information from the files as possible.

"tar", "zip" and mailbox files - as well as other composite file types - should be indexed both as a whole and as the individual components.

If possible, the indexing system should receive information from the file system when files are created, modified, or removed. Failing that, the indexing system will have to scan the file system for changes on its own, as sketched below.
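
Absent change notifications from the filesystem, the fallback scan could look roughly like this (a sketch; `reindex` stands in for whatever the indexer actually does with a changed or removed file):

    import os

    def scan_for_changes(root, mtimes, reindex):
        """Walk `root` and re-index files whose mtime changed since the last
        scan; `mtimes` is the path -> mtime mapping remembered from that scan."""
        seen = {}
        for dirpath, dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    mtime = os.path.getmtime(path)
                except OSError:
                    continue                 # file vanished mid-scan
                seen[path] = mtime
                if mtimes.get(path) != mtime:
                    reindex(path)            # new or modified
        for path in set(mtimes) - set(seen):
            reindex(path)                    # removed since the last scan
        return seen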

There should definitely be some kind of shell/command line interface to the system. And it should of course include file name completion-like features.

One could also implement a virtual filesystem for formulating queries against the database. It would be great if I could do something like this in my favourite shell (below, <x> denotes pressing the "x" key, so < > is the space bar and </> is the slash key):

$ <x><v>< ></><tab>
[ the program `xv` can only read images ]
$ xv /.filetype/image/<.><c><a><tab>
$ xv /.filetype/image/.category/L<tab>
Choose:
L(i)nux
L(E)GO
$ xv /.filetype/image/.category/L<i><tab>
$ xv /.filetype/image/.category/Linux/T<tab>
$ xv /.filetype/image/.category/Linux/Tux.png

(not bad, I would say)

It would be nice if filesystems could store more of the basic information about files (for example: file type, encoding, language, author, keywords, and description).

/Jacob

PS: Why are the "UL", "OL", "LI" and "BLOCKQUOTE" elements banned from the HTML formatting? :-(


[»] Indexing PDFs
by Marco Schmidt - Sep 3rd 2001 23:10:58

I don't think having a bib file for every PDF is a good solution. PDF itself provides for metadata inclusion (pdflatex lets you set it, and I'm sure Adobe's own tools do as well), but unfortunately almost nobody uses this correctly. You can store the title, author, date, keywords, and probably more in a PDF. Command-line tools like pdfinfo (or Ctrl-D in Acrobat Reader) show this information. Additional files always get lost or fail to get updated, so store the information in the files themselves!
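
On the indexing side, pdfinfo's output is easy to consume; a small sketch in Python (assuming the pdfinfo tool is installed; "paper.pdf" is a hypothetical filename):

    import subprocess

    def pdf_metadata(path):
        """Parse pdfinfo's 'Key: value' lines into a dictionary."""
        out = subprocess.run(["pdfinfo", path],
                             capture_output=True, text=True).stdout
        meta = {}
        for line in out.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                meta[key.strip()] = value.strip()
        return meta

    info = pdf_metadata("paper.pdf")
    print(info.get("Title"), info.get("Author"), info.get("Keywords"))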


[»] Problems with superstrings
by Marco Schmidt - Sep 3rd 2001 23:18:23

The idea is good; there is just a minor problem with character encoding. I don't say it cannot be solved, but I have noticed that with LaTeX, special characters often get described in a purely visual way - e.g., a small 'a' with two dots on top of it. While this produces nice output, the information that this really is an umlaut 'a' (as in German Bär or Jägermeister) gets lost (it is not stored in the DVI or PDF file that results from the (pdf)latex run), so searching for anything with an ä in it no longer works.
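
One workaround at indexing time is to normalize the accent macros in the LaTeX source before it goes into the full-text index, so that a search for "Bär" also matches B\"ar. A minimal and deliberately incomplete sketch:

    # Map LaTeX umlaut notation to Unicode; a real indexer would need the
    # full set of accent macros. Braced forms are replaced first so that
    # \"{a} is not left half-converted.
    LATEX_TO_UNICODE = [
        (r'\"{a}', "ä"), (r'\"a', "ä"), (r'\"{o}', "ö"), (r'\"o', "ö"),
        (r'\"{u}', "ü"), (r'\"u', "ü"), (r'\"{A}', "Ä"), (r'\"A', "Ä"),
        (r'\"{O}', "Ö"), (r'\"O', "Ö"), (r'\"{U}', "Ü"), (r'\"U', "Ü"),
    ]

    def normalize_latex(text):
        for macro, char in LATEX_TO_UNICODE:
            text = text.replace(macro, char)
        return text

    print(normalize_latex(r'J\"agermeister'))  # -> Jägermeister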


[»] Another ERP-System
by Holger von Ameln - Sep 4th 2001 04:16:52

For those of you in need of a Linux-based ERP system, have a look at http://www.pentaprise.de.


[»] MacOS X "bundles"
by Ask Bjoern Hansen - Sep 6th 2001 07:19:43

In MacOS X we have "bundles": compound files which can be navigated like the directory structures they really are. In the MacOS X Finder they look and work like a single file, but after opening a shell you can easily see what's inside.


- ask


[»] See also: Story on OSNews
by Allan Fields - Sep 7th 2001 11:27:43

http://www.osnews.com/story.php?news_id=69

Are Linux metadata-enabled filesystems ready for production? It never hurts to try out something new on a test machine. I plan to look at and compare the various filesystems discussed in this article.

On another note, XML databases are very interesting indeed. Take a look at these handy resources: http://www.rpbourret.com/xml/XMLAndDatabases.htm and http://www.rpbourret.com/xml/XMLDatabaseProds.htm. These pages describe XML database solutions and discuss how XML and databases fit together. They also survey various XML databases; at this point there is a very large pool of XML database projects, both commercial and open source. Some examples of XML/OO database products: Prowler, Ozone, etc.
XML databases might be a key element of getting this to the user level in existing solutions. Also take a look at the section describing DTD schema translation, an 'object-relational mapping' method that maps document-centric XML files to object frameworks and then to a relational database backend, such as an SQL database like PostgreSQL.
Since I'm no XML expert, some of this is new to me.

Well, I hope someone still reads through this thread; it seems a bit dated by now.

---
Allan Fields


[»] Isn't Oracle IFS the answer (albeit an expensive one)
by Raul M. Jorja - Sep 19th 2001 01:26:32

I do not have any experience with Oracle iFS (Internet File System), but from what I have heard and read about it, it seems to be the answer.
It saves everything in an RDB (obviously Oracle), so you can add tags, metadata, etc. (it even handles XML), and then you can query it via HTTP, NFS, SMB, IMAP, etc. - any protocol!


[»] command line vs gui
by grom - Oct 3rd 2001 06:06:56

In regard to being able to call up a command-line interface when saving: why not integrate command-line functionality into the GUI? For example, why not have pathname completion in the filename text box, so that a user can use either the GUI or the command line? I feel that GUIs need to have the power of command lines, but I have yet to witness this. Another idea I have is a find-file dialog (like the one in Windows Explorer) that allows regular expressions. Why give up the command line for a GUI when we can have the command line built into the GUI?
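
The regular-expression find is easy to prototype outside of any GUI; a minimal Python sketch of what such a dialog would do under the hood (the pattern shown is just an example):

    import os, re

    def find_files(root, pattern):
        """Yield paths under `root` whose file names match the regex."""
        rx = re.compile(pattern)
        for dirpath, dirnames, filenames in os.walk(root):
            for name in filenames:
                if rx.search(name):
                    yield os.path.join(dirpath, name)

    # Example: every HTML file whose name mentions "invoice".
    for path in find_files(os.path.expanduser("~"), r"invoice.*\.html?$"):
        print(path)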

