[nSLUG] Removing spaces from file names - note, long lines

Vlado Keselj vlado at dnlp.ca
Wed Nov 12 23:03:59 AST 2014


Hi Hatem,

On Thu, 13 Nov 2014, Hatem Nassrat wrote:

> Hi Vlado,
> On Wed Nov 12 2014 at 10:32:22 PM Vlado Keselj <vlado at dnlp.ca> wrote:
> 
>       This one-liner does not do the same operation.  My main goal was
>       to turn one or any chosen filenames that look like:
> 
>       Joe, Jack & Jane's document (version # 1) [draft].doc
> 
>       into something like:
> 
>       Joe--Jack_and_Jane-s_document_-version_=23_1-_-draft-_.doc
> 
> 
> Saw your perl, honestly didn't mentally fully compile it, but it seems 
> like it would do the transformation as described. What is the reasoning 
> / requirements for the use of this kind of transformation, is it so you 
> can revert back to the original name?

Not really, but kind of close :)  If I wanted a completely safe filename 
that can be reverted back to the original name, I would use something 
like:
http://web.cs.dal.ca/~vlado/srcperl/snip/encode_w 
and then to decode:
http://web.cs.dal.ca/~vlado/srcperl/snip/decode_w 
or, directly as Perl regexes:
s/[\Wx]/'x'.uc unpack("H2",$&)/ge;
and
s/x([0-9A-Fa-f][0-9A-Fa-f])/pack("c",hex($1))/ge;
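
A minimal round-trip sketch of the two regexes above.  The wrapper names 
encode_w and decode_w are borrowed from the linked snippets for 
illustration only; the bodies here are just the two substitutions, not 
copies of those files.  One deliberate change: the decoder uses pack "C" 
(unsigned) rather than "c", to avoid a warning for byte values above 127.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical wrappers around the two regexes from this message.
sub encode_w {
    my ($s) = @_;
    # Every non-word character, and every literal "x", becomes xHH.
    $s =~ s/[\Wx]/'x'.uc unpack("H2",$&)/ge;
    return $s;
}

sub decode_w {
    my ($s) = @_;
    # Assumption: "C" (unsigned) swapped in for the original "c".
    $s =~ s/x([0-9A-Fa-f][0-9A-Fa-f])/pack("C",hex($1))/ge;
    return $s;
}

my $name    = "my file (x).txt";
my $encoded = encode_w($name);
my $decoded = decode_w($encoded);
print "$encoded\n";   # myx20filex20x28x78x29x2Etx78t
print "$decoded\n";   # my file (x).txt
```

Because every literal "x" in the input is itself escaped, any "x" left in 
the encoded name is an escape prefix, which is what makes the decoding 
unambiguous.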

With the script, I tried to capture what I normally do to an unsafe 
filename to keep it still readable, and to try to keep most of the 
information intact.  This is why it is more complicated than other 
solutions: it is trying to emulate what seems a reasonable and readable 
transformation.  I changed it over time iteratively.  It evolved as I saw 
new examples of how people name files.  Here are some explanations:

s/ +- +/-/g;

Many people will do something like "file - version 1".  In my opinion,
file_-_version_1 is ugly, so the above regex will turn it into
file-version_1, which looks better.

s/''+/--/g;
s/'/-/g;

The above is based on experience.  A single quote becomes -, but a 
repeated single quote (and it happens) becomes --.

s/[[(<{]/_-/g; s/[])>}]/-_/g;

How to replace something like "Report(draft) version 1"?  I decided that
Report_-draft-_version_1 is what I am looking for.
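
The two character classes are a little dense, so here is a small sketch 
of what they match.  The sample name is made up for illustration.  In a 
Perl character class, a "[" needs no escaping, and a "]" placed first in 
the class is taken literally.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample containing all four bracket styles.
my $name = "Report(draft) <v1> {old} [x]";

$name =~ s/[[(<{]/_-/g;   # any opening bracket: [ ( < {
$name =~ s/[])>}]/-_/g;   # any closing bracket: ] ) > }

print "$name\n";   # Report_-draft-_ _-v1-_ _-old-_ _-x-_
```

The spaces are still there at this point; they are handled by the later 
rules.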

s/[,:;]\s*/--/g;

Punctuation (,;:), together with any whitespace after it, is replaced 
with --.  Again, it is a separator, so it seems logical.

s/&/and/g; s/ /_/g;

Many style guidelines state anyway that we should avoid & and use the word 
"and".  And of course, as many mentioned, the best replacement for space is 
a _.

s/__+/_/g; s/---+/--/g;

This is a kind of postprocessing.  The first one collapses repeated 
underscores (which may originally have been spaces), and the second one 
reduces runs of three or more hyphens to --.

s/\xE2\x80\x99/-/g; # Single right quote

I am not sure where these are coming from.  It may be Unicode, but I 
found that these codes are actually a single quote and I wanted to take 
care of it.
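
For what it is worth, the three bytes E2 80 99 are the UTF-8 encoding of 
U+2019 (RIGHT SINGLE QUOTATION MARK), the "curly" apostrophe that many 
word processors insert automatically.  A small sketch to check this with 
the core Encode module; the sample file name is made up:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Encode U+2019 as UTF-8 and show the resulting bytes in hex.
my $bytes = encode("UTF-8", "\x{2019}");
my $hex   = join " ", map { sprintf "%02X", ord($_) } split //, $bytes;
print "$hex\n";     # E2 80 99

# Made-up file name holding the raw UTF-8 bytes of a curly apostrophe.
my $name = "Jane\xE2\x80\x99s notes.txt";
(my $fixed = $name) =~ s/\xE2\x80\x99/-/g;
print "$fixed\n";   # Jane-s notes.txt
```

This assumes the file name is handled as raw bytes, which is what the 
regex above expects.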

s/(=|[^\w.-])/"=".uc unpack("H2",$1)/ge;

Finally, I do not want to lose any information.  If someone sends me some 
new characters in a filename tomorrow, this will reveal their hex codes, 
so I can continue evolving the script. :)
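
Putting the pieces together, here is a sketch that applies the rules in 
the order they are explained in this message.  The function name 
safe_name is hypothetical, and the actual script may order or wrap the 
rules differently.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# safe_name is a hypothetical name; the rules are the ones quoted in
# this message, applied in the order they are explained.
sub safe_name {
    my ($n) = @_;
    $n =~ s/ +- +/-/g;           # "file - version" -> "file-version"
    $n =~ s/''+/--/g;            # repeated single quotes
    $n =~ s/'/-/g;               # lone single quote
    $n =~ s/[[(<{]/_-/g;         # opening brackets
    $n =~ s/[])>}]/-_/g;         # closing brackets
    $n =~ s/[,:;]\s*/--/g;       # separators plus trailing whitespace
    $n =~ s/&/and/g;
    $n =~ s/ /_/g;
    $n =~ s/__+/_/g;             # collapse repeated underscores
    $n =~ s/---+/--/g;           # three or more hyphens -> two
    $n =~ s/\xE2\x80\x99/-/g;    # UTF-8 bytes of a right single quote
    # Anything still unsafe (and "=" itself) becomes =HH.
    $n =~ s/(=|[^\w.-])/"=".uc unpack("H2",$1)/ge;
    return $n;
}

print safe_name("Joe, Jack & Jane's document (version # 1) [draft].doc"), "\n";
# Joe--Jack_and_Jane-s_document_-version_=23_1-_-draft-_.doc
```

The "#" is not covered by any specific rule, so it falls through to the 
last substitution and shows up as =23.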

Regards,
Vlado

