This is a scripting question about using bash with awk and/or sed, plus some form of text recognition, so it might be off topic here.

I have a text document containing a lot of text with a space between every letter!

Example:

T h e b o o k a l s o h a s a n a n a l y t i c a l p u r p o s e w h i c h i s m o r e i m p o r t a n t

Is there a way to get awk or sed to delete the spaces? I appreciate that this is probably too complex a problem for a simple bash script, as some sort of text recognition is needed as well.

I'd be grateful for any ideas on how to approach this problem. Unfortunately, the document is massive, and going through it manually would take a very long time.

    
it is trivial to replace all spaces with nothing.. but I think you'd want to separate the words? – sp asic 3 hours ago
    
For example: echo 't h i s i s a n e x a m p l e' | sed 's/ //g' – sp asic 3 hours ago
    
That doesn't limit the change to spaces between letters. (Digits and punctuation aren't letters, for instance). You can do this in sed with a loop. This also is probably a duplicate. – Thomas Dickey 2 hours ago
    
An idea: add letters one at a time in a loop, checking with something like look until you get a word, then move on to the next one. – Costas 2 hours ago
    
to restrict only between letters: echo 'T h i s ; i s .a n 9 8 e x a m p l e' | perl -pe 's/[a-z]\K (?=[a-z])//ig' – sp asic 2 hours ago
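
A minimal sketch of the sed loop mentioned in the comments above, assuming a POSIX-style sed. The function name despace is hypothetical. A single s///g pass is not enough here, because matches cannot overlap and every other space would survive, so the loop repeats the substitution until nothing changes:

```shell
#!/bin/sh
# despace (hypothetical name): delete a single space only when it sits
# between two letters; digits and punctuation are left alone.
despace() {
    # :a defines a label; "ta" jumps back to it whenever the last
    # substitution changed something, so overlapping pairs get handled.
    sed -e ':a' -e 's/\([[:alpha:]]\) \([[:alpha:]]\)/\1\2/g' -e 'ta'
}

echo 'T h i s ; i s .a n 9 8 e x a m p l e' | despace
# prints: This ; is .an 9 8 example
```

Like the other one-liners, this joins all the letters into one run; it does not recover the word boundaries.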

Perl to the rescue!

You need a dictionary, i.e. a file listing one word per line. On my system, it exists as /var/lib/dict/words; I've also seen similar files such as /usr/share/dict/british.

First, load all the words from the dictionary into memory. Then read the input line by line and try to build a word by adding one character at a time. Whenever the characters so far form a dictionary word, remember it and analyze the rest of the line the same way. When you reach the end of the line, output that reading.

#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

my $words = '/var/lib/dict/words';
my %word;

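# Recursively split @$chars, starting at $pos, into dictionary words.
# $words holds the words found so far; the 4th argument is aliased
# through @_, so the caller's flag is set once a full reading is found.
sub analyze {
    my ($chars, $words, $pos) = @_;
    if ($pos == @$chars) {
        $_[3] = 1;  # Found.
        say "@$words";
        return
    }
    for my $to ($pos .. $#$chars) {
        my $try = join q(), @$chars[ $pos .. $to ];
        if (exists $word{$try}) {
            analyze($chars, [ @$words, $try ], $to + 1, $_[3]);
        }
    }
}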


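# Load the dictionary: each lowercased word becomes a key of %word.
open my $WORDS, '<', $words or die $!;
undef @word{ map { chomp; lc $_ } <$WORDS> };

# Collect the non-whitespace characters of each line, lowercased,
# and print every reading the dictionary allows.
while (<>) {
    my @chars = map lc, /\S/g;
    analyze(\@chars, [], 0, my $found = 0);
    warn "Unknown: $_" unless $found;
}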

For your input, it generates 4092 possible readings on my system.

    
Fails a test with the spaced-out version of "a cat a log", i.e. a c a t a l o g – richard 1 hour ago
    
@richard: OBOE, fixed. But it now generates too many possibilities; try removing one-letter words. – choroba 1 hour ago
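
One possible way to act on the suggestion in the last comment is to filter the dictionary before the script reads it. This is only a sketch: the function name filter_words and the output file name are hypothetical, and keeping "a" and "I" as the only one-letter words is an assumption, not something the answer specifies.

```shell
#!/bin/sh
# filter_words (hypothetical name): read one word per line on stdin and
# keep words of two or more characters, plus the one-letter words
# "a" and "I" (assumption: these are the only one-letter words wanted).
filter_words() {
    grep -E -i '^(..+|a|i)$'
}

printf '%s\n' a b I cat t log | filter_words
# prints a, I, cat and log, one per line

# Possible usage (output file name is an assumption):
#   filter_words < /var/lib/dict/words > words.filtered
```

Pointing $words in the script at the filtered file should then cut down the number of readings, though how far depends on the dictionary.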
