A script that deletes spaces between letters in text

Question

This is a script related question, using bash with awk and/or sed with text recognition. So it might be off topic here.

I have a text document that has a load of text which has spaces between every letter!

Example:

T h e b o o k a l s o h a s a n a n a l y t i c a l p u r p o s e w h i c h i s m o r e i m p o r t a n t

Is there a way to get awk or sed to delete the spaces? I appreciate that this is probably too complex a problem for a simple bash script, since some sort of text recognition is also needed.

Grateful for any ideas as to how to approach this problem. Unfortunately this text document is massive and would take a very long time to manually go through it.

it is trivial to replace all spaces with nothing.. but I think you'd want to separate the words? — sp asic, 3 hours ago
for ex: echo 't h i s i s a n e x a m p l e' | sed 's/ //g' — sp asic, 3 hours ago
That doesn't limit the change to spaces between letters. (Digits and punctuation aren't letters, for instance). You can do this in sed with a loop. This also is probably a duplicate. — Thomas Dickey, 2 hours ago
An idea: add letters one at a time in a loop and check each candidate with something like look until you get a word, then move on to the next one. — Costas, 2 hours ago
to restrict only between letters: echo 'T h i s ; i s .a n 9 8 e x a m p l e' | perl -pe 's/[a-z]\K (?=[a-z])//ig' — sp asic, 2 hours ago
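
The sed loop mentioned in the comments above could look roughly like the sketch below. The function name join_letters is made up for illustration. A loop is used because a plain s///g cannot reuse a letter it has just matched, so 'a b c' would only become 'ab c'. Like the perl one-liner, this only removes the spaces; it does not restore word boundaries.

```shell
# join_letters is a hypothetical helper name, not something from the thread.
join_letters() {
    # ":a" defines a label. The substitution removes one space that sits
    # between two letters. "ta" jumps back to ":a" whenever a substitution
    # was made, so the line is rescanned until nothing is left to join.
    sed -e ':a' -e 's/\([[:alpha:]]\) \([[:alpha:]]\)/\1\2/' -e 'ta'
}

echo 'T h i s ; i s .a n 9 8 e x a m p l e' | join_letters
# prints: This ; is .an 9 8 example
```

Digits and punctuation are left alone because the pattern only matches a space with a letter on each side.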

choroba · Answer 1 · 2016-09-10 13:49:16Z

Perl to the rescue!

You need a dictionary, i.e. a file listing one word per line. On my system, it exists as /var/lib/dict/words; I've also seen similar files such as /usr/share/dict/british.

First, you remember all the words from the dictionary. Then, you read the input line by line, and try to add characters to a word. If it's possible, you remember the word and try to analyze the rest of the line. If you reach the end of the line, you output the line.

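#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

my $words = '/var/lib/dict/words';  # Dictionary file, one word per line.
my %word;                           # Lowercased dictionary words as keys.

# Try every way of splitting @$chars, from $pos onwards, into dictionary
# words. $words holds the words found so far. The fourth argument is
# accessed as $_[3], which aliases the caller's "found" flag.
sub analyze {
    my ($chars, $words, $pos) = @_;
    if ($pos == @$chars) {
        $_[3] = 1;  # Found.
        say "@$words";
        return
    }
    for my $to ($pos .. $#$chars) {
        my $try = join q(), @$chars[ $pos .. $to ];
        if (exists $word{$try}) {
            analyze($chars, [ @$words, $try ], $to + 1, $_[3]);
        }
    }
}

# Load the dictionary into the hash (only the keys matter).
open my $WORDS, '<', $words or die $!;
undef @word{ map { chomp; lc $_ } <$WORDS> };

while (<>) {
    # Drop all whitespace and lowercase the remaining characters.
    my @chars = map lc, /\S/g;
    analyze(\@chars, [], 0, my $found = 0);
    warn "Unknown: $_" unless $found;
}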

For your input, it generates 4092 possible readings on my system.
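
Since the question asked about awk, here is a rough sketch of the same backtracking idea as a shell function wrapping awk. It is not the Perl script above, only a re-expression of it under assumptions: the name segment, the function split_line and the tiny four-word dictionary are all invented for the demo.

```shell
# segment is a hypothetical helper: $1 is a dictionary file (one word per
# line); the spaced-out text is read from standard input.
segment() {
    awk '
        # Recursively split "rest" into dictionary words. "sofar" collects
        # the words found so far; i and w are local variables.
        function split_line(rest, sofar,    i, w) {
            if (rest == "") { print substr(sofar, 2); found = 1; return }
            for (i = 1; i <= length(rest); i++) {
                w = substr(rest, 1, i)
                if (w in dict) split_line(substr(rest, i + 1), sofar " " w)
            }
        }
        # First file: load the dictionary, lowercased.
        NR == FNR { dict[tolower($0)]; next }
        # Standard input: strip blanks, then print every possible reading.
        {
            line = $0
            gsub(/[ \t]/, "", line)
            found = 0
            split_line(tolower(line), "")
            if (!found) print "Unknown: " $0 > "/dev/stderr"
        }
    ' "$1" -
}

# Demo with a tiny made-up dictionary.
tmpdict=$(mktemp)
printf '%s\n' a cat log catalog > "$tmpdict"
echo 'a c a t a l o g' | segment "$tmpdict"
rm -f "$tmpdict"
# prints:
# a cat a log
# a catalog
```

Even with four words the input has two readings, which gives a feel for why a full system dictionary produces thousands of them.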

Fails the test with the spaced-out version of "a cat a log", i.e. a c a t a l o g — richard, 1 hour ago
@richard: OBOE, fixed. But it now generates too many possibilities; try removing one-letter words. — choroba, 1 hour ago
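
One way to act on the suggestion about one-letter words is to prune the dictionary before feeding it to the script. The sketch below is only an assumption about how that might be done: prune_dict is a made-up name, and keeping "a" and "I" is an extra choice of mine, not something stated in the thread.

```shell
# prune_dict is a hypothetical helper: $1 is a dictionary file with one
# word per line; the pruned list is written to standard output.
prune_dict() {
    # Keep every word of two or more characters. Of the one-letter
    # entries, keep only "a" and "i" in either case (an assumption).
    awk 'length($0) > 1 || tolower($0) == "a" || tolower($0) == "i"' "$1"
}

# Demo with a tiny made-up word list.
tmpwords=$(mktemp)
printf '%s\n' a b I cat z > "$tmpwords"
prune_dict "$tmpwords"
rm -f "$tmpwords"
# prints:
# a
# I
# cat
```

The pruned output could be saved to a file and used in place of the system dictionary path in the script.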
