Tuesday, April 18, 2006
Removing bad line breaks in text files
Have you ever opened a text file in vi and seen this character at the end of every line: ^M?
For example...
<?xml version = "1.0" encoding = "windows-1252"?>^M
<web-app>^M
<description>Empty web.xml file for Web Application</description>^M
<context-param>^M
<param-name>DBUrl</param-name>^M
<param-value>jdbc:oracle:thin:@ora:1521:ora</param-value>^M
</context-param>^M
<context-param>^M
In many cases these characters don’t cause a problem (a browser or an editor will usually cope with them), but they can cause problems for CSV or XML files, or anything else that needs to be parsed, because the file may not parse or validate correctly.
They occur because operating systems and programs don’t agree on how to end a line. DOS and Windows end each line with a carriage return followed by a line feed, while *nix uses a line feed alone; ^M is how vi displays a carriage return. You can even end up with a file where some lines have the carriage return and some don’t. The problem is most often encountered when moving files between editors on DOS/Windows and *nix. For example, if you use a text editor on your Windows machine, it may not produce line breaks the same way vi would produce them on a *nix server. I once heard that this can also happen when FTPing a file from your machine to a *nix server in ASCII mode, and that the safest approach is to zip up the files and FTP them as binary, but I couldn’t find anything to substantiate that.
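To check whether a file really contains these characters before changing it, a small shell sketch along the following lines may help. The function name count_cr_lines is made up for illustration, and the sketch assumes a POSIX-style shell with grep available. It counts the lines that end in a carriage return.

```shell
# Hypothetical helper (not from the original post): print how many lines
# of the given file end with a carriage return, i.e. would show ^M in vi.
count_cr_lines() {
    cr=$(printf '\r')          # a literal carriage-return character
    grep -c "${cr}\$" "$1"     # -c prints the number of matching lines
}
```

Running count_cr_lines web.xml on a file like the example above prints the number of lines ending in ^M. A result of 0 means there is nothing to remove.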
The solution in vi is to run the following substitution:
:%s/^M//g
***NOTE***: you have to enter the "^M" as "CTRL-V CTRL-M" and not as a caret followed by an M. The "^M" is a single control character (the carriage return), which is why it needs special handling. The % applies the substitution to every line in the file (without it, :s only changes the current line), so the command above removes all the ^M characters and replaces them with nothing.
If you don’t want to use vi, you can use the shell and Perl to remove these characters from the command line. Again, you can’t copy and paste these commands, because you need to type ^M as “CTRL-V CTRL-M” as described above.
# recurse a directory, removing ^M from all files
find . -type f -exec perl -pi -e "s/^M//g" {} \;
# remove ^M from a specific file
perl -pi -e "s/^M//g" somefile.txt
# remove ^M from a bunch of files
perl -pi -e "s/^M//g" */*.java *.txt
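If typing CTRL-V CTRL-M is awkward, one variant that can be pasted as-is is to let the tool interpret the \r escape itself. The sketch below is an alternative to the Perl commands above, not a replacement for them. strip_cr is a hypothetical function name, and the sketch assumes a shell with tr and mktemp available.

```shell
# Hypothetical helper (not from the original post): delete every carriage
# return from each file named on the command line, rewriting it in place.
strip_cr() {
    for cr_file in "$@"; do
        cr_tmp=$(mktemp) || return 1
        # tr understands the \r escape, so no literal ^M has to be typed.
        # Writing back with cat keeps the original file's permissions.
        tr -d '\r' < "$cr_file" > "$cr_tmp" && cat "$cr_tmp" > "$cr_file"
        rm -f "$cr_tmp"
    done
}
```

It can be called the same way as the Perl one-liners, for example strip_cr somefile.txt or strip_cr */*.java *.txt. Perl's regular expressions accept \r as well, so perl -pi -e 's/\r//g' somefile.txt should behave like the command typed with CTRL-V CTRL-M.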