Not sure if I'm going to hit a brick wall here, but figured I'd toss this question out in case someone else has needed to do this in the past.
I received a request to extract the document text from a number of historical documents. This extract will encompass several thousand documents when all is said and done. I ran a query that extracted the relevant information into a flat delimited file, including patient identifiers, document creation and signing data, and the DATA field from the DOCDATA table itself. I knew going in that the data was stored with the formatting information, and now hope that I can get it cleaned up for delivery.
For the life of me I can't figure out how to get the Data field formatted into plain text. I can't find static start and stop delimiters on the RTF code that doesn't leave outlying information (I even attempted using a regex expression: (?<=\{)[^}]+(?=\}) in np++ that managed to clean about half of the garbage, but still left a number of /par/blahblah/b0 type statements).
I've also tried the obvious solution of exporting to .rtf out of crystal to either copy/paste or convert to csv or txt with no luck.
I may just need something as simple as a smack in the head after staring at this all day today, but if anyone can throw any suggestions my way, I'm more than happy to try them out.
I took a DATA field from DOCDATA, saved it as an RTF file and passed the file to this Powershell function and the output TXT file looks ok to me.
https://raw.githubusercontent.com/Asnivor/PowerShell-Misc-Functions/master/translate-rtf-to-txt.ps1
Thank you for the reply.
I just tried that powershell script on a fresh extract of data to .rtf format and got this:
{ tf1ansiftnbj{fonttbl{f0 fswiss Arial;}}{colortbl ;
etc.
The script doesn't appear to have actively stripped any of the actual RTF code. Are you getting a different result? Does your docdata.data field contain similar input, or is it clean? I feel like I'm just missing a step somewhere.
Some new code - ran on a standard CPS Demo db.
Output is a gridview with rtf and plaintext side-by-sde
==============================
$Server = 'localhost'; $DB = 'demo'
$SQL = @"
select ddid, sdid, data from docdata order by ddid, sdid
"@
function Get-CPSData
{
[CmdletBinding()]
param (
[string]$Server,
[string]$Database,
[string]$selectcommand
)
$connectionString = "Server=$Server;Database=$Database;Trusted_Connection=yes;"
$CPSDBData = Get-DatabaseData -connectionstring $connectionstring -query $selectcommand
Return $CPSDBData
}
Function Get-DatabaseData
{
[CmdletBinding()]
param (
[string]$connectionString,
[string]$query
)
$connection = New-Object System.Data.SqlClient.SqlConnection
$connection.ConnectionString = $connectionString
$command = $connection.CreateCommand()
$command.CommandText = $query
$table = New-Object System.Data.DataTable
$adapter = New-Object System.Data.SqlClient.SqlDataAdapter $command
$adapter.Fill($table)
$connection.Close()
return $Table
}
$rtBox = New-Object System.Windows.Forms.RichTextBox
$list = @()
$Rows = Get-CPSData $Server $DB $SQL
$ErrorActionPreference = 'SilentlyContinue'
[int]$i = 0
foreach ($row in $rows)
{
Try
{
If (($row.data).length -gt 4)
{
$rtBox.Rtf = $row.data
$item = [psCustomobject]@{
Seq = $i++
DDID = "{0:F0}" -f $($row.ddid)
SDID = "{0:F0}" -f $($row.sdid)
RTFData = $rtBox.RTF
Plaintext = $rtBox.Text
}
$list += $item
}
}
Catch
{ }
}
$list | Out-GridView