Subject : [lug] Digest (6 messages)

From : lug-owner@lists.ncsu.[redacted]

Date : Fri, 12 Sep 2014 13:02:20 -0400


The Lug Digest
Volume 1 : Issue 334 : "text" Format

Messages in this Issue:
201409/4 : [TriLUG-announce] Reminder: TriLUG Lightning Talks - Thurs 11 September
    Bill Farrow <bill@arrowsreach.[redacted]>
201409/5 : Re: OCR Help
    Brian Cottingham <spiffytech@gmail.[redacted]>
201409/6 : Re: OCR Help
    Sean Mealin <spmealin@ncsu.[redacted]>
201409/7 : Re: OCR Help
    Brian Fields <blfields@ncsu.[redacted]>
201409/9 : Re: OCR Help
    Brian Pike <bapike@gmail.[redacted]>
201409/8 : [TriLUG-announce] Update: TriLUG Meeting Tonight (Sept 11) - Room Change
    Bill Farrow <bill@arrowsreach.[redacted]>

----------------------------------------------------------------------

Date: Wed, 10 Sep 2014 14:32:27 -0400
From: Bill Farrow <bill@arrowsreach.[redacted]>
To: TriLUG Announce <trilug-announce@trilug.[redacted]>,
    TriLUG General List <trilug@trilug.[redacted]>
Subject: [TriLUG-announce] Reminder: TriLUG Lightning Talks - Thurs 11 September
Message-ID: <CAPm8Nr0XV2Mf2N4=GW-k=v+RkNr0jOhdJscbg0YXO_oJui7vJw@mail.gmail.[redacted]>

Reminder: The TriLUG meeting is tomorrow night !!

Topic: Lightning Talks
When: Thursday, 11th September 2014, 7pm (pizza from 6:45pm)
Where: NC State Engineering Building II Room 1021, Centennial Campus
Parking: The parking decks and Oval Drive street parking are free after 5pm
Website: http://trilug.org/2014-09-11/lightning

It's not too late to propose a talk... just fill out the "Sign Up
Form" on the webpage.

Sponsor
=======
This meeting is sponsored by Broadband.com - please thank them at the meeting.

Lightning Talks
===============
1. Fedora.next - by John Dulaney
2. How to host your own dropbox/gcal/gcontacts! - by Sebastian
3. LilyPond - Music Typesetting on Linux - by Scott Miller
4. Sketching computer network diagrams on a computer - by Stanley Karunditu
5. Stupid pinging web app again - by William Chandler
6. Open Source Status Updates - by Lenore Ramm
7. Introduction to FIO - by Dwain Sims
8. Password Management with KeePassX - by Michael Hrivnak




Bill
--
This message was sent to: lug@lists.ncsu.[redacted] <lug@lists.ncsu.[redacted]>
To unsubscribe, send a blank message to trilug-announce-leave@trilug.[redacted] from that address.
TriLUG-announce mailing list : http://www.trilug.org/mailman/listinfo/trilug-announce
Unsubscribe or edit options on the web : http://www.trilug.org/mailman/options/trilug-announce/lug%40lists.ncsu.edu
TriLUG is dedicated to a harassment-free experience for everyone. Our anti-harassment policy can be found at: http://trilug.org/anti-harassment

------------------------------

Date: Wed, 10 Sep 2014 14:48:21 -0400
From: Brian Cottingham <spiffytech@gmail.[redacted]>
To: NCSU LUG <lug@lists.ncsu.[redacted]>
Subject: Re: OCR Help
Message-ID: <CAJEMKXMwKxUusMAj1WA2mNxpU4tibXY9RpT0Sw74MOBDJFyTzw@mail.gmail.[redacted]>

OCR has a hard enough time getting English right - I would be surprised if
an OCR program could read math equations reliably enough for you to do
homework, especially as you get further into matrix stuff with linear
algebra. You may have better luck finding some sort of a transcription
service (I'm sure there are pro services, but I'd also be curious to know
whether Amazon Mechanical Turk could handle this cheaply), or getting your
professor to send you the original files for their handouts.

On Mon, Sep 8, 2014 at 5:39 PM, Jeffery Mewtamer <mewtamer@gmail.[redacted]> wrote:

> Good Evening,
>
> I am a blind Linux user. As that relates to this message, it means I
> have to do most things from the command line and I often have to
> convert documents to plain text. For most formats I deal with on a
> regular basis, I've found ways to extract the text that I can use
> despite my disability, but I am having trouble working with documents
> whose content is largely image-based. Poppler-utils's pdfimages
> command makes extracting images from PDF files easy, but I haven't found a
> good way of extracting images from other document formats. However,
> the more pressing issue at the moment is extracting text from images.
>
> I've been using cuneiform to perform Optical Character Recognition on
> images, and while it works well enough on images with plain English
> text, it tends to produce gibberish when processing images that
> contain mathematical formulas and other math-related things, and
> handouts and homework for the math classes I'm taking seem to be the
> most common reason for me needing to do OCR.
>
> Attached are several .pbm images I need OCR performed on along with
> the output I got from cuneiform after cleaning it up a bit to
> illustrate how poorly it's meeting my OCR needs. I've also attached
> the bash script I use for processing many images at once. I've also
> tried Tesseract, but its more complicated command line format makes it
> harder to use and my tests seemed to indicate that cuneiform tended to
> have better results for less effort.
>
> Any suggestions for a command-line OCR program that could do a better
> job at making usable text files from pages of scanned mathematics
> would be greatly appreciated.
>



------------------------------

Date: Wed, 10 Sep 2014 15:41:39 -0400
From: Sean Mealin <spmealin@ncsu.[redacted]>
To: lug@lists.ncsu.[redacted]
Subject: Re: OCR Help
Message-ID: <CAF_3r-Fj1f8DG=db5_EpKOzbFiBr-SXgcC8QkPJ=1iki622-Tg@mail.gmail.[redacted]>

Hi,

Yes, OCR does not handle mathematical or scientific material very
well. Most engines are tuned to use standard grammatical constructs
to assist in the recognition, which tends to do very strange things to
equations.

I recommend that you speak to your professor, and attempt to get the
source material that the handouts are made from. Quite often
professors use LaTeX to generate the content, which you can pull the
info from since it is a text-based format.

In general, you want to have a good line of communication open with
your professors, since they will sometimes need to do things slightly
differently from what they are accustomed to in order to make the
information accessible to you. You will find that this is doubly
important as you progress through the more advanced technical courses.

Finally, don't be afraid to use the Disability Service Office (DSO)
when pursuing your goals, but also don't completely depend on them.
It is important to remember that you have the legal right to get
information in whatever format best works for you, so you shouldn't be
afraid to speak up. Sometimes it can be quite the battle to get
things into place, so you have to be passionate and vocal about what
you want. The DSO can give you general tips, but only you know what
works best for you when it comes to your classes.

As to your original question before I started ranting, unfortunately
as far as I am aware, there are not any freely available programs that
can handle technical content when doing OCR.

Sean



--
Sean Mealin
President - Computer Science Graduate Student Association
spmealin@ncsu.[redacted]
(XXX) 772-2507
http://www4.ncsu.edu/~spmealin/

------------------------------

Date: Thu, 11 Sep 2014 07:15:17 -0400
From: Brian Fields <blfields@ncsu.[redacted]>
To: lug@lists.ncsu.[redacted]
Subject: Re: OCR Help
Message-ID: <CAL8M5DRoBxy7-USUWBPQwCS4HQWwHKKpSnm025qK2TMxViNOgA@mail.gmail.[redacted]>

The only thing I found was a project called Infty, but it still looks
pretty experimental.
http://www.inftyproject.org/en/index.html



Brian

- Brian Fields
Systems Specialist
NC State University, CALS Information Technology
brian_fields@ncsu.[redacted]

All electronic mail messages in connection with State business which
are sent to or received by this account are subject to the NC Public
Records Law and may be disclosed to third parties.


------------------------------

Date: Fri, 12 Sep 2014 13:02:16 -0400
From: Brian Pike <bapike@gmail.[redacted]>
To: lug@lists.ncsu.[redacted]
Subject: Re: OCR Help
Message-ID: <CAFpBnz081jantkZ8yma6_OZJ1JkEh5SDJ3YkMWVaXtdsVhuK_w@mail.gmail.[redacted]>

Hi Jeffery,
Sean's right that most math instructors will use LaTeX to write
assignments and exams. Unfortunately, I think the documents you sent
look like someone used Microsoft Word with the Microsoft Equation
Editor (or something similar) to write the documents, printed them
out, drew in some square brackets around matrices, and then scanned
them in. That would be challenging to OCR. Even math journals
haven't figured out how to OCR old issues accurately.

Also, having graded linear algebra, I can say that a small change in a
linear algebra problem may produce a dramatically different solution,
and that's hard to grade fairly. I suggest that you use the DSO, or
use OCR and then have someone (say, your TA or a friend) double-check
the result. It would take your TA less time to check all the numbers
before you start working than to check all of your work in a
wrongly-OCRed problem.

Brian

------------------------------

Date: Thu, 11 Sep 2014 08:57:04 -0400
From: Bill Farrow <bill@arrowsreach.[redacted]>
To: TriLUG Announce <trilug-announce@trilug.[redacted]>,
    TriLUG General List <trilug@trilug.[redacted]>
Subject: [TriLUG-announce] Update: TriLUG Meeting Tonight (Sept 11) - Room Change
Message-ID: <CAPm8Nr38p=siSHDD1dh7cGBQQ-cuNEDLYGOt2P+p--uek9w8sA@mail.gmail.[redacted]>

Update: Last minute change of location.

The meeting has been moved to NCSU Eng Building 1 Room 1007. There
will be signs posted at the original meeting site to let people know.
Engineering Building 1 is to your right (west) as you walk through the
Eng Bld 2 archway. Refer to the map on the meeting webpage.

http://trilug.org/2014-09-11/lightning

Pizza and drinks will start at 6:45pm, thanks to tonight's sponsor,
Broadband.com.

Bill


------------------------------

End of [lug] Digest (6 messages)
**********