I.T. Spices The LINUX Way

Python In The Shell: The STEEMIT Ecosystem – Post #117

SCRAPING ALL BLOGS USING PYTHON – THE POST BODY

Please refer to Post #110 for the complete python script and the intro of this series, link below:
https://steemit.com/blockchain/@lightingmacsteem/2rydxz-i-t-spices-the-linux-way

In this long post we will discuss how we arrive at the BODY of the blog post. The body contains the TEXT, the VIDEO link/s and the IMAGE link/s, each only where applicable.

Lines 109 to 147 are laid out below; take note of the VIDEOS, IMAGES and TEXT PARAGRAPHS portions:

109
110         #BODY
111         print('\nBODY:')
112         flogs.write('\nBODY:')
113         par = soup.findAll('p')
114         vid = soup.findAll('div', {'class':'videoWrapper youtube'})
115         img = soup.findAll('img')
116         #FOR YOUTUBE VIDEOS
117         if vid != []:
118             print('\n' + '   VIDEO link/s:')
119             flogs.write('\n' + '   VIDEO link/s:')
120             for v in vid:
121                 jpglink = re.search('\(.+?\)', str(v))
122                 imglink = jpglink.string[jpglink.start():jpglink.end()]
123                 vidlink = os.popen('echo "' + str(imglink) + '" | cut --delimiter=/ -f5').read()
124                 print('       https://www.youtube.com/watch?v=' + str(vidlink))
125                 flogs.write('\n       https://www.youtube.com/watch?v=' + str(vidlink))
126         #FOR IMAGES
127         if img != []:
128             print('\n' + '   IMAGE link/s:')
129             flogs.write('\n' + '   IMAGE link/s:')
130             for i in img:
131                 imagelink = re.search('\".+?\"', str(i))
132                 imageurl = imagelink.string[imagelink.start():imagelink.end()]
133                 print('       ' + str(imageurl))
134                 flogs.write('\n       ' + str(imageurl))
135         #PARAGRAPHS TEXT
136         if par != []:
137             print('\n' + '   TEXT Paragraphs:')
138             flogs.write('\n' + '   TEXT Paragraphs:')
139             for sss in par:
140                 body = (sss.text).replace('\n', '\n    ')
141                 print('       ' + body)
142                 flogs.write('\n       ' + body)
143             with open('/dev/shm/steemblogs/samplesoup', 'a+') as f:
145                 f.write('\n##############################################################################################################################\n')
146                 f.write(soup.prettify())
147                 f.write('\n##############################################################################################################################\n')



Lines 111 to 112 print which part the python program is currently processing, this time the BODY.

Lines 113 to 115 declare the variables for the three parts of a BLOG post: the TEXT PARAGRAPHS (par), the VIDEO links (vid) and the IMAGE links (img). Each line uses the BeautifulSoup module to search the page for the matching tags: p tags for paragraphs, div tags with the class "videoWrapper youtube" for videos, and img tags for images.

Lines 116 to 125 now zoom in on the VIDEOS, starting with an IF statement so that the block only runs when video links were found; of course there is no need to waste our time on something that returns zero results.

Further to the VIDEO portion, take note of line 121. This is a python line that uses the re module to look for a string of text inside the () characters; if one is found, line 122 keeps it for further processing. The VIDEO id sits inside that string, disguised as part of an image file link. Line 123 cuts the id out (it is the fifth field when the link is split on the / character), and lines 124 to 125 reconstruct the full URL of the video by combining the fixed characters “https://www.youtube.com/watch?v=” with the captured id.
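The steps at lines 121 to 124 can be sketched in pure python as below, with a string split standing in for the echo and cut shell commands of line 123. This is only a sketch: the sample div markup, the video id and the function name youtube_url_from_div are made up for illustration, and the real markup on a page may differ.

```python
import re

# Hypothetical example of what str(v) might look like for one video wrapper.
sample_div = ('<div class="videoWrapper youtube" '
              'style="background-image:url(https://img.youtube.com/vi/abc123XYZ_-/0.jpg);"></div>')

def youtube_url_from_div(div_html):
    """Rebuild a YouTube watch URL from the thumbnail link found inside ()."""
    # Same pattern as line 121: the first run of text enclosed in parentheses.
    match = re.search(r'\(.+?\)', div_html)
    if match is None:
        return None  # no parenthesised text, so nothing to extract
    thumb = match.group(0)
    # Equivalent of `cut --delimiter=/ -f5`: the fifth field when split on "/".
    parts = thumb.split('/')
    if len(parts) < 5:
        return None
    return 'https://www.youtube.com/watch?v=' + parts[4]

video_url = youtube_url_from_div(sample_div)
print(video_url)
```

Doing the split inside python avoids starting a shell for every video, and it returns the id without the trailing newline that reading the output of cut would carry.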

Lines 126 to 134 deal with the IMAGE links contained in the BODY, of course if applicable. The IF statement at line 127 sees to it that we only spend time on images if the img variable found a match, which means a positive result.

Further to the IMAGE portion, take note of line 131, where the python re module searches for a text string inside the “” characters. If found, the first text string inside such double-quote characters (the quotes included) is kept by python at line 132 and treated as the image link.
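A small sketch of what the search at line 131 returns is shown below. The two sample img tags and the helper names first_quoted and src_value are made up for illustration. The point is that the pattern picks up the first quoted string in the tag, whichever attribute that is, so a pattern tied to src is shown beside it as one possible alternative.

```python
import re

# Hypothetical img tags, differing only in attribute order.
img_src_first = '<img src="https://example.com/pic.png" alt="a picture"/>'
img_alt_first = '<img alt="a picture" src="https://example.com/pic.png"/>'

def first_quoted(tag_html):
    """Same pattern as line 131: the first double-quoted string, quotes included."""
    m = re.search(r'".+?"', tag_html)
    return m.group(0) if m else None

def src_value(tag_html):
    """Alternative sketch: capture only the value of the src attribute."""
    m = re.search(r'src="(.+?)"', tag_html)
    return m.group(1) if m else None

print(first_quoted(img_src_first))
print(first_quoted(img_alt_first))
print(src_value(img_alt_first))
```

When src is the first attribute, both helpers find the link, and first_quoted keeps the surrounding quotes. When another attribute comes first, only src_value still returns the link.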

Lines 135 to 147 deal with the PARAGRAPHS OF TEXT that one writes in a blog post. And since not all blogs contain paragraphs of words, once again at line 136 we declare thru the IF statement that we only go further if there is a positive result. The last lines of this block, 143 to 147, append the prettified soup of the page to a sample file, fenced by rows of # characters.

Further to the PARAGRAPHS portion, take note that at line 139 we loop over the par variable, with each item named sss. The par variable was already declared earlier, at line 113, as the BeautifulSoup result for all the p tags. At line 140, the text inside each item is the words of one paragraph of the blog, and every newline in it is replaced by a newline plus four spaces so that the printed lines stay indented.
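The indentation done at lines 140 to 141 can be tried on a plain string, as sketched below. The function name indent_paragraph and the sample text are made up for illustration; only the replace call and the two indent widths come from the script.

```python
def indent_paragraph(text, first_indent='       ', cont_indent='    '):
    """Indent the first line, and re-indent every line after a newline."""
    # Line 140: each newline becomes a newline followed by four spaces.
    body = text.replace('\n', '\n' + cont_indent)
    # Line 141: the first line gets its own seven-space prefix.
    return first_indent + body

sample_text = 'first line of a paragraph\nsecond line of the same paragraph'
indented = indent_paragraph(sample_text)
print(indented)
```

With these widths the first line is indented by seven spaces and the following lines by four, exactly as the script writes them.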

Take note that if no data is found for any of these portions (VIDEO, IMAGE or PARAGRAPHS), the code is arranged so that the portion is not displayed at all, not even its title line.
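The same "skip it if empty" arrangement can be sketched as one small helper, shown below. The name write_section is made up for illustration, and an in-memory io.StringIO object stands in for the flogs log file of the script.

```python
import io

def write_section(log, title, items):
    """Write a titled section to log only when items is non-empty."""
    if not items:
        return False  # nothing found: skip the title line as well
    log.write('\n   ' + title)
    for item in items:
        log.write('\n       ' + item)
    return True

flogs = io.StringIO()  # stand-in for the real log file
write_section(flogs, 'VIDEO link/s:', [])
write_section(flogs, 'IMAGE link/s:', ['https://example.com/pic.png'])
output = flogs.getvalue()
print(output)
```

Here the empty VIDEO list leaves no trace in the log, while the IMAGE list is written together with its title line.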

Cleaner, better, much faster. I think that’s the best way to sort out Post Office letters of old, just my take.


“Destiny Is If One Finds A Letter For Him Among Thousands.”
