Monday, June 13, 2011

Sum weights

I have a file with a bunch of sequences and a line of weights at the top:

>WEIGHTS 0.926434 1.000000 1.000000 0.926434 1.000000 0.892712 1.000000 1.000000 1.000000 1.000000 1.000000 0.892712 
>CRTC_EUGGR__Q9ZNY3 Calreticulin precursor.
XRKELWXXXXXXXXXXXXXXXXXXXXXXXXXTRWTHSTXXSDYXKFKLTSGKFYGDKAKDAGIQTSQDAKFYAISSPIASXXSXEXXXLVLQFSVKHXXXXXXGXGXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXKXEPRCEXDTLSHTYXAXXXXDXXXEVLVDQVKKESGTLEEDWEILKPKTIPDPEDKKPADWVDEPDMVDPEDKKPEDWDKEPAQIPDPDATQPDDWDEEEDGKWEAPMISNPKYKGEWKAKKIPNPAYKGVWKPRDIPNPEYEADDKVXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXFYDQTNGATKDAEKKAFDSAEADKRKKEEDERKKQEEEEKKTAEEDEXXXDEXXXEDDKKDEL
>HSP47_RAT__P29457 47 kDa heat shock protein precursor (Collagen-binding protein 1) (GP46).
XRSLXXXXXXXXXXXXXXXXXXXXEAAAPGTAEKLSSKATTLAEXSXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXKXXXXSQAKAVLSAEKLRDEEVHTGLGELVRSLSNXTARNVTWKLGSXXXXXXXXSFADDFVRSSKQHYNCEHSKINFRDKRSALQSINEWASQTTDGKLPEVTKDVERTDXXLLXXAMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXYXYXXXXXXXXQXVEMXXXXXXXXXXXXXXXXXXXXXRLEKXXTKXQLKTWMGKMQKKAXXISLPXGVVXVTHDLQKXXAGLGLTEAIXKNKADLSXXSGXXXXXXXXXXXXXXXEWDTEGNPFDQDIYGRXXXRSXXXXXXXXXXXXXXXXXXXXXXXXIGRLXXXXGDKMRDEL
>ENPL_PIG__Q29092 Endoplasmin precursor (94 kDa glucose-regulated protein) (GRP94) (GP96 homolog) (98 kDa protein kinase) (PPK 98) (ppk98).
XRAXXXXXXXXXXXXXXXXXXXXEVDVDGTVEEDLGKSREGSRTDDEVVQREEEAIQLDGLNASQIRELREKSEKXAFXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXELTVKXKCDKEKNLLHVTDTGVGMTREELVKNLGTIAKSGTSEFLNKMAEAQEDGQSTSELIGXXXXXXXXXXXXXXXXXVTXXHNNDTQHIWESDSNEFSVIADPRGNTLGRGTTITLXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXSXKTETVEXPMXXXXAAKXEKEESDDEAAVXXXXXEKXPXTXKVEKTVWDWELMNDIKPIWQRPSKEVEDDEYKAFYKSFSXXXXXPMAYIHFTXXXXXXXXXILXXXXXXXXXLFDEYXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXLNVSREXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXGVIXDHXXXXRLAKLLRFQSSHHPSDITSLDQYVERMKEKQDKIYFMAGSSRKEAESSPFVEXXXXXXXXXXXXXXXXXXXXXQALPXXXXKRFQNVAKEGVKFDESEKSKENREAVEKEFEPLLNWMKDKALXDKIEKAVVSQRXXEXXXXLVASQYGWXGNXERIMKAQAYQTGKDISTNYYASQKKTFEINPRHPLIRDMLRRVKEDEDDKXXSDLXXXXXXXXXXXXXXLLPDTKAYXXRIERMLRLSLNIDPDAKVEXXPXXXPXXTTEDTTEDTEQDDDEEMDAGADEXXQXTSETSTAEKDEL
>CRTC_HUMAN__P27797 Calreticulin precursor (CRP55) (Calregulin) (HACBP) (ERp60) (grp60).
XLXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXTSXXIESXXXSDFXXFVLSSGKFYGDEEKDKGLQTSQDARFYALSASFEXXSXXXXXLVVXFXXKHXXXXXXGGGYVKLFPNSLDQTDMHGDSEYNIMFGPDIXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXTYEVKIDNSQVESGSLEDDWDFLPPKKIKDPDASKPEDWDERAKIDDPTDSKPEDWDKPEHIPDPDAKKPEDWDEEMDGEWEPPVIQNPEYKGEWKPRQIDNPDYKGTWIHPEIDNPEYSPDPSIYXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXAYAEEFGNETWGVTKAAEKQMKDKQDEEQRLKEEEEDKKRKEXXXAXDKEDXEXKXEDXXDXXDKXXDXXEDVPGQAKDEL


I want to sum the weights, which is easy enough by hand for this example with only about 12 sequences. However, some of the files have a couple of hundred entries. Enter Bash:

 # field 1 is the ">WEIGHTS" label, so the loop starts at field 2
 head -n 1 filename.txt | awk '{for (i=2; i<=NF; i++) s+=$i} END {print s}'

This gives me the sum of the weights! Woohoo.
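
For the files with a couple of hundred entries, it can help to print the number of weights next to the sum as a sanity check, and to keep more decimal places than awk's default output format shows. Here is a minimal sketch of that idea. The function name `sum_weights` is made up for illustration, and it assumes the weights sit on a line starting with `>WEIGHTS`, as in the example above.

```shell
# sum_weights is a hypothetical helper name, not something from a library.
# Assumption: the weights are on the first line that starts with ">WEIGHTS".
sum_weights() {
    awk '/^>WEIGHTS/ {
        s = 0
        for (i = 2; i <= NF; i++) s += $i   # field 1 is the ">WEIGHTS" label
        printf "%d weights, sum = %.6f\n", NF - 1, s
        exit                                # stop reading after the weights line
    }' "$1"
}

# Usage: sum_weights filename.txt
```

Run on the example file above, this should print `12 weights, sum = 11.638292`. Matching on the `>WEIGHTS` label instead of using `head -n 1` means it still works if the weights line is not the very first line of the file.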